Predictive Trend? Or Drunkard's Walk?
One of my colleagues recently asked me if I'd read “The Drunkard's Walk: How Randomness Rules Our Lives,” by Leonard Mlodinow.
Not only had I read the book a few years back, I responded, I loved it and subsequently wrote a couple of blog posts on its content. In fact, I believe “The Drunkard's Walk” is one of the most important books out there for BI and analytics.
In addition to regaling readers with an entertaining look at probability and statistics through the ages, Mlodinow wants us to appreciate the randomness that rules our lives. Before searching for elaborate ex-post explanations of phenomena, he warns, look for simple random explanations first. The null hypothesis should be randomness rather than predictable behavior.
Indeed, Mlodinow might be considered a predictive analytics contrarian, seeing randomness where modelers see patterns. “In the scientific study of random processes, the drunkard's walk is the archetype … we're continually nudged in this direction and then that one by random events. As a result, although statistical regularities can be found in social data, the future of particular individuals is impossible to predict.”
His explanation for life's twists and turns is more subtle than the smoking gun analysts often search for: “in all except the simplest real-life endeavors, unforeseeable or unpredictable forces cannot be avoided, and moreover those random forces and our reactions to them account for much of what constitutes our particular path in life ... the future is really chaotic and unpredictable.”
A drunkard's walk describes random motion akin to an inebriate stumbling home from the tavern. Each new step is independent of those that preceded. Statisticians often use the phrase random walk to describe this movement, where tomorrow's position is simply today's plus a random shock.
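To make the definition concrete, here is a minimal sketch of a random walk in Python using only the standard library. The function name `drunkards_walk`, the standard normal shocks, and the seed are illustrative assumptions, not details taken from the book or from the simulation discussed later.

```python
import random
from itertools import accumulate


def drunkards_walk(n_steps, seed=None):
    """Return the positions of a random walk.

    Each position is the previous position plus an independent
    random shock (assumed here to be standard normal).
    """
    rng = random.Random(seed)
    shocks = [rng.gauss(0.0, 1.0) for _ in range(n_steps)]
    # The running sum of the shocks gives the walker's position over time.
    return list(accumulate(shocks))


walk = drunkards_walk(2000, seed=1)

# Differencing the positions recovers the individual steps.
steps = [b - a for a, b in zip(walk, walk[1:])]
print(f"final position: {walk[-1]:.2f}")
print(f"average step:   {sum(steps) / len(steps):.3f}")
```

The positions can wander far from zero even though the average step stays close to zero, which is what makes the path look like a trend.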
Yet for all the randomness of the drunkard's walk, patterns may manifest in data that on the surface appear very predictable, even when they're not. And it's incumbent on predictive analysts to separate that noise from a real signal. While it's certainly bad not to see a pattern when there in fact is one (type II error), an even bigger sin for statisticians is to declare a pattern that's really just noise (type I error). An important message of “The Drunkard's Walk” is that we must be particularly attentive to type I error, because randomness can easily fool us as we look for patterns that don't exist.
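One way to get a feel for how easily type I errors arise is to simulate many pure random walks and run a naive trend test on each. The sketch below is an illustration under stated assumptions: the helper names, the 200 walks of 200 steps, the standard normal shocks, and the 1.96 cutoff for a nominal 5% test are all choices made here, not anything from the book.

```python
import math
import random


def random_walk(n, rng):
    """Positions of a random walk with standard normal shocks."""
    pos, path = 0.0, []
    for _ in range(n):
        pos += rng.gauss(0.0, 1.0)
        path.append(pos)
    return path


def trend_t_stat(y):
    """t-statistic of the slope from an OLS regression of y on time 0..n-1."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    ssr = sum((v - intercept - slope * t) ** 2 for t, v in enumerate(y))
    std_err = math.sqrt(ssr / (n - 2) / sxx)
    return slope / std_err


rng = random.Random(7)
n_walks, n_steps = 200, 200

# Count walks whose "trend" looks significant at the nominal 5% level,
# even though every series is pure noise by construction.
hits = sum(
    abs(trend_t_stat(random_walk(n_steps, rng))) > 1.96
    for _ in range(n_walks)
)
false_alarm_rate = hits / n_walks
print(f"share of random walks with a 'significant' trend: {false_alarm_rate:.0%}")
```

In runs like this the share tends to come out far above the nominal 5%, because the textbook test assumes independent errors and a random walk violates that assumption badly.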
Consider the 1,000 time series points in Figure 1.
These certainly appear to be increasing, even with spiky variation and non-linearity. My first inclination would be to fit a curve to these data.
And indeed, Figure 2 shows the results of an estimated cubic spline regression. For starters, the model appears to do a pretty decent job with the data.
Alas, now look at Figure 3, which shows the same data and curve as Figure 2 with the addition of the next 1,000 observations from the data set. What a difference these observations make! The pattern of Figure 2 simply evaporates. And I'm hard-pressed to find any signal in the combined observations.
Nor should I. Where did the data originate? From an R-based simulation of a random, or drunkard's, walk. Predictive modelers beware. The drunk may walk straight for a period of time, giving the illusion of sobriety, just by chance. But he's still tipsy and will ultimately revert to stumbling. Just give him time.
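The experiment behind the figures can be approximated in a few lines. This is a rough sketch, not the original R simulation: it is written in Python, it fits a straight-line trend instead of the cubic spline regression, and the seed, shock distribution, and helper names are assumptions made for illustration.

```python
import math
import random


def fit_line(y, start=0):
    """OLS intercept and slope of y on time index start..start+n-1."""
    n = len(y)
    t_mean = start + (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((start + i - t_mean) ** 2 for i in range(n))
    sxy = sum((start + i - t_mean) * (v - y_mean) for i, v in enumerate(y))
    slope = sxy / sxx
    return y_mean - slope * t_mean, slope


def rmse(y, intercept, slope, start=0):
    """Root mean squared error of the line over time index start..start+n-1."""
    sq = sum((v - intercept - slope * (start + i)) ** 2 for i, v in enumerate(y))
    return math.sqrt(sq / len(y))


# Simulate 2,000 steps of a random walk with standard normal shocks.
rng = random.Random(11)
pos, walk = 0.0, []
for _ in range(2000):
    pos += rng.gauss(0.0, 1.0)
    walk.append(pos)

first, later = walk[:1000], walk[1000:]

# Fit the trend on the first 1,000 points only, then extrapolate it.
intercept, slope = fit_line(first)
rmse_in = rmse(first, intercept, slope, start=0)
rmse_out = rmse(later, intercept, slope, start=1000)

print(f"fitted slope:        {slope:.4f}")
print(f"in-sample RMSE:      {rmse_in:.2f}")
print(f"out-of-sample RMSE:  {rmse_out:.2f}")
```

The extrapolated error is typically much larger than the in-sample error, though any single seed can behave differently. Changing the seed and rerunning shows how often a convincing in-sample fit falls apart on the next 1,000 points.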