Big Data Caveats, Front and Center
I guess I shouldn’t have been surprised by Nassim Nicholas Taleb’s recent Wired article “Beware the Big Errors of ‘Big Data’.” Since 2004, the derivatives trader turned philosopher has published a trilogy of highly entertaining and provocative books, “Fooled by Randomness: The Hidden Role of Chance in Life and in the Markets,” “The Black Swan: The Impact of the Highly Improbable,” and “Antifragile: Things That Gain from Disorder,” on the perils of modeling and prediction in today’s world.
In “Fooled by Randomness,” Taleb assails the financial services community for its dumb luck, its hubris and its reckless conduct. The book’s point of departure is that the human brain sees the world as less random, and therefore better behaved, than it actually is. We often mistake pure luck for skill, elevating lucky fools to guru status. We’re wired for certainty, determinism and causality, even when they don’t exist. We think linearly, continuously and symmetrically, elevating the bell curve to religious status.
“The Black Swan” addresses what often is the folly of trying to predict the future from the past. Taleb argues that truly momentous events such as September 11 and the 2008 real estate crash are unpredictable – and that trying to explain them is a fool’s errand. Progress is non-linear and “un” bell-shaped.
Black swans, those events that are outliers, carry extreme impact, and are not predicted – but nevertheless explained post facto – are of much higher importance than we’d like to think. Taleb contrasts two types of randomness: the utopian Mediocristan, which is close to equality and behaves according to the bell curve with continuous progression; and the winner-take-all Extremistan, with extensive skewness in population values and progression in jumps. Mediocristan, typified by a population height distribution or IQ, is impervious to black swans. Extremistan, illustrated by population wealth distributions or company size, is vulnerable to black swans. Alas, our lives are much influenced by Extremistan – just as they are by randomness.
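To make the Mediocristan/Extremistan contrast concrete, here is a minimal sketch of my own, not anything taken from Taleb’s books. It assumes heights follow a Gaussian with mean 170 cm and standard deviation 10 cm, and that wealth follows a Pareto distribution with shape 1.1; both parameter choices, and the helper names `top_share` and `simulate`, are illustrative assumptions. The sketch asks how much of the population total is owed to the single largest observation.

```python
import random


def top_share(values):
    """Fraction of the total contributed by the single largest observation."""
    return max(values) / sum(values)


def simulate(n=10_000, seed=42):
    """Return (height_share, wealth_share) for n simulated individuals."""
    rng = random.Random(seed)
    # "Mediocristan": heights in cm, Gaussian with assumed mean 170, sd 10.
    heights = [rng.gauss(170, 10) for _ in range(n)]
    # "Extremistan": wealth from a Pareto law with assumed shape 1.1
    # (a heavy tail; the shape value is an illustrative choice).
    wealth = [rng.paretovariate(1.1) for _ in range(n)]
    return top_share(heights), top_share(wealth)


if __name__ == "__main__":
    height_share, wealth_share = simulate()
    print(f"tallest person's share of total height: {height_share:.6f}")
    print(f"richest person's share of total wealth: {wealth_share:.6f}")
```

Under these assumptions, the tallest person barely moves the total, while the richest can account for a visible slice of all wealth. That is the sense in which one observation can dominate in Extremistan.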
If “Fooled by Randomness” and “The Black Swan” articulate the problem, “Antifragile” proposes a solution. Starting with the taxonomy “Triad” of Fragile, Robust and Antifragile, Taleb argues that the Fragile demands tranquility while the Antifragile thrives on disorder. In an Antifragile world, randomness and uncertainty are central and welcome, mistakes small and benign. From a scientific lens, Fragile is directed research; Antifragile is stochastic tinkering. Indeed, Antifragile benefits from random events and prospers from experiments – “loves mistakes” – in lieu of formal education. If you accept Taleb’s premise that black swans dominate history, “as a consequence, we don’t quite know what’s going on, particularly under severe nonlinearities; so we can get to practical business right away.” In short, Fragile embraces top-down planning and prediction, while Antifragile gravitates to bottom-up, experimental learning.
Given that backdrop, Taleb’s misgivings about big data and analytics aren’t at all surprising: “We’re more fooled by noise than ever before, and it’s because of a nasty phenomenon called ‘big data.’ With big data, researchers have brought cherry-picking to an industrial level … Modernity provides too many variables, but too little data per variable. So the spurious relationships grow much, much faster than real information … In other words: Big data may mean more information, but it also means more false information … In observational studies, statistical relationships are examined on the researcher’s computer. In double-blind cohort experiments, however, information is extracted in a way that mimics real life … This is not all bad news though: If such studies cannot be used to confirm, they can be effectively used to debunk – to tell us what’s wrong with a theory, not whether a theory is right.”
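The “too many variables, but too little data per variable” point is easy to see in a small simulation. What follows is a hedged sketch of my own rather than Taleb’s analysis: every variable is pure Gaussian noise by construction, so any correlation with the target is spurious. The function names `pearson` and `count_spurious`, the 20 observations, and the |r| > 0.5 cutoff are all assumptions chosen for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def count_spurious(n_obs, n_vars, threshold=0.5, seed=0):
    """Count noise variables whose |r| with a noise target exceeds threshold."""
    rng = random.Random(seed)
    # The target is random noise, unrelated to every candidate variable.
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    hits = 0
    for _ in range(n_vars):
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        if abs(pearson(candidate, target)) > threshold:
            hits += 1
    return hits


if __name__ == "__main__":
    for n_vars in (20, 200, 2000):
        found = count_spurious(n_obs=20, n_vars=n_vars)
        print(f"{n_vars:5d} noise variables -> {found} 'strong' correlations")
```

With the number of observations held fixed, the count of impressive-looking but meaningless correlations should climb as the data set gets wider, which is the cherry-picking hazard the quote describes.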
Taleb is nothing if not a lightning rod, dismissed by many as a pompous doom-and-gloomer. Don’t count me among the naysayers. In fact, I’m comforted by Taleb’s attacks on our tidy notions of predictability, linearity, cause-and-effect thinking, risk management and top-down planning, attacks that bring to mind the mistaken “common sense” that’s often the topic of this blog.
Yes, he is somewhat pretentious – “But academics (particularly in social science) seem to distrust each other; they live in petty obsessions, envy and icy-cold hatreds, with small snubs developing into grudges, fossilized over time in the loneliness of the transaction with a computer screen and the immutability of their environment. Not to mention a level of envy I have almost never seen in business…” – but Taleb has a lot of important things to say, and recent history has validated his concerns.
Taleb’s big data caveats should be front and center in data science. He’s spot on when he cautions about the dangers of spurious relationships spawned by wide observational data. And he’s also right that it takes strong designs like the randomized experiment to protect big data analytics from methodological assault. Analysts could do a lot worse than adopt Taleb’s conservative credo that non-experimental data is best deployed to discredit rather than confirm business theories. I would just add that observational data can also be explored to propose new hypotheses for subsequent experimental testing.