I’m a big fan of O'Reilly Media author Mike Loukides. Loukides’ substantive writings are clearly distinguished among those from the now-too-many big data and analytics talking heads. I don’t think I’ve seen a better article on the foundations of data science than his seminal piece from a few years back.
So I was eager to read his latest articles on data skepticism. Skepticism is generally considered among the most important qualities of a data scientist, mentioned in the same breath as deep skills in data programming/wrangling, statistics/machine learning/optimization, visualization/infographics, business acumen and off-the-charts curiosity.
So just what is data skepticism? And why does it seem to have become even more prominent following the publication of the best-seller, “Big Data: A Revolution That Will Transform How We Live, Work and Think”?
Many data science practitioners, myself and Loukides included, are uncomfortable with the central big data “revolution” tenet that, with big data, correlation is as good as causation. The scientific method that most data scientists are trained in posits that causation carries a much heavier burden of proof than correlation. If two variables, A and B, are correlated, A may cause B, B may cause A, both A and B might be caused by a third variable C, or the relationship might be a statistical fluke. In short, data skeptics just don’t buy that correlation is sufficient.
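To make the third-variable case concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the data are synthetic, C drives both A and B with a coefficient of 1, and the names `pearson` and `simulate_confounding` are hypothetical helpers, not anything from Loukides or the book.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def simulate_confounding(n=5000, seed=0):
    """A and B never influence each other; both are driven by C."""
    rng = random.Random(seed)
    c = [rng.gauss(0, 1) for _ in range(n)]
    a = [ci + rng.gauss(0, 1) for ci in c]  # A = C + independent noise
    b = [ci + rng.gauss(0, 1) for ci in c]  # B = C + independent noise
    # Remove C's contribution (its coefficient is known to be 1 in this toy setup).
    a_resid = [ai - ci for ai, ci in zip(a, c)]
    b_resid = [bi - ci for bi, ci in zip(b, c)]
    return pearson(a, b), pearson(a_resid, b_resid)

raw_r, adjusted_r = simulate_confounding()
print(f"correlation of A and B:            {raw_r:.2f}")
print(f"correlation after accounting for C: {adjusted_r:.2f}")
```

In this toy setup A and B show a sizable correlation (about 0.5 in expectation) even though neither causes the other, and the correlation all but disappears once C is accounted for. In real data, of course, C is often unmeasured, which is exactly the skeptic's worry.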
Though causation is impossible to prove, it’s the data scientist’s responsibility to “build a story around the data” to make the case. An interesting practical implication of the correlation-causation dilemma, noted by industry luminary Cathy O’Neil, is that putting machine learning techniques in the hands of analytics novices may be dangerous, since the newbies might be too credulous about the findings, drawing inferences from spurious correlations that can lead a business astray. I bet many new easy-out big data vendors hope that thought doesn’t gain traction!
My take on data skepticism? It’s a relentless methodological attack on a proposed theory. The null hypothesis is that the theory is incorrect, and it’s the skeptical data scientist’s responsibility to “prove” the naysayers wrong. Her charter is to systematically eliminate competing theories consistent with the data. The main tools at the skeptic’s disposal are experimental and quasi-experimental research designs; a highly recommended reference is “Experimental and Quasi-Experimental Designs for Generalized Causal Inference,” by Shadish, Cook and Campbell.
Clever experimental and quasi-experimental designs can go a long way toward mitigating sources of bias and threats to validity in analytics investigations. As an illustration, consider a correlational assessment following an organizational intervention. If X represents the intervention and O the subsequent observational measurement, then this simple schema would look like the following:
X O
Unfortunately, this design is fraught with problems, not the least of which is the inability to assess the “counterfactual” of no intervention (~X). It’s almost impossible to interpret the O’s without a basis of comparison such as:
X O
~X O
or at least:
O X O
Since analytics rarely observes both factuals and counterfactuals with the same “subjects,” the randomized experiment makes for a convenient proxy – the treatment groups representing the factuals, the controls the counterfactuals. Properly executed, random assignment to “treatment” and control can ensure that groups are initially “similar” on outside factors, within the limits of probability theory. Because potentially biasing variables should be equal among the experimental groups, the skeptic can then be more comfortable imputing cause and effect when significant measurement differences surface between treatment and control:
R X O
R ~X O
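The logic of the randomized design can be sketched with a small simulation. This is only an illustration under stated assumptions: the data are synthetic, the "outside factor" and the true effect of 2.0 are made up, and `run_randomized_experiment` is a hypothetical helper name.

```python
import random
from statistics import mean

def run_randomized_experiment(n=20000, true_effect=2.0, seed=1):
    """Simulate the R X O / R ~X O design on hypothetical data."""
    rng = random.Random(seed)
    # An outside factor the analyst never observes (hypothetical).
    baseline = [rng.gauss(50, 10) for _ in range(n)]
    # R: a coin flip decides who receives the intervention X.
    treated = [rng.random() < 0.5 for _ in range(n)]
    # O: outcome = outside factor + effect of X (if treated) + noise.
    outcome = [
        b + (true_effect if t else 0.0) + rng.gauss(0, 5)
        for b, t in zip(baseline, treated)
    ]

    def group_mean(values, want_treated):
        return mean(v for v, t in zip(values, treated) if t == want_treated)

    # How different are the groups on the outside factor?
    imbalance = group_mean(baseline, True) - group_mean(baseline, False)
    # Difference in mean outcomes between treatment and control.
    estimate = group_mean(outcome, True) - group_mean(outcome, False)
    return imbalance, estimate

imbalance, estimate = run_randomized_experiment()
print(f"baseline imbalance between groups: {imbalance:.2f}")
print(f"estimated treatment effect: {estimate:.2f} (simulated truth: 2.0)")
```

Because assignment ignores the outside factor, the two groups end up nearly balanced on it, and the simple difference in mean outcomes lands close to the simulated effect. With small samples the balance is looser, which is the "within the limits of probability theory" caveat.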
The business world is increasingly deploying randomized experiments to divine new ways of optimizing operations. Users of Google, Amazon and Yahoo! are routinely (and unwittingly) guinea pigs for new website feature testing. Capital One and Harrah’s are also obsessed with experimentation to drive product marketing. Even economists have embraced the experimental method!
Where randomized experiments are impractical, the data skeptic has other prospective arrows in her methodological quiver. OpenBI has just embarked on an outcomes evaluation engagement with a health care customer. Our “optimal” design would be an experiment where customers are randomly assigned to treatment and control. Alas, that isn’t feasible, but we have been able to identify a comparison group, albeit a non-randomized one, and collect multiple measurements both pre- and post-intervention. The idealized “time series with non-equivalent control group design” (TSNECG) is schematized as follows.
O O O X O O O O O
O O O ~X O O O O O
While not as skeptic-friendly as the experiment we desired, the TSNECG does help evaluate competing hypotheses. The “pre” observations of both groups will calibrate measurement for the post. If the post measurements of the treatment group have a different trajectory than the pre and also appear distinct from the “control,” the prudent data scientist can at least entertain the thought that the treatment had an impact.
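One simple way to read a TSNECG is a difference-in-differences comparison: how much more did the treatment group change from pre to post than the comparison group did? The sketch below uses that estimator as a stand-in, not as the method used in the engagement described above. The series are invented purely for illustration, and `diff_in_diff` is a hypothetical helper.

```python
from statistics import mean

def diff_in_diff(treatment, control, n_pre):
    """(post - pre change in treatment) minus (post - pre change in control)."""
    treat_change = mean(treatment[n_pre:]) - mean(treatment[:n_pre])
    control_change = mean(control[n_pre:]) - mean(control[:n_pre])
    return treat_change - control_change

# Hypothetical series matching the schema: 3 pre and 5 post observations.
# Both groups drift upward by 1 per period; the treatment group also
# jumps by 3 at the intervention.
treatment_series = [10.0, 11.0, 12.0, 16.0, 17.0, 18.0, 19.0, 20.0]
control_series = [8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0]

effect = diff_in_diff(treatment_series, control_series, n_pre=3)
print(f"difference-in-differences estimate: {effect}")
```

Here the treatment group's raw pre-to-post change is 7, but the comparison group changed by 4 over the same span, leaving an estimate of 3.0. The skeptic's caveat still applies: the estimate is only as credible as the assumption that the two groups would have followed the same trend without the intervention, which the multiple “pre” observations help to check.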
Skepticism is indeed at the core of data science. And it’s incumbent on DS practitioners to take seriously designs and methods to make skeptics comfortable moving from correlation to causation.