Two years ago, Wired editor in chief and Long Tail author Chris Anderson wrote a provocative article entitled The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. The article's message was that in the era of petabyte-scale data, the traditional scientific method of hypothesize, model, test is becoming obsolete, the victim of the combination of huge data volumes and the computer capacity to process them. “There is now a better way. Petabytes allow us to say: ‘Correlation is enough.’ We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.” Not content simply to take on the scientific establishment, Anderson seemed to go after mainstream statistics as well: “At the petabyte scale, information is not a matter of simple three- and four-dimensional taxonomy and order but of dimensionally agnostic statistics.”
The university types I know – mostly statisticians, social scientists and business professors – reacted viscerally to Anderson's words. “Credulous” was the mildest rebuke in a line that included “bogus” and “bull----”. Three sample rejoinders: 1) Algorithms are useless on their own. In any data analysis, whether it is formal analysis or exploratory investigation, a key issue is how widely the results apply. Algorithms give no indication how or if results might generalize. 2) The neural net algorithms from psychology that were first promoted as a modeling panacea are now acknowledged as more limited. 3) Look even at the small yield of the massive data mining taking place in the national intelligence community (e.g., by massive screening of unselected phone call and e-mail data with dubious support from the Constitution). My bet is that “old style” human intelligence is far from obsolete in national intelligence.
At first I thought the divide might simply be business versus academia. But then I came across the late UC Berkeley statistician Leo Breiman's Statistical Modeling: The Two Cultures, in which a renowned professor who had left the university for business consulting and later returned to it took on the statistical mainstream with words similar to Anderson's. Breiman's beef with statistics is that it has become consumed with mathematical models of how the world behaves that have “led to irrelevant theory, questionable conclusions, and have kept statisticians from working on a large range of interesting current problems.” Addressing the same algorithmic models referenced in The End of Theory, Breiman concludes: “If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.”
A developer of the widely used business analytical models Classification and Regression Trees (CART) and Random Forests, Breiman offered this practical wisdom to the statistical world: “As I left consulting to go back to the university, these were the perceptions I had about working with data to find answers to problems:
(a) Focus on finding a good solution – that’s what consultants get paid for.
(b) Live with the data before you plunge into modeling.
(c) Search for a model that gives a good solution, either algorithmic or data.
(d) Predictive accuracy on test sets is the criterion for how good the model is.
(e) Computers are an indispensable partner.”
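Breiman's point (d) is easy to try out. What follows is only a minimal sketch of my own, not anything from his paper: it assumes synthetic data with a deliberately curved relationship, a straight-line least-squares fit standing in for a "data model," and a k-nearest-neighbour average standing in for an "algorithmic model." The function names, the data and the choice of k are all illustrative assumptions. Each model is fit on one sample and judged only by its error on a separate test sample.

```python
import random

def make_data(n, rng):
    """Synthetic data (an assumption): y is a quadratic function of x plus noise."""
    xs = [rng.uniform(-3, 3) for _ in range(n)]
    ys = [x * x + rng.gauss(0, 0.5) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x, standing in for a 'data model'."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x

def fit_knn(xs, ys, k=5):
    """Average of the k nearest training outcomes, a simple 'algorithmic model'."""
    pairs = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k
    return predict

def mse(model, xs, ys):
    """Mean squared prediction error of a fitted model on the given sample."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)                    # fixed seed so the run is repeatable
train_x, train_y = make_data(300, rng)    # data used for fitting
test_x, test_y = make_data(200, rng)      # held-out data used only for scoring

line_mse = mse(fit_line(train_x, train_y), test_x, test_y)
knn_mse = mse(fit_knn(train_x, train_y), test_x, test_y)
print(f"straight-line test MSE: {line_mse:.2f}")
print(f"k-NN test MSE:          {knn_mse:.2f}")
```

Because the data were built to be curved, the straight line loses badly here. That outcome reflects the setup, not a general law; on data that really are linear, the ranking could easily reverse. The point is only that the test set, not the elegance of the model, does the judging.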
The more I thought about Anderson/Breiman versus mainstream statistics, the more I recognized that this was less a case of business practitioners versus academics than of the different uses of data analysis in science and business. Scientists are testing theories, looking for explanations of cause and effect, while business analysts are generally obsessed with prediction, content with correlation at the expense of causality. In the business world, models that offer superior predictions of fraud and churn are preferred over those that might better “explain” the behavior.
Academic statistician/social scientist Richard Berk attempts to reconcile the correlational and causal camps in his book Statistical Learning from a Regression Perspective. Berk proposes four distinct uses or “stories” for data analysis in the research and business worlds. First is a Causal Story, where an underlying statistical model is assumed and the purpose of modeling is causal explanation. Second is a Conditional Distribution Story that starts with an underlying model but without a causal perspective. Third is a Data Summary Story which unapologetically uses statistical learning algorithms in the absence of a model to find interesting relationships in data. Last is a Forecasting Story that looks to construct functions to make predictions from data. The causal and conditional distribution stories seem to be from traditional statistical thinking that has explanation as its goal, while the data summary and forecasting stories are more predictive and correlational – consistent with Anderson's perspective.
My take is that both the causal/explanatory and correlation/prediction vantage points are important for business analytics. I use statistical learning or algorithmic models more and more in my work, generally finding them better than regression for the large-scale problems I encounter. And with noisy observational data, these “models” often make superior predictions. At the same time, I like to have a solid design underlying the data generation process to inspire confidence in the validity of the findings. The more closely that design resembles a randomized experiment, the more comfortable I feel. Maybe I'm naïve, but I see no reason why clever analysts can't have the best of both correlational and causal worlds.
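To make concrete why a design close to a randomized experiment inspires confidence, here is a small simulation. It is a sketch under invented assumptions, not an analysis of real data: a hidden factor z influences both who receives a "treatment" and the outcome, and the true treatment effect is fixed at 1.0. The function name, the coefficients and the sample sizes are all hypothetical.

```python
import random

def mean_difference(n, randomized, rng, true_effect=1.0):
    """Treated-minus-untreated difference in mean outcome for n simulated cases."""
    treated, untreated = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)                  # hidden confounder
        if randomized:
            t = rng.random() < 0.5           # coin flip, independent of z
        else:
            t = z + rng.gauss(0, 1) > 0      # self-selection driven partly by z
        # The outcome depends on the treatment AND on the confounder.
        y = true_effect * t + 2.0 * z + rng.gauss(0, 1)
        (treated if t else untreated).append(y)
    return sum(treated) / len(treated) - sum(untreated) / len(untreated)

rng = random.Random(1)                       # fixed seed so the run is repeatable
observational = mean_difference(20000, randomized=False, rng=rng)
experimental = mean_difference(20000, randomized=True, rng=rng)
print(f"observational estimate: {observational:.2f}")
print(f"randomized estimate:    {experimental:.2f}")
print("true effect:            1.00")
```

In the observational setting the naive comparison overstates the effect, because treated cases also tend to have higher z. Under random assignment the same comparison lands near the true value. A purely predictive model could still forecast outcomes well from the observational data, which is exactly why prediction and explanation call for different safeguards.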
Steve Miller also blogs at Miller.OpenBI.com.