If you work in the analytics world, you’ve probably read or at least heard of the seminal book "Big Data: A Revolution That Will Transform How We Live, Work and Think," published earlier this year. The provocative point of departure for authors Viktor Mayer-Schönberger and Kenneth Cukier is that the new data norm of N=all, plus a tolerance for simple correlation over causation, is changing the analytics landscape, obviating the need for much of traditional statistical analysis.
"Not so fast" has been the response of many. Harvard professor and big data advocate Gary King argues that success turns as much on analytical methods and research design as on the data: algorithms and methodology are just as important as data quantity. As King puts it, “The trick is to make yourself vulnerable to being proven wrong as many times as possible.”
And so a debate has emerged in the industry about the relative importance of ever-bigger data versus ever-better predictive models. My sense is that the work world generally pines for more data, while academia, consumed with causal theories, looks more to research design and the latest algorithms.
Me? I’m conciliatory: both bigger data and better models -- embellished by clever designs -- can have a significant impact on the success of analytics initiatives.
For data, “big” should be interpreted both in terms of the number of cases, “N,” and the number of features or attributes, “p.” In my experience, one of the quickest ways to enhance the explanatory power of predictive models is often to find additional attributes that link to the core data. The new attributes can help significantly, especially if they’re correlated with the responses of interest yet unrelated to the original features.
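As a minimal illustration on synthetic data (the data-generating process and numbers here are invented for the sketch), a new attribute that correlates with the response but not with the existing feature lifts the R² of a least-squares fit substantially:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Core attribute, plus a new attribute that is correlated with the
# response but unrelated (independent) to the original feature.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                       # independent of x1
y = 1.0 * x1 + 1.5 * x2 + rng.normal(size=n)  # both drive the response

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit with intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_core = r_squared(x1.reshape(-1, 1), y)          # core feature only
r2_aug = r_squared(np.column_stack([x1, x2]), y)   # augmented
print(round(r2_core, 2), round(r2_aug, 2))
```

Because x2 is orthogonal to x1, its contribution is almost pure gain: it explains variance the core feature never could.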
Increasing the sample size N can also be quite helpful. Larger N reduces the sampling variation of model estimates, potentially increasing the signal-to-noise ratio.
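A quick simulation makes the point concrete (synthetic standard-normal data; the `se_of_mean` helper is just for this sketch): the sampling variation of a simple estimate like the mean shrinks roughly as 1/sqrt(N), so 100x the data cuts the noise by about a factor of 10.

```python
import numpy as np

rng = np.random.default_rng(1)

def se_of_mean(n, reps=2000):
    """Empirical standard error of the sample mean at sample size n,
    estimated by drawing many independent samples."""
    samples = rng.normal(loc=0.0, scale=1.0, size=(reps, n))
    return samples.mean(axis=1).std()

se_small = se_of_mean(100)
se_big = se_of_mean(10_000)
# Theory: standard error scales as 1/sqrt(N), so the ratio should be ~10.
print(se_small / se_big)
```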
And larger N can, paradoxically, drive more efficient computation. With a big data set, it’s feasible to randomly partition the cases into train, tune and test components for model training and testing. With smaller N, in contrast, the analyst must use techniques such as cross-validation and the bootstrap to protect against overfitting. In tandem with modeling algorithms that themselves resample, these small-data techniques can be quite computationally expensive.
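The two regimes can be sketched in a few lines (the helper names and split fractions are my own choices, not any particular library’s): a one-shot random partition for big N, versus a k-fold cross-validation generator whose k repeated refits hint at the extra cost when N is small.

```python
import random

def train_tune_test_split(indices, fracs=(0.6, 0.2, 0.2), seed=42):
    """One-shot random partition into train/tune/test -- cheap when N is large."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    n_train = int(fracs[0] * len(idx))
    n_tune = int(fracs[1] * len(idx))
    return idx[:n_train], idx[n_train:n_train + n_tune], idx[n_train + n_tune:]

def k_fold(indices, k=10, seed=42):
    """k-fold cross-validation splits -- the model must be refit k times,
    which is why resampling gets expensive on small data."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        rest = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield rest, held_out

train, tune, test = train_tune_test_split(range(100_000))
print(len(train), len(tune), len(test))
```

With 100,000 cases the single 60/20/20 partition supports fitting once and validating honestly; with a few hundred cases, the analyst would instead loop over the k folds and refit each time.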
Larger N is not always a panacea, however. The data generation designs of many analytics endeavors are inherently biased, so more data can in some cases simply deliver a more confident estimate of a biased quantity. Surveys by analytics vendors that purport to show demand for their products and services are illustrations, as are studies of the analytics marketplace by consultancies paid by the very vendors featured in the research. In these instances, a clear selection bias clouds interpretation of the findings. At a minimum, the authors of such “studies” have a responsibility to demonstrate that their respondents are representative of the market they purport to summarize.
On the modeling side, statistical learning techniques -- spline-based, regularized and resampling-driven algorithms -- routinely predict far more accurately on test data than regression with linear inputs does. Those differences are highlighted in an excellent tutorial on The Evolution of Regression Modeling by Salford Systems. Indeed, I’m now more a data-driven than a theory-based modeler. I wish I could say the same for many opinionated macroeconomic policy wonks!
Predictive models can also be used to support the design of data generation ex post facto. Consider comparing, over time, the responses of “treatment” and “control” groups that weren’t generated randomly. Without randomization, there’s always the specter that the groups systematically differ on one or more attributes out of the gate, and that it’s those attributes, rather than the ones of interest, that are impacting the response. Techniques such as propensity modeling have evolved over the years to statistically “equate” the non-randomized groups on potentially confounding variables so that comparisons of treatment and control responses carry minimal bias.
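A compact sketch of the idea on synthetic data (inverse-propensity weighting stands in here for the broader family of propensity methods, and every parameter is invented for the example): a confounder drives both treatment assignment and the response, so the naive group comparison is badly biased, while reweighting by a fitted propensity model largely recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

# A confounder x drives both treatment assignment and the response,
# so a naive treated-vs-control comparison is biased.
x = rng.normal(size=n)
p_treat = 1 / (1 + np.exp(-(x - 0.5)))   # treated units skew toward high x
t = rng.binomial(1, p_treat)
tau = 1.0                                # true treatment effect
y = 2.0 * x + tau * t + rng.normal(size=n)

# Fit a propensity model P(treated | x) by logistic regression
# (plain gradient ascent on the log-likelihood, to stay dependency-free).
X = np.column_stack([np.ones(n), x])
w = np.zeros(2)
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ w))
    w += 0.1 * X.T @ (t - p) / n
p_hat = 1 / (1 + np.exp(-X @ w))

naive = y[t == 1].mean() - y[t == 0].mean()
# Inverse-propensity weighting statistically "equates" the two groups
# on the confounder before comparing responses.
ipw = np.mean(t * y / p_hat) - np.mean((1 - t) * y / (1 - p_hat))
print(round(naive, 2), round(ipw, 2))
```

The naive estimate absorbs the confounder’s effect on top of the true treatment effect of 1.0; the weighted estimate lands close to the truth.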
All told, I believe both more data and better algorithms, used judiciously, can be a boon for analytics. But I reject the notion that in our big data world, correlation is as good as causation. For me, the design of data collection to demonstrate cause and effect is every bit as important as N and algorithms -- and must be central to any methodology.
Randomization to treatment/control is the gold standard. Even without randomization, however, there’s plenty that can be done to minimize bias. A time series design with non-equivalent treatment and control groups can be quite effective, especially when coupled with statistical adjustment techniques to minimize bias.
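A minimal difference-in-differences sketch, one common adjustment for exactly this design (synthetic data, invented parameters): the pre-period comparison absorbs the non-equivalent baselines, so differencing the two group-level changes isolates the treatment effect even though the groups start at different levels.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000

# Non-equivalent groups: the treatment group starts at a higher
# baseline, so a simple post-period comparison is biased.
base_treat, base_ctrl = 5.0, 3.0
trend, tau = 0.5, 1.0               # shared time trend, true effect

pre_t = base_treat + rng.normal(size=n)
pre_c = base_ctrl + rng.normal(size=n)
post_t = base_treat + trend + tau + rng.normal(size=n)
post_c = base_ctrl + trend + rng.normal(size=n)

naive = post_t.mean() - post_c.mean()   # inherits the 2.0 baseline gap
# Difference-in-differences: change in treatment group minus
# change in control group cancels both baselines and the shared trend.
did = (post_t.mean() - pre_t.mean()) - (post_c.mean() - pre_c.mean())
print(round(naive, 2), round(did, 2))
```

The key assumption, of course, is that absent treatment the two groups would have followed parallel trends -- which is where the statistical adjustment techniques mentioned above earn their keep.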
Bigger Data, Better Models and Better Data Collection Designs.