Several recent MIT Sloan Management Review Data & Analytics Blog articles call for restraint in anointing big data the cure-all for business ills, citing substantial biases that must first be overcome. In a delightful and highly informative O’Reilly Strata Conference presentation, MIT Media Lab’s Kate Crawford warns of “algorithmic illusions” with big data: “…Biases in data collection, both in how it’s prepared and cognitively; exclusions, or gaps, in data signals whereby some people are not represented by data; and the constant need for context in conclusions, whereby small data — asking people how and why, and not just how many — tells a better story than big data.” Her antidote: combine “big data together with small data — computational social science along with traditional qualitative methods.” Sounds a lot like Gary King’s proposed merger of qualitative and quantitative research in social science.
UNC professor Zeynep Tufekci piles on the big data bias wagon in her article, Big Data: Pitfalls, Methods and Concepts for an Emergent Field. She cites biasing concerns with using Twitter or Facebook for analytics similar to those biologists find with Drosophila flies: what’s most accessible is not necessarily what’s most representative. “Twitter is used by about 10% of the U.S. population, which is certainly far, far from a representative sample. While Facebook has a wider diffusion rate, its rates of use are structured by race, gender, class and other factors and are not representative. Using these sources as ‘big data’ model organisms raises important questions of representation and visibility as demographic or social groups may have different behavior — online and offline — and may not be fully represented or even sampled via current methods.” The non-representative bias is further exacerbated by “one-shot, single-method” data collection designs that provide “no way to assess, interpret or contextualize the findings”. In short, with big data, sample bias + design bias = big trouble.
A little over a year ago, I wrote a blog post about a thread, “Is data mining still a sin against the norms of econometrics?”, from the Advanced Business Analytics LinkedIn Discussion Group. The chatter was about a paper written by three economists examining the evolving perspective on data mining in their discipline.
Traditional econometrics revolves around testing man-made theories of social and economic phenomena. With this paradigm, the analyst specifies a model of the presumed generating process, assessing it with statistical techniques that generally resemble ordinary least squares regression. She then examines the significance level of model coefficients and interprets them accordingly.
Data mining or machine learning, in contrast, is often content to let the data talk directly, without the distraction of “theory” to test. Its algorithms are more flexible, its quality judged by the accuracy of predictions/classifications. In short, econometrics appears to be more concerned with explanation, while ML obsesses over discovery and prediction.
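To make the explanation-versus-prediction contrast concrete, here’s a small simulation of my own devising (in Python rather than R, and with made-up data — purely illustrative, not from either camp’s toolkit). The same least-squares fit is read two ways: the econometrician inspects the estimated coefficients; the ML practitioner scores held-out prediction error.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Simulated "truth": y = 2*x1 - 1*x2 + noise
y = 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.5, size=n)

# Econometric reading: estimate the specified model, interpret coefficients.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated coefficients (intercept, x1, x2):", np.round(beta, 2))

# ML reading: judge the same model purely by out-of-sample accuracy.
train, test = slice(0, 400), slice(400, n)
beta_tr, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
pred = X[test] @ beta_tr
rmse = np.sqrt(np.mean((y[test] - pred) ** 2))
print("held-out RMSE:", round(rmse, 2))
```

Both readings use the identical model; the disagreement is over what counts as success — interpretable, significant coefficients versus low prediction error on data the model never saw.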
Trained in orthodox statistics/econometrics, I’ve become much more enthusiastic in the last five years about statistical learning, which sits between classical statistical/econometric modeling and machine learning, offering benefits of each. I believe that, over time, hawkish economists/statisticians are softening their position on ML as well, to the point that it’s now “almost” acceptable to the mainstream. Advanced Business Analytics Discussion voices appear to affirm that view.
My assessment? I consider myself no less an analyst because I often let data do some “theorizing” for me. In fact, I find the biasing risk of omitting important explanatory variables and overlooking non-linear relationships more serious than the specter of “overfitting” from mining often cited by opponents. I can control overfitting by choice of algorithms, by partitioning data into test/train and by cross-validation. I cannot compensate for the biases introduced by sub-standard theorizing.
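The overfitting controls mentioned above — train/test partitioning and cross-validation — can be sketched in a few lines. This is my own toy illustration in Python (the books I recommend below work in R), fitting polynomials of increasing flexibility to simulated data; cross-validated error exposes both the underfit linear model and the overfit high-degree one.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(-1, 1, size=n)
# Simulated nonlinear "truth" plus noise
y = np.sin(3 * x) + rng.normal(scale=0.3, size=n)

def cv_rmse(degree, k=5):
    """k-fold cross-validated RMSE for a polynomial fit of the given degree."""
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coefs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coefs, x[test])
        errs.append(np.sqrt(np.mean((y[test] - pred) ** 2)))
    return float(np.mean(errs))

for d in (1, 3, 15):
    print(f"degree {d:2d}: CV RMSE = {cv_rmse(d):.3f}")
```

The linear fit misses the curvature; degree 3 tracks it well; the degree-15 fit chases noise in each training fold and pays for it on the held-out fold. That is the discipline that lets a data miner “theorize” from data without fooling herself.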
More disconcerting to me are the “designs” (or lack thereof) that surround the analytics, whatever their political persuasion. A single point-in-time collection of observational data offers little protection from competing hypotheses, regardless of whether traditional or ML models are applied. A randomized experiment, on the other hand, can, within the limits of probability, confront alternative hypotheses and confounding variables head on. Rigorous quasi-experimental designs, such as time-series panels with natural control groups, offer considerable bias protection for the skeptical analyst who’s unable to randomize.
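A quick simulation of my own (illustrative Python, invented numbers) shows why randomization earns that protection. When treatment uptake depends on an unobserved confounder, the naive observational comparison is badly biased; random assignment breaks the link between treatment and confounder, and the simple difference in means recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
confounder = rng.normal(size=n)   # unmeasured trait, e.g., prior motivation
true_effect = 1.0

# Observational "design": uptake depends on the confounder.
took = (confounder + rng.normal(size=n)) > 0
y_obs = true_effect * took + 2.0 * confounder + rng.normal(size=n)
naive = y_obs[took].mean() - y_obs[~took].mean()

# Randomized design: assignment ignores the confounder entirely.
assigned = rng.random(n) < 0.5
y_rct = true_effect * assigned + 2.0 * confounder + rng.normal(size=n)
rct = y_rct[assigned].mean() - y_rct[~assigned].mean()

print(f"naive observational estimate: {naive:.2f}")  # biased well above 1.0
print(f"randomized estimate:         {rct:.2f}")     # close to the true 1.0
```

No modeling sophistication applied to the observational arm can rescue it without either measuring the confounder or imposing exactly the kind of design structure — randomization or a strong quasi-experiment — discussed above.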
For initiates looking to begin engaging with statistical learning, I’d recommend a couple of excellent new books authored by R community statisticians. At the top of the list is Applied Predictive Modeling by Max Kuhn and Kjell Johnson. Kuhn is author of the much-lauded R caret package and a community luminary. Next is An Introduction to Statistical Learning with Applications in R, by Gareth James, Daniela Witten, Trevor Hastie and Rob Tibshirani; James and Witten are former students of Stanford SL leaders Hastie and Tibshirani. And an update to Vanderbilt Biostatistics Chairman and R leader Frank Harrell’s vanguard Regression Modeling Strategies should be released within a year. I’ll have more to say about these books in subsequent blogs.