The first perspective, “Data-Driven Science is a Failure of Imagination,” from r-bloggers.com, seems to have been written by a traditional statistician. The author takes issue with “data-driven scientists” who propose “that the best way towards scientific progress is to collect data, visualize them and analyze them.”
His big beef is that data science equates statistical science with “large N” data and a mindless search for patterns via visual exploration and machine learning. That approach, the author scoffs, “ignores the second component of statistics: hypothesis (here equivalent to model or theory),” adding, “There are two ways to define statistics and both require data as well as hypotheses: 1) Frequentist statistics makes probabilistic statements about data, given the hypothesis. 2) Bayesian statistics works the other way round: it makes probabilistic statements about the hypothesis, given the data.”
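To make that distinction concrete, here is a minimal Python sketch. The coin-flip setup (20 flips, 15 heads), the two candidate hypotheses and the equal priors are my own illustrative assumptions, not anything taken from the article. The first calculation makes a statement about the data given the hypothesis; the second makes a statement about the hypotheses given the data.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k heads in n flips of a coin with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Assumed toy data: 15 heads in 20 flips.
n_flips, n_heads = 20, 15

# Frequentist view: probability of data at least this extreme,
# GIVEN the hypothesis that the coin is fair (one-sided p-value).
p_value = sum(binom_pmf(k, n_flips, 0.5) for k in range(n_heads, n_flips + 1))

# Bayesian view: probability of each hypothesis GIVEN the data.
# Two hypothetical candidates with equal prior weight (an assumption).
priors = {"fair": 0.5, "biased": 0.5}
heads_prob = {"fair": 0.5, "biased": 0.75}
likelihood = {h: binom_pmf(n_heads, n_flips, heads_prob[h]) for h in priors}
evidence = sum(priors[h] * likelihood[h] for h in priors)
posterior = {h: priors[h] * likelihood[h] / evidence for h in priors}

print(f"P(15 or more heads | fair coin) = {p_value:.4f}")
for name, prob in posterior.items():
    print(f"P({name} | data) = {prob:.4f}")
```

Either way, the calculation cannot even start without a hypothesis to plug in, which is precisely the author's point.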
From this vantage point, the risks of over-emphasizing data at the expense of hypotheses are high. “We ignore the actual thinking and we end up with trivial or arbitrary statements, spurious relationships emerging by chance ... but with no real understanding. This is the ultimate and unfortunate fate of all data miners.”
The author’s cynical explanation for the evolution from top-down, hypothesis-driven statistical science to bottom-up, large N-driven data science? “An instinctive craving for plenty (richness), and by boyish tendency to have a ‘bigger’ toy ... data-driven science is less intellectually demanding than hypothesis-driven science. Data mining is sweet, anyone can do it ... (but) thinking about theory can be a pain and it requires a rare commodity: imagination.” Yikes. Pretty old school!
In contrast, the second article, “Big Data, Small Bets,” by business academic Robert Carraway, espouses a bottom-up approach to analytics that acknowledges advances in large N data management and statistical computation. Indeed, Carraway’s BDSB brings to mind the “Super Crunching” of Ian Ayres – a confluence of ubiquitous data, analytics and randomized experiments.
Like Ayres, Carraway is skeptical of big data and analytics in the absence of experimental confirmation, citing the risk of small sample bias – finding patterns where in fact there are none. “We are wired to believe that any apparent pattern must have an identifiable cause, and thus cannot be random. Hence, any time we see anything remotely resembling a pattern, we assume there must be something non-random causing it.”
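It is easy to see how such phantom patterns arise. The sketch below is my own illustration, not Carraway's: it generates a target and a pile of candidate predictors that are all pure random noise, then reports the strongest correlation it can find. The row count, feature count and seed are arbitrary assumptions.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def strongest_noise_correlation(n_rows=30, n_features=200, seed=0):
    """Search unrelated noise features for the one that best 'explains' a noise target."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_rows)]
        r = pearson(feature, target)
        if abs(r) > abs(best):
            best = r
    return best


print(f"strongest correlation found in pure noise: {strongest_noise_correlation():.2f}")
```

With only 30 rows and 200 candidates to choose from, the winner will typically look like a respectable relationship even though, by construction, nothing is there.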
To mitigate that risk, Carraway proposes the “small experiment” that differs “from big ‘data mining’ (the term used for mucking around in big data for whatever you can find) in that it is deliberately constructed to test whether or not a perceived pattern is real or a figment of our overactive imaginations ... A carefully constructed experiment can make you far more confident that what you have spotted is real and therefore actionable.”
In analytics practice, small experiments are often exercises in cross validation, or more precisely holdout validation, wherein the big data set is partitioned into disjoint random subsets for training, tuning and testing. “By generating insights on a subset of our data ... we can experimentally test these insights on ‘holdout samples’... If the pattern persists over the holdout sample, we can then move to test it further by designing an experiment to gather new ‘live’ data.” Summarizing the value of big data and small bets for business, Carraway opines: “The enhanced ability that exists today to spot patterns and identify potentially exploitable relationships must be accompanied by the ability to do some good, old-fashioned ‘fact-checking’ in the form of small experiments to confirm assumptions and hypotheses.”
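The partitioning step might look something like the following Python sketch. Everything specific here is an assumption of mine: the 60/20/20 split, the synthetic data with one genuinely predictive feature, and the use of a simple correlation as the "pattern." The idea is only to show a relationship spotted on the training rows being re-checked on rows that played no part in finding it.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def three_way_split(n_rows, fractions=(0.6, 0.2), seed=0):
    """Shuffle row indices and cut them into disjoint train/tune/test lists."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    n_train = round(fractions[0] * n_rows)
    n_tune = round(fractions[1] * n_rows)
    return idx[:n_train], idx[n_train:n_train + n_tune], idx[n_train + n_tune:]


def correlation_on(rows, feature, target):
    """Correlation between feature and target, restricted to the given rows."""
    return pearson([feature[i] for i in rows], [target[i] for i in rows])


# Hypothetical data: the feature truly drives the target, plus some noise.
rng = random.Random(1)
feature = [rng.gauss(0, 1) for _ in range(500)]
target = [x + rng.gauss(0, 0.5) for x in feature]

# The tune subset would be used for model tuning; it is unused in this sketch.
train, tune, test = three_way_split(len(target))
r_train = correlation_on(train, feature, target)
r_test = correlation_on(test, feature, target)
print(f"train r = {r_train:.2f}, holdout r = {r_test:.2f}")
```

A real relationship like this one should hold up on the holdout rows. A pattern that was merely the luckiest of many noise candidates usually will not, and that is the cue to drop it before betting on a live experiment.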
Me? Twenty years ago, I was a statistical purist, an unabashed, hypothesis-driven fanatic – perhaps as much as the first author. In fact, I probably would have been considered a top-down analytics “planner.” Since then, however, I have broadened my horizons to a wider data science discipline that combines statistical orthodoxy with large N data, exploratory visualization and machine learning techniques – balancing an aggressive search for predictive relationships with the cross-validating protection of the no-pattern null hypothesis. So I guess I’ve now evolved into more of a bottom-up, data-driven “searcher.” Statistical science, though still a central tool, is only a part of a larger analytics portfolio.