I recently came across an outstanding article on data science, thanks to the always-informative R-bloggers website. Written by Stanford professor David Donoho, “50 Years of Data Science” views DS through a historical lens and also provides a conceptual framework for the evolving discipline.
The point of departure for 50 Years is the current squabble in the data industry as to whether data science is really the same as traditional statistics. I've argued many times that the disciplines are quite different, but, alas, others disagree.
Donoho, a statistician, believes the fields are different. His starting point is the simple but elegant depiction of data science as the science of learning from data. Most definitions today, mine included, focus on skills – the “industrial” – rather than the basic academic or “intellectual” foundations, which are independent of particular technologies and algorithms.
50 Years pays homage to the long-standing debate on the perils of an overly mathematicized statistical discipline, deftly detailed by John Tukey in his vanguard 1962 paper, “The Future of Data Analysis”. There, Tukey opined, “as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. ... All in all I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate”.
Tukey's depiction of “data analysis” as engaging “1. The formal theories of statistics 2. Accelerating developments in computers and display devices 3. The challenge, in many fields, of more and ever larger bodies of data 4. The emphasis on quantification in an ever wider variety of disciplines” was remarkably prescient.
The divide between mathematical statistics and “data analysis” persisted with Tukey's then-junior colleagues, John Chambers and William Cleveland, at the august Bell Labs. Chambers, co-developer of the R-predecessor S language, weighed in starkly in his article “Greater or Lesser Statistics, A Choice for Future Research”: “The statistics profession faces a choice in its future research between continuing concentration on traditional topics – based largely on data analysis supported by mathematical statistics – and a broader viewpoint – based on an inclusive concept of learning from data. The latter course presents severe challenges as well as exciting opportunities. The former risks seeing statistics become increasingly marginal...”
Cleveland coined the term data science in his 2001 paper, “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics”. Cleveland's data science “should be judged by the extent to which (it) enables the analyst to learn from data.” Models and Methods for Data and Computing with Data are paramount among Cleveland's data science foci.
Nowhere is the schism between mathematical statistics and learning from data more prominent than in the seminal 2001 paper “Statistical Modeling: The Two Cultures” by the late UC Berkeley statistician Leo Breiman, who also takes issue with the mathematization of statistics. Breiman first identifies the cultures: “There are two goals in analyzing the data: Prediction. To be able to predict what the responses are going to be to future input variables; [Inference]. To [infer] how nature is associating the response variables to the input variables.”
Breiman then chides statistics for its obsession with mathematical models. “The statistical community has been committed to the almost exclusive use of [generative] models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. [Predictive] modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on [generative] models ...”
Breiman's emphasis on predictive in contrast to generative models led to the “secret sauce” methodology for developing predictive models now called the Common Task Framework. CTF includes three ingredients: “(a) A publicly available training dataset involving, for each observation, a list of (possibly many) feature measurements, and a class label for that observation. (b) A set of enrolled competitors whose common task is to infer a class prediction rule from the training data. (c) A scoring referee, to which competitors can submit their prediction rule. The referee runs the prediction rule against a testing dataset which is sequestered behind a Chinese wall. The referee objectively and automatically reports the score (prediction accuracy) achieved by the submitted rule.” Emphasis on the CTF has directly led to noticeable performance improvements in predictive modeling.
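To make the three CTF ingredients concrete, here is a minimal Python sketch. Nothing in it comes from Donoho's paper: the synthetic dataset, the `Referee` class and the two toy competitors are all hypothetical names invented for illustration. The one point it tries to capture is that competitors see only the training data, while the referee alone holds the test data and reports nothing but a score.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[Tuple[float, float], int]   # (feature measurements, class label)
Rule = Callable[[Sequence[float]], int]     # a submitted prediction rule


def make_dataset(n: int, seed: int) -> List[Example]:
    """Synthetic two-class data (an assumption for this sketch):
    the label is 1 when the two features sum to more than 1."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = (rng.random(), rng.random())
        data.append((x, int(x[0] + x[1] > 1.0)))
    return data


class Referee:
    """(c) Holds the sequestered test set and reports only accuracy."""

    def __init__(self, test_set: List[Example]) -> None:
        self._test_set = test_set  # never handed to competitors

    def score(self, rule: Rule) -> float:
        hits = sum(rule(x) == y for x, y in self._test_set)
        return hits / len(self._test_set)


def majority_competitor(train: List[Example]) -> Rule:
    """Baseline: always predict the most common training label."""
    ones = sum(y for _, y in train)
    label = int(2 * ones >= len(train))
    return lambda x: label


def threshold_competitor(train: List[Example]) -> Rule:
    """Choose the cutoff on x[0] + x[1] with the best training accuracy."""
    def train_accuracy(cutoff: float) -> float:
        return sum(int(x[0] + x[1] > cutoff) == y for x, y in train) / len(train)

    candidates = [i / 20 for i in range(41)]  # 0.0, 0.05, ..., 2.0
    best = max(candidates, key=train_accuracy)
    return lambda x: int(x[0] + x[1] > best)


train = make_dataset(500, seed=1)               # (a) public training data
referee = Referee(make_dataset(200, seed=2))    # (c) sequestered test data
competitors = {                                 # (b) enrolled competitors
    "majority": majority_competitor,
    "threshold": threshold_competitor,
}

# Each competitor fits on the training data, then submits its rule for scoring.
leaderboard: Dict[str, float] = {
    name: referee.score(fit(train)) for name, fit in competitors.items()
}
for name, accuracy in sorted(leaderboard.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {accuracy:.3f}")
```

Because every rule is scored by the same referee on the same hidden data, the leaderboard numbers are directly comparable, which is what lets a community measure incremental progress on a shared task.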
Building on this history and the emergence of data science in academia, Donoho's 50 Years articulates a Greater Data Science (GDS) discipline with six divisions: “1. Data Exploration and Preparation 2. Data Representation and Transformation 3. Computing with Data 4. Data Modeling 5. Data Visualization and Presentation 6. Science about Data Science”.
Through the lens of “learning from data”, Donoho's specializations resonate well today – and can grow with the discipline over time. The foci on exploration, preparation, transformation, computation, visualization and presentation clearly differentiate GDS from traditional mathematical statistics. The R Project for Statistical Computing receives special kudos for having “transformed the practice of data analysis by creating a standard language which different analysts can all use to communicate and share algorithms and workflows.” Relatedly, the “Science about Data Science”, with emphases on meta, cross-study, and cross-workflow analyses, establishes DS's bona fides as an academic discipline.
50 Years prognosticates a bright future for data science: “GDS proposes that Data Science is the science of learning from data; it studies the methods involved in the analysis and processing of data and proposes technology to improve methods in an evidence-based manner. The scope and impact of this science will expand enormously in coming decades as scientific data and data about science itself become ubiquitously available. ... In 2065, mathematical derivation and proof will not trump conclusions derived from state-of-the-art empiricism. Echoing Bill Cleveland’s point, theory which produces new methodology for use in data analysis or machine learning will be considered valuable, based on its quantifiable benefit in frequently occurring problems, as shown under empirical test.”
Exciting times for data freaks, indeed.