More on Statistics vs Data Science

I met up with an old stats grad school friend the other day. When last we got together a few years ago, he went on a rant about “data science”, suggesting the term's nothing more than a pretentious new moniker for the same statistical work he's been doing for 35 years. I disagreed, noting a substantial evolution from our early statistics days in the breadth of problems, especially involving computation, we address today. I guess his thinking about the statistics-data science divide was akin to FiveThirtyEight's Nate Silver; mine was more like statistician Andrew Gelman.

I was a bit surprised to note my friend had mellowed little in his statistical thinking. He did acknowledge that predictive modeling from traditional statistics serves a different purpose than the machine learning prominent in business today – and, more importantly, that both types must now be part of the modeler's arsenal.

Statistical models, which emphasize inference derived from underlying or “generative” probability distributions, are concerned with both explanation and prediction, while ML obsesses over prediction, laser-focused on algorithms that learn from data.

As the late statistical model skeptic Leo Breiman noted: “The goals in statistics are to use data to predict and to get information about the underlying data mechanism. Nowhere is it written on a stone tablet what kind of model should be used to solve problems involving data. … the emphasis needs to be on the problem and on the data.”

My friend budged little at my argument that the computational challenges faced by today's “data scientists” are well outside the purview of the traditional statistics we learned in grad school and practiced in the 80s. And it finally dawned on me that perhaps the term “data science” has become too much of a lightning rod for him. I wondered if, instead of a statistics-data science dichotomy, my friend would be more comfortable calling it an evolution from Analytics 1.0, where we started years ago, to the more comprehensive Analytics 2.0 of today, as nicely articulated by Michael Li. At least Li's statement pays lip service to the importance of the past!

“Big data and data science is so much in vogue that we often forget there were plenty of professionals analyzing data before it became fashionable. This can be thought of as a divide between Analytics 1.0, practiced by those in traditional roles like data analysts, quants, statisticians, and actuaries, and Analytics 2.0, characterized by data scientists and big data. Many companies scrambling to hire data science talent have begun to realize the wealth of latent analytics talent right at their fingertips — talent capable of becoming data scientists with a little bit of training. In other words, the divide between Analytics 1.0 and 2.0 is not as wide as you might believe.”

Or perhaps, like academics Provost and Fawcett, my friend acknowledges the increased demands for computation and data management in the statistical world, but sees those tools not as part of data science but as a separate, complementary discipline of data engineering and processing. However you apportion it, those who sell themselves as data scientists in today's marketplace must present advanced computation skills to get good jobs.

We do know that data science-oriented computation classes are becoming more common in university statistics curricula, driven by recognized need.

“A growing number of students are completing undergraduate degrees in statistics and entering the workforce as data analysts. In these positions, they are expected to understand how to use databases and other data warehouses, scrape data from Internet sources, program solutions to complex problems in multiple languages, and think algorithmically as well as statistically. These data science topics have not traditionally been a major component of undergraduate programs in statistics. Consequently, a curricular shift is needed to address additional learning outcomes.”

For me, the data science vs statistics polemic of eminent statistician David Donoho pretty much lays the issues to rest. With a simple point of departure that data science is the discipline of learning from data, Donoho uses 50 years of statistical science history to build the case that data science is substantially broader than statistics. Donoho conceptualizes a Greater Data Science that consists of six divisions:

  1. Data Exploration and Preparation
  2. Data Representation and Transformation
  3. Computing with Data
  4. Data Modeling
  5. Data Visualization and Presentation
  6. Science about Data Science

Data Exploration and Preparation, where I spend most of my statistical time, posits “that 80% of the effort devoted to data science is expended by diving into or becoming one with one’s messy data to learn the basics of what’s in them, so that data can be made ready for further exploitation.” Data Representation and Transformation has to do with the different data formats and representations, such as databases and Hadoop, that data scientists wrestle with.

On Computing with Data, Donoho writes: “Every data scientist should know and use several languages for data analysis and data processing. These can include popular languages like R and Python, but also specific languages for transforming and manipulating text, and for managing complex computational pipelines.” Data Visualization and Presentation is a close cousin to the business intelligence and visual analytics we all know and love.

Data Modeling is simply statistical modeling, with generative models representing the statistical science side, and algorithmic models comprising machine learning. These endeavors pretty much coincide “with traditional Academic statistics and its offshoots.”
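For readers who like to see the distinction in code, here is a minimal Python sketch of the two modeling cultures on the same toy data. It is my own illustration, not anything taken from Donoho or Breiman: the simulated data, the helper names `fit_line` and `knn_predict`, and the choice of k-nearest-neighbors as the "algorithmic" stand-in are all assumptions made for the example.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, se_b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    se_b = math.sqrt(rss / (n - 2) / sxx)
    return a, b, se_b


def knn_predict(xs, ys, x_new, k=5):
    """Average the y values of the k training points nearest to x_new."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k


# Simulated data (an assumption for illustration): y = 2 + 0.5*x + noise.
random.seed(0)
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 0.5 * x + random.gauss(0, 1) for x in xs]

# Generative view: estimate the mechanism and quantify uncertainty about it.
a, b, se_b = fit_line(xs, ys)
print(f"slope = {b:.2f} +/- {1.96 * se_b:.2f} (approx. 95% interval)")
print(f"line prediction at x=5: {a + b * 5:.2f}")

# Algorithmic view: no stated mechanism, just a prediction learned from data.
print(f"k-NN prediction at x=5: {knn_predict(xs, ys, 5.0):.2f}")
```

Both approaches yield a prediction at x=5, but only the fitted line offers a statement about the underlying mechanism (a slope with a standard error). That is the explanation-versus-prediction trade-off in miniature.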

And finally, data scientists are doing “Science about Data Science when they identify commonly-occurring analysis/processing workflows, for example using data about their frequency of occurrence in some scholarly or business domain; when they measure the effectiveness of standard workflows in terms of the human time, the computing resource, the analysis validity, or other performance metric, and when they uncover emergent phenomena in data analysis, for example new patterns arising in data analysis workflows, or disturbing artifacts in published analysis results.”

I found Donoho's conceptualization and supporting arguments compelling and look forward to sharing his thinking with my cynical friend next year.

(About the author: Steve Miller is president and owner of Inquidia Consulting in the Chicago area.)