Several weeks ago, I came across an article, Data Science and its Relationship to Big Data and Data-Driven Decision Making, by Foster Provost and Tom Fawcett, that left me a bit uncomfortable. The authors have also just published a related book, Data Science for Business, which I've not yet read. Provost is on the faculty of NYU and is affiliated with the university's new MS in Data Science program.
First the positives of the article: It’s certainly good to see the nascent discipline of data science receive the attention from academia it now deserves. New DS Masters curricula show promise of significant improvement over first generation Predictive Analytics programs, especially with added emphasis on computation.
That the authors promote DS in the service of data-driven business (DDB) is an important point of departure. The support comes in several flavors, the first being the traditional performance management that's kept BI busy for 20 years. The second and newer role of DDB is espousing data/analytics as product, with LinkedIn, Amazon, and Facebook offering compelling illustrations of this model.
The fundamental concepts of data science the article articulates offer solid generic guidance for analytics professionals. To summarize, a top-down machine learning methodology with an obsession with the underlying design of data generation/collection is central to both the internal and external validity of data science inquiries. A randomized experiment or a time series with a natural control group is much more "contextual" a design, and promotes greater confidence, than a one-shot correlational analysis. And data scientists must be wary of being too credulous with their data, taking precautions not to overfit ML models. All well and good.
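The overfitting precaution above is easy to demonstrate. Here's a minimal sketch in Python (my own toy illustration, not from the article, using entirely synthetic data): fitting polynomials of increasing degree to noisy samples, training error keeps falling while held-out error eventually worsens, which is exactly the credulity trap the authors warn about.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a simple underlying process: y = sin(x) + noise
x = rng.uniform(0, 3, 40)
y = np.sin(x) + rng.normal(0, 0.25, 40)

# Hold out a test set -- the basic precaution against overfitting
x_train, y_train = x[:30], y[:30]
x_test, y_test = x[30:], y[30:]

def rmse(pred, actual):
    return np.sqrt(np.mean((pred - actual) ** 2))

def fit_and_score(degree):
    # Fit a polynomial of the given degree on the training data only,
    # then score it on both the training and held-out points
    coefs = np.polyfit(x_train, y_train, degree)
    train_err = rmse(np.polyval(coefs, x_train), y_train)
    test_err = rmse(np.polyval(coefs, x_test), y_test)
    return train_err, test_err

# Higher degrees typically drive training error down while held-out
# error stops improving or gets worse -- the signature of overfitting
for d in (1, 3, 12):
    train_err, test_err = fit_and_score(d)
    print(f"degree {d:2d}: train {train_err:.3f}, test {test_err:.3f}")
```

The held-out split stands in for the more rigorous designs the article favors; the point is only that validation against data the model never saw is what keeps a flexible learner honest.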
So whence my angst? It's the data science schema of Figure 1, which seems to relegate data engineering and programming to auxiliary status in the DS process. With DE on the periphery, DS is essentially traditional predictive analytics with business flair. Indeed, the article's cited DS examples of churn in the telecom industry and the Walmart big data finding of increased Pop-Tart demand in anticipation of hurricane landfalls are textbook PA.
The authors elaborate on their thinking. "Data engineering and processing are critical to support data-science activities, as shown in Figure 1, but they are more general and are useful for much more. Data-processing technologies are important for many business tasks that do not involve extracting knowledge or data-driven decision making, such as efficient transaction processing, modern web system processing, online advertising campaign management, and others. … In ten years' time, the predominant technologies will likely have changed or advanced enough that today's choices would seem quaint."
Fair enough, but I'd counter that you could make the same argument about statistical and machine learning methods: they too are more general than their supporting role in data science. And even as the data science discipline evolves, I'd posit the data side will continue to consume the preponderance of DS energy.
For my money, data science is a four-headed organism focused on business, data, analytics, and narrative. Data engineering is every bit as critical to DS as predictive modeling. Yes, DE changes dramatically each technical generation, as it must to keep up with demand for newer, larger, and more disparate sources of data. But data challenges aren't going to fade away, or even become "solved" problems. Indeed, I'd conjecture that analytic modeling is at higher risk of becoming commoditized than data engineering, as machine learning replaces the "theorizing" of analysts. Techniques such as resampling, flexible splines, and regularization already have analytics headed down that path. Get the data set ready, stand back, and let the models rip.
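The "stand back and let the models rip" commoditization can be sketched concretely. Below is a hedged toy illustration in Python (mine, not the article's; data and the lambda grid are synthetic assumptions): closed-form ridge regression, with the regularization strength chosen mechanically on a held-out split. No analyst theorizing required once the data set is prepared, which is precisely the point.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic regression problem: 30 features, only three truly informative
n, p = 80, 30
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:3] = [2.0, -1.5, 1.0]
y = X @ true_beta + rng.normal(0, 1.0, n)

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: (X'X + lam*I)^-1 X'y
    # The lam*I penalty shrinks coefficients toward zero, guarding
    # against overfitting without any hand-crafted variable selection
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# Mechanical model selection: grid-search lambda on a held-out split
X_tr, y_tr = X[:60], y[:60]
X_val, y_val = X[60:], y[60:]

def val_err(lam):
    beta = ridge_fit(X_tr, y_tr, lam)
    return np.mean((X_val @ beta - y_val) ** 2)

grid = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam = min(grid, key=val_err)
print("selected lambda:", best_lam)
```

Once the pipeline exists, the modeling step is a loop over a grid; the hard, unautomated labor was generating and assembling X and y, which is the data engineering argument in miniature.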
I've been in the data/analytics world my entire 30+ year career. In statistics graduate school, a wise professor told me to become a capable programmer because I'd probably spend most of my work time integrating data to support modeling efforts. He was right: my guess is that over the years, perhaps 75% of my statistical effort has been devoted to data management and programming.
The data engineering technologies have changed over the years, but challenges continue and take new shapes. And the pace has been accelerating over time. In 1982 I started out programming PL/I against hierarchical databases on expensive IBM mainframes. Three years later, I was using relational databases on Unix/C mini/microcomputers -- technologies that fundamentally enabled modern decision support-business intelligence.
The size of inquiries I could address economically jumped dramatically in that short period, and continues to be driven by Moore’s Law today. Now, much of my DS work is PC-based. 100 MB on the mainframe was big back in the day; 500 GB or even several terabytes is nothing now. Yet with all these performance advances, I still spend 75% of my DS time wrangling data. The problem domain demands expand with the technical advances.
I hope I'm wrong, but I'm concerned that the ubiquity and drudgery of data engineering that DS practitioners experience daily escapes the academic world, where divining the next great ML algorithm or applying models to interesting business problems is seen as sexier than wrestling with terabyte data sets. If my career is any guide, that perspective would be mistaken.
I’m a big fan of the new university curricula in data science. The academic advances, however, must be grounded by the hard realities of the day-to-day DS data grind.