I just finished reading two new books on data science.
The first, Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking, was written by Foster Provost and Tom Fawcett. The second, Doing Data Science: Straight Talk from the Frontline, was co-authored by Rachel Schutt and Cathy O’Neil.
Data Science for Business is a pretty good book, but its subject is predictive analytics, not data science. It’s the longer version of the article by the authors I blogged on several weeks ago, with the same DS shortcoming: a lack of attention to the all-important topics of computation and data programming. Yes, data engineering and computation support more than just data science. But expertise in programming is a sine qua non for data scientists, and to divorce those skills from the treatment of DS is a disservice to the discipline. Indeed, when I’m recruiting students to hire as apprentice data scientists at OpenBI, programming and computation are at the top of the skill list, ahead of statistical science and machine learning.
For my money, data science is a four-headed organism with focuses on business, data, analytics and narrative. DSB covers the business and analytics parts very well. Ten of the fourteen chapters, 270 of 345 pages, are devoted to predictive analytics, the remainder to business. The chapters on “Overfitting and Its Avoidance” and “Evidence and Probabilities” are noteworthy.
If DSB feels like a top-down, somewhat diminished academic treatment of “data science”, Doing Data Science seems every bit the bottom-up perspective of the in-the-trenches practitioner. To be sure, Schutt and O’Neil possess serious academic chops (Schutt holds a PhD in statistics from Columbia, while O’Neil earned a doctorate in math from Harvard), but it’s the authors’ industry creds on top of their academics that make this book compelling. Schutt acknowledges her “education” moving from Columbia to a job at Google: “It was clear to me pretty quickly that the stuff I was working on at Google was different than anything I had learned in school when I got my PhD in statistics ... there were also many skills I had to acquire on the job at Google that I hadn’t learned in school.”
Schutt and O’Neil distinguish academic and industry flavors of data science. An academic DS is a scientist “trained in anything from social science to biology, who works with large amounts of data, and must grapple with computational problems”. An industry data scientist is “someone who knows how to extract meaning from and interpret data, which requires both tools and methods from statistics and machine learning ... She spends a lot of time in the process of collecting, cleaning and munging data ... This process requires persistence, statistics and software engineering skills.”
I got a lot out of Doing Data Science, finding the chapter organization around business problem specification, analytics formulation, data access/wrangling, and computer code to be very helpful in understanding DS solutions. The chapter on “Spam Filters, Naïve Bayes and Wrangling” is a good example of the approach. The authors start with a “Thought Experiment” to motivate the discussion, then consider alternative prediction models, settle on Naïve Bayes, and show how the algebra works. Finally, they gussy up the model, illustrating with shell and R scripts that detail web-scraping, data wrangling, organizing and formatting to create train and test data sets for subsequent analyses. In the end, the reader sees all facets of the data science approach illustrated multiple times.
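For readers who want a feel for that algebra, here is a minimal sketch of a Naïve Bayes spam filter. It is not the book's code (the authors work in shell and R); it is written in Python, and the toy messages, the function names, and the choice of a multinomial model with add-one (Laplace) smoothing are all assumptions made for illustration. The idea is to score each class by log P(class) plus the sum of log P(word | class) over the words in a message, then pick the class with the higher score.

```python
import math
from collections import Counter

# Hypothetical toy training data, invented for this sketch.
TOY_MESSAGES = [
    ("win cash now", "spam"),
    ("cheap meds win prize", "spam"),
    ("meeting at noon", "ham"),
    ("lunch meeting tomorrow", "ham"),
]


def train(messages):
    """Count words per class and messages per class."""
    word_counts = {}
    class_counts = Counter()
    for text, label in messages:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return word_counts, class_counts, vocab


def log_posteriors(text, model, alpha=1.0):
    """Return unnormalized log P(class | text) for each class.

    Uses the 'naive' assumption that words are independent given the
    class, with add-alpha smoothing so unseen words never give log(0).
    """
    word_counts, class_counts, vocab = model
    total_messages = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Log prior: fraction of training messages in this class.
        score = math.log(class_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        denominator = total_words + alpha * len(vocab)
        for word in text.lower().split():
            if word not in vocab:
                continue  # skip words never seen in training
            score += math.log((word_counts[label][word] + alpha) / denominator)
        scores[label] = score
    return scores


def classify(text, model):
    """Pick the class with the highest log posterior."""
    scores = log_posteriors(text, model)
    return max(scores, key=scores.get)


model = train(TOY_MESSAGES)
print(classify("win cash prize", model))
print(classify("lunch meeting", model))
```

Working in log space avoids the numerical underflow that comes from multiplying many small probabilities, and the smoothing term keeps a single unseen word from zeroing out a class. A real filter would need the wrangling steps the chapter describes before any of this is useful.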
I’ve had the pleasure of meeting both Schutt and O’Neil at the Strata conferences in Santa Clara over the last couple of years and am impressed with their data science thinking. Schutt’s DS and O’Neil’s DS/financial math work experiences complement their academic perspectives nicely. Add to that their experience getting data science on the map at prestigious Columbia, and they have my DS bases well covered.
All told, one decent book on predictive analytics and one excellent read on data science. If the tomes were DS programs, though, I’d no doubt sign up for Doing Data Science.
Look for the continued emergence of data science curricula in academia, with new master’s programs like those at NYU, Columbia, Berkeley and Illinois Institute of Technology supplanting their first-generation Predictive Analytics kin. The winning curricula, like the winning data science book, will combine the vision, organization and rigor of the academy with attention to the day-to-day technical demands of practitioners. I’ll propose my own version of an MS data science curriculum in an upcoming blog.