My Wake Forest and Atlantic Coast Conference (ACC) daughter got the last soccer laugh, thoroughly enjoying watching her dad (me) eat analytics crow. A few weeks ago, responding to her bravado that all four top seeds in the D1 women’s soccer tournament were from the ACC, I used RPI rankings to counter that regular-season data actually had the Big Ten as the top conference going into the postseason, by my calculations slightly edging the ACC. The presumption was that the Big Ten and ACC would joust through the six rounds of the tournament.
It didn’t end up that way. Three of this year’s final four teams are from the ACC, with the Big Ten laid to rest in the quarterfinals. Just as bad for the Big Ten, ACC teams won five of the six head-to-head matches between the conferences. Alas, to my daughter’s taunts I could only meekly respond that analytics is probability, not math.
I’ve fared somewhat better with my evaluation of the new data science books. The argument that data science, with its emphasis on data munging/wrangling/integration, is more than just predictive analytics has so far generated more yeas than nays. The post also received indirect affirmation from an interesting source.
Regular readers of Open Thoughts on Analytics know I’m partial to the work of the new breed of quantitative social scientists whose methodologies include the Internet/big data/statistical learning in addition to traditional surveys and regression. For my money, the teachings of academics like these provide much guidance for the developing field of data science.
I regularly report on research from Harvard’s Gary King and the Institute for Quantitative Social Science (IQSS). In fact, King thinks much of current IQSS work is now indistinguishable from data science. A recent IQSS study on Chinese censorship is methodologically groundbreaking.
Two years ago I wrote a series on the terrific book, Everything is Obvious: Once You Know the Answer, by sociologist and then Yahoo scientist, Duncan Watts. The engaging read not only challenged the notion that humans are the rational, objective, non-biased actors and decision-makers we think we are, but also introduced big data natural experiments to debunk the myths. Social science, big data and analytics.
I also generate ideas from the splendid blog by Columbia statistician/political scientist, Andrew Gelman. It turns out that Gelman was also the PhD advisor for Rachel Schutt, one of the authors of Doing Data Science, the DS read I enthusiastically blogged about last week. Like me, he contrasts Doing Data Science with Data Science for Business, the other tome included in the review.
Don’t include Gelman among those who decry data science as little more than statistics rebranded. Indeed, Gelman sees data science as the much more difficult field. “... why do descriptions of data science focus so strongly on statistical tasks? ... I think it’s because statistics is the fun part and the part that, in this context, is new. The tech industry has always had to deal with databases and coding; that stuff is a necessity. The statistical part of data science is more of an option.”
Like me, Gelman also views data science as a discipline larger than statistics and predictive analytics. “There’s so much that goes on with data that is about computing, not statistics ... The tech industry has always had to deal with databases and coding; that stuff is a necessity. The statistical part of data science is more of an option ... Statistics can do all sorts of things. I love statistics! But it’s not the most important part of data science, or even close.”
Finally, Gelman sees Data Science for Business as a far different read than Doing Data Science. On DSB, “... what surprised me, given the book’s title and our recent discussion on the nature of data science, was that the book was 100% statistics ... No code, no instructions on how to scrape or munge or whatever. There were some passages on data preprocessing and other nitty-gritty issues, but not so much ... In any case, now I understand more why people say that ‘data science’ is just another word for ‘statistics’ (as applied to a particular sort of problem). If data science is defined as by Rachel Schutt and Cathy O’Neil, then, no, it’s a lot more than statistics, indeed statistics is only a small part. But using Foster Provost and Tom Fawcett’s implicit definition, data science is just statistics, albeit reframed and refocused in a way that is more useful for certain online settings.”
Amen. I’ll have a lot more to say about the work of Andrew Gelman and other progressive academic social science quants in future posts.