I just participated in a webinar entitled “Machine Learning for Hackers” by two current Ph.D. students, psychologist John Myles White of Princeton and political scientist Drew Conway of New York University.
To be honest, I was underwhelmed by their presentation. The MLH title had me anticipating exposure to the latest machine learning techniques, or perhaps ingenious applications of ML methods in business, or maybe even a new software implementation from open source “hackers.” Instead, most of the hour was devoted to the vanilla statistical methods of linear and logistic regression implemented in core R. You need only Google “linear regression with R” and “logistic regression with R” to see how those topics have already saturated the predictive analytics knowledgebase.
White and Conway are back in my good graces following my review of their new book of the same name, which delivers where the webinar did not. Viewing machine learning as a discipline that “blends concepts and techniques from many different traditional fields, such as mathematics, statistics, and computer science,” the authors aim “to teach machine learning through selected case studies.” They use the R statistical computing platform for their illustrations, acknowledging: “The data sets we use are relatively small, and all of the systems we build are prototypes or proof-of-concept models.”
I like the case study approach with R a lot. The code included with MLH provides exposure to important R packages and programming motifs. The book data sets are varied, if small. And MLH examines classification and regression problems in a much wider context than the webinar, introducing a host of techniques from outside traditional statistics. As a big fan of both the power and performance of the glmnet package, I welcome the discussions of over-fitting, regularization and cross-validation. My overall evaluation of the book: two thumbs up.
Conway has been a leader in defining the emerging data science discipline for some time, detailing the requisite skills: “First, one must have hacking skills …(which) in this context mean proficiency working with large, unstructured chunks of electronic data … Second, one needs a basic understanding of mathematics and statistics … Finally, and perhaps most importantly, a data scientist must have some substantive expertise in the data being analyzed.”
Rachel Schutt, who’s teaching “Introduction to Data Science” at Columbia, sees the field as “the intersection of statistics, computer science, data visualization and the social sciences.” A Ph.D. statistician, Schutt has collaborated with eminent Columbia statistician and political scientist Andrew Gelman. Gelman’s “Statistical Modeling, Causal Inference and Social Science” blog should be a must-read for data scientists.
Duncan Watts, author of the terrific “Everything is Obvious: *Once You Know the Answer – How Common Sense Fails Us”, is both an academic sociologist and a Yahoo! data scientist. The questions he attempts to answer in the book are of interest to both academics and profit-maximizing businesses.
Finally, Paul Allison and the other authors of the excellent statistical methodology courses from Statistical Horizons are primarily social scientists who emphasize statistical analyses in their academic work, much of which is pertinent to data science.
So is it the case that disciplines such as psychology, sociology, economics and political science are fertile training grounds for data scientists? I think yes, for several reasons: First, data science questions are often social science questions – the research into social networking is an obvious example. Second, the social sciences are now very statistical disciplines. It’s almost impossible to get an advanced degree from a top program without considerable exposure to statistical techniques. And third, the methods that social scientists now use are evolving from traditional surveys to examination of natural data – often big, natural data like that used by data scientists.
Gary King, Professor of Government and Director of the Institute for Quantitative Social Science at Harvard, is a leader of the academic “data science” movement. King opined in a 2011 article: “… social scientists are getting to the point in many areas at which enough information exists to understand and address major previously intractable problems that affect human society. Want to study crime? Whereas researchers once relied heavily on victimization surveys, huge quantities of real-time geocoded incident reports are now available. What about the influence of citizen opinions? Adding to the venerable random survey of 1000 or so respondents, researchers can now harvest more than 100 million social media posts a day and use new automated text analysis methods to extract relevant information.”
I’d encourage readers to review King’s research/software as well as that of other IQSS practitioners. I’m a regular consumer of the outstanding IQSS blog and take advantage of the many IQSS contributions to statistical methodology and R software in my own work. I hope to bring some of Gary’s perspectives on quantitative analyses to IM readers in coming months.