OpenBI is just wrapping up a machine learning engagement with one of our customers. Truth be told, instead of machine learning, the work’s probably more accurately pegged as consultant learning. What started out looking like an archetypical association modeling exercise quickly morphed into a data access and exploration “challenge.”
The day one “plan” of a dash of Hive, a pinch of SQL and two cups of R turned into a stew of Hive, Pig, MapReduce, Java, Pentaho Data Integration and R. And the algorithms we ended up deploying look quite a bit less sexy than the elegant formulations we started with.
The bad news is that our efforts took a different turn every other week during the two-month engagement. The good news is that, in the end, what was learned will provide “lift” to the customer in the form of a prototype for a new piece of data business.
In retrospect, I guess I shouldn’t be surprised. The speed bumps we experienced along the way are common in the data science world where, alas, data sets are unlike the ones we analyzed in statistics classes.
OpenBI’s experiences with the complexities of data science were reconfirmed in an excellent presentation entitled “Big Data is Not About the Data!” by Harvard Professor of Government and Institute for Quantitative Social Science (IQSS) Director Gary King.
King’s point of departure, that big data is meaningless without theories and algorithms, contrasts with that of Viktor Mayer-Schonberger and Kenneth Cukier, authors of “Big Data: A Revolution That Will Transform How We Live, Work, and Think,” who argue that big data’s emphases on N=all and on correlation over cause and effect are now paramount. For the big data “revolution,” size and correlation trump algorithms and science. For King, more conservative, scientifically-grounded theories and algorithms rule. With the IQSS research, “The trick is to make yourself vulnerable to being proven wrong as many times as possible.”
The IQSS has progressed quite a ways from the traditional political science methods of surveys, government statistics and one-off studies of people, places and events. In fact, even though King calls what he and his colleagues do quantitative social science, he’s just as comfortable with the business monikers of big data analytics and data science. Experts vs. analytics of the BI world equals qualitative vs. quantitative research in academia. For King et al., qualitative and quantitative methods are merging, to the betterment of both. Indeed, many of the new “algorithms” the IQSS develops involve computers working in tandem with experts. Machines do the heavy lifting, while experts handle the explanatory finesse. A fascinating illustration of this “new” research is outlined here.
Of course it helps that IQSS’ers are wicked Harvard smart. King shares the story of a colleague who came to his team bemoaning the inability of the statistical package Stata to scale for big data. Not a problem for the professor and his graduate student, who re-conceptualized the challenge in an afternoon to run on a PC!
Current techniques don’t quite fit? No sweat, the IQSS develops a new one. King and his students were unhappy with existing classification algorithms for research involving interviews of families in the developing world to predict causes of death. Aggregating individual classifications by category wasn’t accurate enough, and besides, the research team was interested only in category percentages, not individual classifications – “everybody, not anybody.” They ended up divining a new method that was the foundation for a startup company, Crimson Hexagon.
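To make the “everybody, not anybody” distinction concrete, here is a minimal sketch in Python. It is not the IQSS method; it uses a generic textbook correction (sometimes called adjusted classify-and-count), and the three categories, their true shares and the classifier's error rates are all made-up numbers for illustration. The point is only that tallying individual classifications can misstate category percentages even when the classifier is decent, while estimating the percentages directly can undo the bias.

```python
import numpy as np

# Hypothetical true shares of three cause-of-death categories (assumed numbers).
true_shares = np.array([0.7, 0.2, 0.1])

# Hypothetical classifier error rates: confusion[i, j] is the probability that
# a case truly in category j gets labeled as category i. Each column sums to 1.
confusion = np.array([
    [0.8, 0.2, 0.3],
    [0.1, 0.7, 0.2],
    [0.1, 0.1, 0.5],
])

# "Classify and count": the shares you get by tallying individual labels.
# In expectation these equal confusion @ true_shares, not true_shares.
naive = confusion @ true_shares


def adjusted_shares(observed, confusion):
    """Estimate category shares directly by solving observed = confusion @ shares."""
    shares = np.linalg.solve(confusion, observed)
    shares = np.clip(shares, 0.0, None)  # guard against small negative estimates
    return shares / shares.sum()         # renormalize so shares sum to 1


adjusted = adjusted_shares(naive, confusion)

print("true    :", true_shares)
print("naive   :", naive)
print("adjusted:", adjusted)
```

In practice the error rates would have to be estimated from a hand-labeled sample rather than known in advance, so the corrected shares would carry sampling noise that this sketch ignores.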
If you’re worried about the future of Social Security, you have plenty of company at the IQSS. Using much more sophisticated mortality forecasting techniques than those used by the Social Security Administration for 75 years, King and colleagues estimate the pension’s underfunded to the tune of a trillion dollars. Delaying needed changes will only exacerbate the problem. Die young!
The IQSS is exercising its joint expert/computer-driven approach in studying sentiments of Congress. Analyzing every Congressional press release over three years, the “Consilience” methodology characterized fully 27 percent of the communications as “partisan taunting.” Some congressmen taunt hardly at all, while many are much worse than the norm. The IQSS plans to go all the way back through Congressional history to create a taunting “performance dashboard” website.
Finally, similar sentiment analyses of Chinese Web posts reveal that the government goes into censorship gear not over negativity and criticism but rather over attempts to collectivize, organize or mobilize the citizens, however innocently. Profane the leadership all you wish. Just don’t attempt to assemble a crowd to celebrate the beautiful weather.
I agree with King’s contention that quantitative social science = data science is at the very top of impactful university research today. Look for the IQSS to continue to advance the discipline from its academic podium. And I’ll be sure to monitor progress from the Open Thoughts on Analytics blog.