I finally got home Sunday evening from my crazy travels of last week: from Chicago to San Francisco and back for Strata 2012, then to Winston-Salem, NC and back for my daughter’s college visit.
Throw in bad connections, weather delays and flight cancellations, and you have the makings of quite the memories. Fortunately, despite the many inconveniences, both trips were very much worthwhile.
Strata Santa Clara 2012, the “leading event for the people and technology driving the data revolution,” was an even bigger deal than Santa Clara 2011, which was a significant hit for a 1.0 conference. There were almost 2,500 registrants in 2012, taxing the facilities of the convention center, compared to 1,400 or so last year. The crowd seemed older this year as well, perhaps reflecting the growing maturity of the discipline. Mainstream vendors such as Oracle, Microsoft and IBM were much more prominent in 2012, serving as event sponsors and major exhibitors. Indeed, when I first experienced the extravagant EMC booth in the exhibit hall, I thought I was in Las Vegas rather than Santa Clara. What a difference a year makes.
Several of the 10-15 minute keynotes on Feb. 29-March 1 were exemplary. Doug Cutting of the Apache Software Foundation and Cloudera was ever credible as a data science statesman in his presentation on the Hadoop ecosystem. And I was pleasantly surprised by the Hadoop embrace articulated by David Campbell from none other than Microsoft. Avinash Kaushik was most entertaining in his frenetic performance describing the framework of ex-Secretary of Defense and current data science “luminary” Donald Rumsfeld. Kaushik is much less impressed with the data puking of “known-knowns” than he is with the data orgasm-inducing “unknown-unknowns.”
Physician/data scientist Ben Goldacre presented a sobering view of scientific medicine, arguing that the “information architecture of medicine is broken.” A combination of bad A/B tests (clinical trials), poor synthesis of existing studies, and ineffective communication can torpedo the best intentions of evidence-based medicine. In one meta-analysis of anti-depressant drug studies, what appears to be strong supportive evidence (36 positive studies, 1 negative) turns out to be non-committal at best (37 positive results and 36 negative) when all known studies are summarized.
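To make the arithmetic behind that example concrete, here is a quick sketch in Python. It uses only the counts quoted above; the function and variable names are my own, and it is a simple tally rather than the weighting a real meta-analysis would apply.

```python
def positive_share(positive, negative):
    """Return the fraction of studies that reported a positive result."""
    return positive / (positive + negative)

# Counts as quoted in the talk summary above.
apparent = positive_share(36, 1)     # the evidence as it first appears
all_known = positive_share(37, 36)   # once all known studies are included

print(f"apparent evidence: {apparent:.0%} positive")
print(f"all known studies: {all_known:.0%} positive")
```

Roughly 97 percent positive shrinks to roughly 51 percent, which is about what a coin flip would deliver.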
Finally, Google’s Hal Varian, if not the originator of data science, certainly its patron saint, dazzled with analytics and graphics using Google Trends, Google Insights for Search and R. Data science at its best.
The many breakout sessions forced me to make decisions on the themes I wished to pursue. I opted to pretty much avoid the vendor-sponsored presentations and decided also to leave the technical, big data infrastructure sessions to my OpenBI colleagues. My foci were data strategy, visualization and analytics.
Karmasphere founder Martin Hall contrasted data science teams with their traditional BI/DW progenitors, with the main point of departure being relational database versus Hadoop-NoSQL infrastructures. Comfortingly, team roles are conceptually not too dissimilar, with a path from BI’s data analysts to DS’s data scientists, and from BI’s DBAs/ETL programmers to DS’s MapReduce-NoSQL coders. Emerging software that bridges BI and DS data might turbo-charge that transition.
Tableau executive, Stanford Ph.D. and visualization notable Jock Mackinlay got the longer presentations off to a great start with his highly informative “Science of Visualization.” Drawing from the study of human perception, Mackinlay used simple Tableau illustrations to argue that position, size and color, in that order, are significant determinants of statistical visual quality. He also espoused the use of lattice graphics, or “small multiples,” for visualizing dimensional data.
Two of the analytics presentations I caught, while informative, were less data science and more BI/data warehousing. “The Mining of Eventbrite” by Vipul Sharma, in contrast, showed the best of big data management and data science analytics. The work of the Eventbrite team revolves around recommendations for future events, emphasizing social and interest graph-based methods. Vipul’s presentation illuminated both the handling of 15 million customers with 1.1 trillion graph edges in the Hadoop infrastructure and machine learning with logistic regression on a sparse feature set for classification.
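For readers who have not seen logistic regression on sparse features, here is a minimal sketch in plain Python. To be clear, this is not Eventbrite's code or pipeline: the feature names, the toy examples and the training settings are all hypothetical, chosen only to show the idea. Each example stores just its non-zero features in a dictionary, so the cost of an update depends on how many features are present, not on the size of the full feature space.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(weights, bias, features):
    """Probability of the positive class; absent features contribute nothing."""
    z = bias + sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    return sigmoid(z)

def train(examples, epochs=200, learning_rate=0.5, seed=0):
    """Fit weights by stochastic gradient descent on the log loss."""
    rng = random.Random(seed)
    examples = list(examples)
    weights, bias = {}, 0.0
    for _ in range(epochs):
        rng.shuffle(examples)
        for features, label in examples:
            error = predict(weights, bias, features) - label
            bias -= learning_rate * error
            # Only the features present in this example are touched.
            for name, value in features.items():
                weights[name] = weights.get(name, 0.0) - learning_rate * error * value
    return weights, bias

# Hypothetical toy data: did the customer attend a recommended event (1) or not (0)?
examples = [
    ({"friend_attended": 1.0, "interest:music": 1.0}, 1),
    ({"friend_attended": 1.0}, 1),
    ({"friend_attended": 1.0, "interest:music": 1.0, "same_city": 1.0}, 1),
    ({"same_city": 1.0}, 0),
    ({"interest:cooking": 1.0}, 0),
    ({}, 0),
]

weights, bias = train(examples)
likely = predict(weights, bias, {"friend_attended": 1.0})
unlikely = predict(weights, bias, {"same_city": 1.0})
print(f"friend attended: {likely:.2f}, same city only: {unlikely:.2f}")
```

At anything like the scale described in the talk, one would presumably reach for a distributed or library implementation with regularization; the dictionary-of-non-zeros representation is the part worth noticing here.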
The “press conference” linking key presenters to journalists/media was a personal highlight. Discussion revolved around the role of experts versus analytics in evolving data science. The group debated the value of domain experts in calibrating both inputs and outputs of mathematical optimization, as well as the ageless question of correlation versus causality. While most panelists felt expertise was important to framing hypotheses and guiding the scientific method, the thinking was not unanimous. A minority view held that a rigorous methodology in the absence of domain expertise might be optimal. One rhetorical question left unanswered was what happens when everyone uses big data/analytics: are the benefits of analytics zero-sum?
While Kaggle President Jeremy Howard might question the need for domain expertise to drive data science, he certainly promotes a rigorous methodology and learning framework. For Howard, the steps are to identify objectives, determine what levers can be pulled, find data that links levers to objectives, and evaluate the linkages with algorithms. His is a true scientific framework that deploys randomized experiments to simulate and optimize outcomes. And you’ve just got to love a data scientist who speaks of using randomized experiments as a proxy for divining “counter-factuals.”
I can’t wait for Strata 2013 Bay Area. Maybe next year the Moscone Center in San Francisco?