Big Data and Caution at Strata 2013
Strata 2013 is now in the books, the largest and most successful edition yet. There were 2,900 registrants in 2013, compared to 2,500 last year and 1,400 at the inaugural conference in 2011.
Perhaps even more telling, 90 exhibitors touted their wares in 2013, and at least half of their products were part of the Hadoop ecosystem. I can’t prove it with data, but my impression is that the age distribution of Strata participants is trending up, with more management types attending as data science and big data become mainstream.
I generally found value in the “small plate” plenary sessions that kicked off the conference Tuesday and Wednesday mornings. The 10-minute vendor keynotes were uneven, though. In “Xbox Data is XXL,” Dave Campbell provided an informative, one-year-later update on data science from the Microsoft perspective. He also introduced Microsoft’s Data Explorer for Excel, a nifty tool for building data sets from disparate sources such as Web pages. The talk by SAP on its new HANA In-Memory Computing solution, in contrast, was little more than an infomercial for the product. I could have heard the same pitch at the company’s exhibit booth.
The most provocative predictive analytics talk was given by Eric Colson of clothier Stitch Fix. Stitch Fix uses recommendation engine technology similar to that of Netflix to propose garments for its customers – to “pick out items they think you’ll love – sometimes a little out of your comfort zone, but that’s part of the fun.” At Stitch Fix, the human side – the stylists – works from the analytical blueprints suggested by data and algorithms to refine recommendations for customers.
A second intriguing talk centered on Code for America, “A Peace Corps for Geeks,” which selects dozens of fellows annually for one-year tours of duty helping governments use data and technology to improve service. Jennifer Pahlka, in her presentation “Moneyballing Government,” reported on “using data to unclog the criminal justice system in Louisville and New York City.” Analytics aimed at helping manage the large cost of pre-trial incarceration revealed that those denied bail were five times more likely to be convicted than those granted it. The analysis also found that suspects with small bonds were often unable to secure release, apparently because such bonds are of little interest to profit-seeking bail bondsmen. These petty suspects, held without release, are then at risk of graduating to more serious crime.
A consistent theme of the keynotes was caution about over-exuberance with the data-driven world. RPI professor James Hendler fretted over the absence of shared semantics and metadata for increasingly available Web data sets in his keynote “Broad Data.” Twitter’s Nathan Marz lamented the pervasive problem of human error in his “Human Fault Tolerance” presentation. A critical component of Marz’s solution? Immutable Data – no updates or deletes, just a history of time-stamped records to chronicle change in values. Data warehouse designers will recognize this as the Type II solution to slowly changing dimensions.
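To make the immutable-data idea concrete, here is a minimal sketch of my own in Python. It is not code from the talk, and the `Record` and `AppendOnlyStore` names and the customer example are hypothetical. It assumes the simplest possible design: an in-memory, append-only log in which a "change" is just a new time-stamped record, and the value at any moment is derived by reading the history.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)  # frozen: a record cannot be modified once created
class Record:
    key: str
    value: str
    recorded_at: datetime


class AppendOnlyStore:
    """Hypothetical store with no update or delete, only append."""

    def __init__(self) -> None:
        self._log: List[Record] = []

    def record(self, key: str, value: str, recorded_at: datetime) -> None:
        # A change in value is expressed as a new record, never an overwrite.
        self._log.append(Record(key, value, recorded_at))

    def history(self, key: str) -> List[Record]:
        # Full chronicle of values for a key, oldest first.
        matches = [r for r in self._log if r.key == key]
        return sorted(matches, key=lambda r: r.recorded_at)

    def value_as_of(self, key: str, when: datetime) -> Optional[str]:
        # Latest value recorded at or before `when`, or None if there is none.
        known = [r for r in self.history(key) if r.recorded_at <= when]
        return known[-1].value if known else None


store = AppendOnlyStore()
store.record("customer-42/city", "Chicago", datetime(2012, 1, 1))
store.record("customer-42/city", "Denver", datetime(2013, 2, 1))

print(store.value_as_of("customer-42/city", datetime(2012, 6, 1)))  # Chicago
print(store.value_as_of("customer-42/city", datetime(2013, 3, 1)))  # Denver
```

Because nothing is ever overwritten, a mistaken write can be corrected by appending another record, and the earlier state remains available for auditing. This is the same property a Type II dimension table buys with its effective-date columns.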
And social scientist Kate Crawford of Microsoft Research and the MIT Media Lab, in her presentation “Algorithmic Illusions: Hidden Biases of Big Data,” was fearful of blind acceptance of the objectivity of big data. She offers as evidence a mobile app available in Boston to identify potholes for the city, its accumulated data biased in the direction of oversampling the young and affluent. According to Crawford, analysts must ask how and why in addition to how many. Her ultimate compromise: a hybrid, counter-balancing methodology of computational and qualitative techniques.
Sometimes simple is best. Michael Bailey of Facebook gave a practical talk on statistical forecasting, covering basic topics such as moving average, exponential smoothing, seasonal decomposition, regression, autocorrelation, Box-Cox transformations and ARIMA. Bailey’s delivery was engaging, his content accessible. A great overview – and review.
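For readers who want a feel for the two simplest techniques on that list, here is a small illustrative sketch in plain Python. It is my own example, not material from Bailey’s talk. The function names, the `weekly_signups` data and the choice of `alpha` are all assumptions made for illustration.

```python
from typing import List, Sequence


def moving_average(series: Sequence[float], window: int) -> List[float]:
    """Trailing moving average: one value per full window of observations."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[end - window:end]) / window
        for end in range(window, len(series) + 1)
    ]


def exponential_smoothing(series: Sequence[float], alpha: float) -> List[float]:
    """Simple exponential smoothing, seeded with the first observation.

    Each smoothed value is alpha * observation + (1 - alpha) * previous
    smoothed value, so a larger alpha reacts faster to recent data.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if not series:
        return []
    smoothed = [series[0]]
    for observation in series[1:]:
        smoothed.append(alpha * observation + (1 - alpha) * smoothed[-1])
    return smoothed


# Made-up weekly counts, purely for illustration.
weekly_signups = [10, 20, 30, 20, 40]

print(moving_average(weekly_signups, window=3))          # [20.0, 23.33..., 30.0]
print(exponential_smoothing(weekly_signups, alpha=0.5))  # [10, 15.0, 22.5, 21.25, 30.625]
```

With simple exponential smoothing, the last smoothed value doubles as the one-step-ahead forecast. Seasonal decomposition and ARIMA need more machinery than fits in a few lines, and a statistics library is the better route for those.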
Rachel Schutt was at it again with her presentation “Next-Gen Data Scientists.” Now employed at Johnson Research Labs, Schutt, a Ph.D. statistician, is also on the faculty of Columbia University, where she taught a course, “Introduction to Data Science,” in the fall. She shared her unique perspectives on adapting from her academic statistical background to the very different role of data scientist at Google, and on what she learned as the instructor of a large class of enterprising graduate students at Columbia.
I came away from Strata with a few new freely available tools for my data science arsenal. The beta Data Explorer for Excel mentioned earlier is quite promising for automating data scraping from the web, XML and other sources, as is Google’s Fusion Tables, championed by Guardian data journalist Simon Rogers, for “collaborating, visualizing and sharing.” And then there’s the IPython Notebook, presented as a data science tool by computational physicist Brian Granger of Cal Poly San Luis Obispo. Notebook enables storytelling with code and data, combining “code execution, text, mathematics, plots and rich media into a single document.” Already an IPython user, I’ve had access to Notebook on my computer for quite a while without knowing it.
Finally, I circled back to BDAS, the Berkeley Data Analytics Stack, in a joint presentation by ClearStory Data and UC Berkeley, “Beyond Hadoop and MapReduce: Interactive Insights Using Spark.” BDAS has become an enabling platform for ClearStory’s mission to converge insights across multiple web, public, private and premium data sources with live situational speed, low latency, high interactivity and automated analytics. Even as a version one product, BDAS is delivering value to the business world – and this bodes well for the next generation of data science.