Sadly, I had to return to Chicagoland last Thursday afternoon from Strata 2014 in Santa Clara. The pleasant Silicon Valley weather was a much-needed respite from the severe Midwest winter. And Strata 2014 was plenty good too.
Like most of the record 3,100 attendees, I participated all three days: day one of tutorials, and days two and three of short keynotes and 40-minute sessions. My OpenBI colleagues and I had a number of conference objectives revolving around affirming the latest and greatest big data/analytics technologies and methods. The consensus amongst us was mission accomplished.
I went into the tutorials looking for corroboration of a “macro” platform for data science as well as a “micro” tool I could use in my work. I think I found both.
Last year, I wrote on the Strata 2013 excitement surrounding BDAS, the Berkeley Data Analytics Stack under development at the UC Berkeley AMPLab. That enthusiasm has escalated this year, as BDAS begins to achieve its original goals:
- To combine the now-disparate handling of batch, interactive and streaming data into a single execution engine,
- To readily accommodate sophisticated machine learning algorithms,
- To be compatible with the existing Hadoop ecosystem, and
- To outperform Hadoop by at least half an order of magnitude.
The morning overview and afternoon hands-on session left me guardedly optimistic that BDAS will in short order provide a one-stop platform for big data/analytics computation. Most impressive is the expansion of last year's base Spark/Shark/Spark Streaming environment to include Tachyon, a fault-tolerant distributed in-memory file system; BlinkDB, a distributed query database that delivers ultra-low-latency answers by returning approximate, error-bounded results; MLbase, a package for implementing and consuming machine learning algorithms at scale; and GraphX, which extends Spark with a new graph API leveraging recent advances in graph systems. That both BDAS tutorial sessions were filled to capacity and that the recent Spark Summit in San Francisco attracted 450 participants are also telling. Icing on the cake is the recent Yahoo success story deploying Spark/Shark.
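To make the idea of approximate, error-bounded results concrete, here is a minimal sketch in plain Python. It is not BlinkDB code and uses none of its APIs; it only illustrates the general trade of answering an aggregate query from a random sample and reporting a margin of error alongside the estimate. The function name `approximate_mean`, the synthetic data, and the sample size are all assumptions made for illustration.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean of `values` from a simple random sample.

    Returns (estimate, margin), where margin is the half-width of an
    approximate 95% confidence interval under a normal approximation.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    # Standard error of the mean, from the sample standard deviation.
    margin = z * statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, margin


# Hypothetical data set: synthetic measurements, not real conference data.
data_rng = random.Random(42)
population = [data_rng.gauss(100.0, 15.0) for _ in range(200_000)]

# Scan only 5% of the rows and report the answer with its error bound.
estimate, margin = approximate_mean(population, sample_size=10_000)
print(f"approximate mean: {estimate:.2f} +/- {margin:.2f}")
```

The design point is that the caller chooses how much data to scan and gets back an explicit bound on the error, rather than waiting for an exact answer over the full data set.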
So much for macro. My search for a simple, personal Hadoop programming environment was driven by two requirements: 1) no MapReduce and 2) minimal Java coding. In time Spark-Python will do the trick, but for now, Scalding, a Scala “DSL” (domain-specific language) that wraps Cascading and provides an intuitive, boilerplate-free API for writing MapReduce jobs, appears attractive. LinkedIn data scientist Vitaly Gordon provided a quite comprehensible introduction, so good in fact that I'm confident I can become proficient in the functional Scala metaphor without much ado.
For me, this year's analytics focus assumed a feature selection theme. Stanford professor Christopher Ré used a street artistry metaphor as the departure point for his talk on predictive modeling. With analytics, pure artists such as Banksy are rare; methodical pluggers like Mr. Brainwash are much more the norm. Analytics would thus be better served obsessing less on algorithm art and more on the plodding search for predictors of outcomes of interest.
Berkeley and University of Washington professors/Trifacta execs Joe Hellerstein and Jeffrey Heer discussed challenges/approaches to data wrangling/transformation of early-phase data. Their foci are data trust, usability, credibility and usefulness. The Trifacta technology platform uses a “Predictive Interaction” approach that draws on visualization and learning to help analysts discern patterns in data. For me, a high-priority use case for this software is the formulation of features to be used in subsequent predictive models.
Olivier Grisel's talk on Predictive Modeling in the Cloud with scikit-learn was especially pertinent for OpenBI as we start to use Python more for data science. Grisel worked through several examples, showing how to use parallel computation on the Cloud to deliver performant ML models. Feature selection from what is often thousands of possibilities is a central modeling topic.
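As a rough illustration of the feature selection problem, here is a small sketch in plain Python. It is not Grisel's code and does not use the scikit-learn API; it hand-rolls a simple univariate filter that scores each candidate feature by its absolute Pearson correlation with the outcome and keeps the top few. The function names, the synthetic data, and the worker count are all assumptions. The thread pool only illustrates the fan-out pattern of scoring features independently; in CPython, threads will not speed up CPU-bound work, so a real deployment would hand the same per-feature tasks to separate processes or machines.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_top_features(columns, target, k, workers=4):
    """Return the indices of the k columns most correlated with target."""
    # Each feature is scored independently, so the work fans out cleanly.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(lambda col: abs(pearson(col, target)), columns))
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Hypothetical data: 200 noise features, of which only two drive the outcome.
rng = random.Random(7)
n_rows, n_features = 500, 200
columns = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_features)]
target = [
    3.0 * columns[3][i] - 2.0 * columns[17][i] + rng.gauss(0, 1)
    for i in range(n_rows)
]

selected = select_top_features(columns, target, k=2)
print("selected feature indices:", selected)
```

A univariate filter like this is only the crudest option: it ignores interactions between features, which is one reason the search for good predictors tends to be plodding work.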
Vanguard data-driven companies Netflix, Facebook and Yahoo gave peeks inside their evolving technology portfolios. Netflix is going all-in on the Cloud with AWS. And all three are committed to open source software, making large bets on Hadoop SQL platforms like Presto and Shark at the expense of relational/analytic database technologies.
Several keynotes were outstanding. Crossing the Chasm author Geoffrey Moore updated his 25-year-old research for the digital world of 2014. Two timeless principles: target a beachhead market segment and commit to building an entire product. Today, the focus is consumer rather than enterprise IT. Big data visionaries Amazon and Google pose existential threats to retail and advertising.
Historian and futurist nonesuch James Burke engaged the audience in a 30-minute tour de force lecture, The Future Isn't What it Used to Be. Elucidated by networking software, Burke's accounts of seemingly random social connections that impacted history dazzled.
Street skateboarding icon Rodney Mullen regaled the audience with his talk The Art of Good Practice. Ever the scientist, Mullen tunes his craft with a methodical, experience-driven approach. He attributes much of his success to the systematic application of quasi-experimental learning in practice. “The better we tune our practice, the more practice will make perfect.”
Digital Media expert and cognitive scientist David McRaney examined biases that subvert evidence-based behavior as well as the art of luck in his fast-paced talk, Survivorship Bias and the Psychology of Luck. Studying successful companies is generally not very illuminating, since the most important advice revolves around what not to do, and that is exactly what is missing from the “winners”. McRaney opines that prosperous companies have many moments where luck is all that separates their blessings from calamities. Luck, though, should be seen as the conscious interaction with chance. Those whose decision-making portfolio is diversified, who are set up for “serially avoiding catastrophic failure while routinely absorbing manageable damage”, will likely be luckiest. Nassim Taleb would agree.
After four conferences at the Hyatt Regency and Convention Center in Santa Clara, Strata has become a victim of its own success, outgrowing the comfortable venue. The next Strata will be February 18-20, 2015 at the San Jose Convention Center. As always, I look forward to it.