Having flown from blizzard-in-waiting Chicago to the Bay Area on Monday, January 31, I would have been ahead of the game even if the O’Reilly Strata Conference in Santa Clara had been a bust.
As it turns out, not only did I avoid the ugly Midwest storm, I was able to revel in the delightful Northern California winter weather while enjoying the opportunity to participate in a bellwether conference on data science.
Strata was a complete sellout at the Santa Clara Convention Center, a remarkable accomplishment for a version-one event. There were over 1,400 participants, 13 percent of whom were from outside the U.S. It appears the success of V1 will make Strata a semi-annual affair, with the next conference scheduled for September 19-21 in New York.
O’Reilly Media is, of course, no stranger to success in its core space of technology education and communication. Long the leader in software and methodology how-to books, the company has now become an industry-shaping force under charismatic leader Tim O’Reilly. In addition to Strata, O’Reilly’s MySQL, Web 2.0 and open source (OSCON) conferences have been enormously successful.
Making Data Work was all about the emerging discipline of data science, a short description of which is “telling stories through data” – for business benefit. A more comprehensive discourse is available from the excellent O’Reilly paper: “What is Data Science?” The conference bulletin board was testimony to the demand for data scientists (see image below). Indeed, many of the presentations concluded with solicitations for qualified DS practitioners. Data is indeed the new Intel Inside.
The conference gets its name from William Smith, an 18th-century British scientist who developed a stratified geologic map of the UK and subsequently earned the nickname Strata. His work helped to shape the economic and scientific development of Britain at the time of the Industrial Revolution. The hope is that this conference will similarly shape the imminent Data Revolution.
I participated in two pre-conference tutorials, the first on how to use Hadoop to develop big data applications, the second on the Apache Cassandra database in action. At first I was a bit put off by the company marketing of Hadoop presenters Karmasphere, Amazon and Concurrent. But once I understood the value their products add to the Hadoop ecosystem, I was more than OK. The presenters did an outstanding job describing the Hadoop landscape, and the hands-on lab, after a shaky start, was well received. Ditto for the work of Cassandra presenter Jonathan Ellis of DataStax. Ellis’s comprehensive architecture discussion of Cassandra, especially in contrast with traditional RDBMSs, was most informative. It reminded me of my old Oracle days.
Conference chairs Edd Dumbill and Alistair Croll tipped off the show each morning by introducing the lightning-speed keynotes. James Powell of sponsor Thomson Reuters discussed privacy concerns with behavioral data in the B2B environment. He noted special issues surrounding the loss of context with multiple devices. For bit.ly’s Hilary Mason, data opportunities include “narcissism” – what I need to know, segmentation over time, location and topic, and global data. She showed her near-real-time graphic of Internet access in Egypt. Mark Madsen spoke of the evolution of data from product, to byproduct, to asset and now to substrate – the basis for competition. He warned that data science toolmakers’ success is measured by users. Amazon CTO Werner Vogels opined that big data has no limits by definition. His data framework? Collect-Store-Organize-Analyze-Share.
Day 2’s keynoters were no less informative. Journalist Simon Rogers asked whether we’ll become a government of digital democracy and fretted that the expectation in the legal world will be the recording of all behavioral exhaust. Data Warehousing pioneer Barry Devlin argued that the coherence of traditional IT-led DW must be integrated with the innovative chaos of data science. LinkedIn’s DJ Patil presented connectivity figures for Strata, noting that 189 participants were connected to 5 or more other “Stratans.” He then introduced LinkedIn’s new skills app, showing visualizations depicting relationship clusters. For Greenplum’s Scott Yara, the data science call to action revolves around the power of education and startup innovation.
Former actuary and now health care analytics expert Carol McCall thinks that even solving “hot spot” problems with the a priori understanding that data quality is poor can help health care immeasurably. Noting you can’t prevent what you can’t predict, she described a successful study of adverse drug events for Humana Medicare patients that would prospectively lead to savings on the order of existing company profits. Only the mini keynote of EnterpriseDB CEO Ed Boyajian fell flat, the cost-savings-alone message of his commercial open source product not compelling enough for this audience.
Longer presentations were generally well received too. Matt Biddulph of Nokia discussed prototyping with data in search of new product opportunities. His framework for evaluating opportunities is a triangle with vertices Novelty, Desirability and Fidelity – the prototype’s similarity to what would actually be deployed. The location of a new concept within that triangle will help data scientists explore and discover. He also recommends looking for data outside the company and building high-quality exploration tools to beat data with the “insight stick.”
Phillip Kromer addressed big data for lean startups, drawing on experience with his data marketplace company, infochimps.com. His recommendations include optimizing on the people side of the business, often the most expensive line item. Kromer prefers a roll-your-own approach to talent acquisition, in many cases going with recent university graduates who have a high ceiling, along with a passion to learn and a “get sh— done” attitude. He prefers to lease before he buys, testing out potential hires first as contractors. His mentoring technique? Optimize for programmer joy using tools like Ruby and R on top of a Hadoop and Cassandra infrastructure, and give new hires ownership of a list of 5-7 current tasks so they can “fail in parallel” and thus promote teachable moments. OpenBI agrees with Kromer’s approach to acquiring talent, having hired and developed university students from the company’s outset.
MAD Skills, the title of a presentation by Brian Dolan and UC Berkeley professor Joe Hellerstein, is a slang reference to “talent, multi-dimensional expertise, or skills that are extraordinarily better than most.” In the context of data science, MAD represents Magnetic (i.e. attractive), Agile and Deep. In contrast to traditional BI, MAD is more focused on statistical questions of the data. Indeed, Hellerstein has developed madlib.net, an open source machine learning and statistical library that’s embedded in Postgres and Greenplum databases. Faculty of both Berkeley and Stanford are collaborating on additional MAD open source projects.
If I were starting my career in 2011, I’d look long and hard at data journalism as a point of departure. Technology writer Marshall Kirkpatrick moderated a discussion panel that included The Guardian’s Simon Rogers and Jer Thorp of the New York Times. Kirkpatrick deftly demoed Needle, a “revolutionary platform for acquiring, integrating, cleansing, analyzing and publishing data on the web.” Rogers illustrated The Guardian’s data emphases with geocoded visualizations of World War II air raids in London. He also highlighted work on assembling and analyzing expense records of Members of Parliament in Britain. Thorp introduced an internal visualization tool that allows the Times to analyze how its content gets shared, illustrating with data from a blog by economist Paul Krugman.
I’ll soon post another Strata blog where I look at recurring conference themes and suggest improvements for V2.