I sat in the back of the cavernous meeting hall for the second day of Strata + Hadoop World keynotes. A good thing, too: I unknowingly minimized my risk of becoming fodder for author, radio host, and comedienne Paula Poundstone.

Her “presentation”, Nonsense Science, was really nothing more than a comedy club routine, though it was far and away #1 in my talk rankings. Pity the poor dupes she chose to “pound”, including an O'Reilly planner, an IBM blogger, and a Cloudera sales rep. Poundstone lambasted the ascent of the “flat thing”, Siri, bloggers, the ubiquity of Google, and the Cloud in no uncertain terms, defending the use of f-bombs along the way. Spend 23 minutes enjoying her routine.

Who better to celebrate the 10th anniversary of Hadoop than early driver, now Cloudera Chief Architect, and Apache Software Foundation board member, Doug Cutting? In the 10 years since Hadoop appeared, hardware has become commoditized and development methods have become more iterative. Software platforms have trended from enterprise proprietary to open source, and the two communities are converging around standards.

The Hadoop ecosystem has been the beneficiary of these trends, while simultaneously setting the standard for future work. Now Spark is replacing MapReduce as the compute engine for many, though the two coexist painlessly and will continue to thrive together as new “competitors” emerge. Indeed, expect more improvements of this kind as all industries become data-driven to their core. Open source will fuel the software stack, even as solid state storage and other improvements advance hardware. And of course the hardware and software stacks will increasingly adapt to Cloud disruption.

Presentations highlighted people learning as well as machine learning. Julia Galef of the Center for Applied Rationality noted that educated and successful people often make predictable errors in judgment, misinterpreting evidence and not learning from mistakes. Her organization aims to redress this shortcoming by turning cognitive science into cognitive practice. One best practice? Develop the Bayesian habit of updating the probabilities we assign to our beliefs in proportion to the strength of the evidence. Look for inconsistencies in evidence and ask what we'd expect to see if our hypothesis about the state of the world were false. I can think of many people who'd benefit from Galef's recommendations.
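The Bayesian habit Galef recommends can be made concrete with a few lines of arithmetic. This is only a minimal sketch, not anything shown in the talk: the function name `update_belief` and all of the probabilities are made up for illustration. It applies Bayes' rule twice, once to strong evidence and once to weak evidence, starting from a 50/50 belief.

```python
def update_belief(prior, p_evidence_if_true, p_evidence_if_false):
    """Return P(hypothesis | evidence) using Bayes' rule.

    prior               -- belief in the hypothesis before seeing the evidence
    p_evidence_if_true  -- chance of seeing this evidence if the hypothesis holds
    p_evidence_if_false -- chance of seeing it if the hypothesis is false
    """
    numerator = prior * p_evidence_if_true
    denominator = numerator + (1.0 - prior) * p_evidence_if_false
    return numerator / denominator


# Hypothetical numbers: start undecided, then see two pieces of evidence.
belief = 0.5
observations = [
    (0.8, 0.2),  # strong evidence: four times likelier if the hypothesis is true
    (0.6, 0.5),  # weak evidence: almost as likely either way
]
for p_if_true, p_if_false in observations:
    belief = update_belief(belief, p_if_true, p_if_false)
    print(round(belief, 3))
```

The second argument pair is the "what would we expect to see if the hypothesis were false" question in numeric form: evidence that is nearly as likely under either state of the world should barely move the belief.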

AMPLab Director Michael Franklin discussed the latest directions for BDAS, the Berkeley Data Analytics Stack, a top-to-bottom DS platform and progenitor of Spark. Among current initiatives are Succinct, at the storage level, for queries on compressed data; Velox, for low-latency, personalized model serving; KeystoneML, an interface for users who wish to specify simply the what, not the how, of ML modeling, much as SQL does for data access; and AMPCrowd, for data cleaning. The latest personal direction for Franklin? Leaving his UC Berkeley professorship for a position at the University of Chicago.

Microsoft's Joseph Sirosh teamed up with Stanford neurologist/neuroscientist Kai Miller in a sci-fi-like presentation, Connected Brains, in which the Cloud and machine learning help to identify, capture, and classify signals from sensors on the brain surface to determine perception and thought. In Miller's world, neuroscience and data science collaborate to provide the foundation for a new generation of prosthetics. He thinks it not far-fetched that we will soon be able to re-wire and rehabilitate an injured brain, and feels big data, machine learning, and the Cloud will fundamentally enable that dream.

There are clever applications of data science, there are mind-boggling applications of data science, and then there are haunting applications of data science. Megan Price of the Human Rights Data Analysis Group discussed using DS to come up with accurate estimates of war crimes. The group's charter is to convincingly establish the number of victims in war areas such as the former Yugoslavia or Syria.

The first step in the HRDAG methodology involves heavy-duty data wrangling: record linkage and duplicate detection, using algorithms such as hierarchical clustering to help identify unique victims. ML techniques also assist in consolidating disparate victim lists into a more accurate aggregate. One finding: it is during the most violent periods that statistical analysis is most critical.
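To give a feel for what duplicate detection by hierarchical clustering looks like, here is a toy sketch. It is not HRDAG's pipeline: the names are invented, the string distance (Python's `difflib.SequenceMatcher`) and the 0.2 merge threshold are assumptions chosen for illustration, and real record linkage would compare many fields besides a name. The sketch uses single-linkage agglomerative clustering, repeatedly merging the two closest groups until no pair is close enough.

```python
from difflib import SequenceMatcher


def distance(a, b):
    """String distance in [0, 1]; 0 means identical (case-insensitive)."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def single_linkage_clusters(records, threshold):
    """Agglomerative clustering: merge the two closest clusters until the
    closest remaining pair is farther apart than `threshold`."""
    clusters = [[r] for r in records]  # start with one cluster per record
    while len(clusters) > 1:
        best = None  # (distance, index_i, index_j) of the closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: cluster distance is the closest member pair.
                d = min(distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break  # nothing left that looks like the same person
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


# Hypothetical entries drawn from several overlapping lists.
records = ["John Smith", "Jon Smith", "Maria Lopez", "Maria Lopes", "Ahmed Karim"]
clusters = single_linkage_clusters(records, threshold=0.2)
for group in clusters:
    print(group)
print("estimated unique individuals:", len(clusters))
```

Each resulting group stands for one presumed individual, so the number of groups is the deduplicated count. The threshold is the sensitive choice: set it too loosely and distinct people are merged, too tightly and spelling variants are counted twice.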

Venture capitalist Michael Dauber, along with data scientists Yael Garten, Monica Rogati, and Daniel Tunkelang, participated in a roundtable discussion on how to build data science teams entitled “Data science teams: Hold out for the unicorn or build bands from steeds?” Acknowledging that DS challenges often entail healthy doses of engineering and business analysis, the panelists answered the question of which skills matter most with a heavy dose of “it depends.”

Ditto for the question of which internal structure, separate groups or membership in larger product teams, best promotes the DS function. There did seem to be consensus that generalists are better suited to the early stages of data science within an organization, while the need for specialists grows with the size and maturity of the group. Tunkelang struck a chord with his observation that computational social science is a good breeding ground for aspiring data scientists, but I was a bit surprised that the group was not familiar with the term “citizen data scientist”.

Attendance this time in San Jose was in excess of 5,000, in contrast to the slightly more than 1,000 at the 2011 show in Santa Clara. Strata + Hadoop World seems now a franchise that shows little sign of plateauing. See you next year.
