Much to my chagrin, I had to return home to frigid Chicago late Friday night from three days of delightful weather respite at Strata + Hadoop World in San Jose. I've been to each of the five winter Stratas in Silicon Valley, this the first outside the Santa Clara Convention Center. As well it needed to be, what with over 4,500 registrants, more than triple the number of the inaugural.

On “training” Wednesday, my track of choice was “Hardcore Data Science”. Among the heavily academic presentations were talks on machine learning advances in continuous speech recognition, visual understanding, the search for repeated structures in time series data, graph mining, and tensor methods for large-scale unsupervised learning. A later Netflix talk, “How to Detect Anomalies in High Cardinality Dimensions and Make Them Actionable”, featuring robust principal components analysis, was also outstanding. My mathematical head was spinning.

Facebook data scientist and R author John Myles White lambasted Python and R, probably the most popular DS languages, for design flaws like “over” dynamism that inhibit performance. White counters that Julia, an MIT-developed open source computational language with mature type inference that generates run-time machine code, is much preferred for computing with sizeable data. A year ago, I played with Julia and liked it – though its paucity of community involvement scared me off.

UC Berkeley statistician and AMPLab scientist Michael Jordan weighed in on the statistics-data science divide with his dynamic talk, 'On the Computational and Statistical Interface and “Big Data”'. Jordan's take is that, in the absence of statistical control, big data can lead to big noise interpreted as big signal. With billions of records and millions of features, “I will find some combination of columns that will predict perfectly any outcome, just by chance alone. ...So it’s like having billions of monkeys typing. One of them will write Shakespeare.”
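Jordan's monkeys are easy to reproduce at toy scale. The sketch below is my own illustration, not anything shown in the talk: the function name and parameters are hypothetical, and the data is pure coin flips. It generates a random binary outcome, then checks how well the best of many equally random binary features "predicts" it.

```python
import random


def best_chance_accuracy(n_rows, n_features, seed=0):
    """Best accuracy any purely random binary feature achieves at
    'predicting' a purely random binary outcome."""
    rng = random.Random(seed)
    outcome = [rng.randint(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.randint(0, 1) for _ in range(n_rows)]
        matches = sum(f == y for f, y in zip(feature, outcome))
        # A feature that is always wrong is just as "predictive" once flipped.
        accuracy = max(matches, n_rows - matches) / n_rows
        best = max(best, accuracy)
    return best


if __name__ == "__main__":
    for n_features in (1, 100, 20000):
        print(n_features, best_chance_accuracy(10, n_features))
```

With only 10 rows, any single random feature matches the outcome perfectly (directly or flipped) with probability 2/1024, so searching 20,000 of them all but guarantees a "perfect" predictor that carries no signal at all. Real data sets have far more rows, but also far more combinations of columns to search.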

To Jordan, computer science is wanting in consideration of risk – the error bars – while statisticians lack focus on runtime performance. The answer is, of course, the rapprochement of statistics and computation – i.e., data science. As an illustration of such coordination, Jordan offers the “bag of little bootstraps” (BLB), which elegantly combines statistical and computational contributions for solutions to big data inference.
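For readers who have not met the bag of little bootstraps, here is a minimal sketch of the idea as I understand it, applied to the standard error of a mean. It is not Jordan's code: the function name, the default parameters, and the choice of the mean as the estimator are all my assumptions. The method draws a few small subsets of the data, resamples each one back up to the full sample size n, measures the spread of the estimates within each subset, and averages those spreads.

```python
import random
import statistics
from collections import Counter


def blb_standard_error(data, n_subsets=5, subset_exponent=0.6,
                       n_resamples=50, seed=0):
    """Bag-of-little-bootstraps estimate of the standard error of the mean."""
    rng = random.Random(seed)
    n = len(data)
    b = max(2, int(n ** subset_exponent))  # size of each small subset
    per_subset_errors = []
    for _ in range(n_subsets):
        # Small subset of b distinct points, drawn without replacement.
        subset = rng.sample(data, b)
        estimates = []
        for _ in range(n_resamples):
            # Resample n points from the subset, stored as counts per point,
            # so each estimate touches only b distinct values.
            counts = Counter(rng.choices(range(b), k=n))
            weighted_mean = sum(subset[i] * c for i, c in counts.items()) / n
            estimates.append(weighted_mean)
        # Spread of the resampled estimates within this subset.
        per_subset_errors.append(statistics.stdev(estimates))
    # Average the per-subset assessments into one answer.
    return statistics.mean(per_subset_errors)


if __name__ == "__main__":
    rng = random.Random(1)
    sample = [rng.gauss(0, 1) for _ in range(2000)]
    print(blb_standard_error(sample))  # compare with 1 / sqrt(2000)
```

The computational appeal is that each subset can be processed independently and the estimator only ever sees b distinct points with weights. This toy version still spends time proportional to n generating the counts; a serious implementation would presumably draw the multinomial counts directly.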

I wish I'd seen Michelangelo D'Agostino's pertinent presentation, “The Two Cultures of People Science”, before I'd posted my last blog, “Predictive Analytics or Data Science?”. D'Agostino contrasts the methods and approaches of “data scientists” with those of “social scientists” on project teams he manages. The social scientists are more concerned with theory/hypotheses, methodology, surveys, experimental design, inference, and causal analysis than the data scientists, who obsess over data exhaust, A/B testing, machine learning, out-of-sample predictions, and software. The distinction reminds me of the characterization of statistical science by Stanford professor Brad Efron in an interview a few years back: 'If data analysis were political, biometrics/econometrics/psychometrics would be “right wing” conservatives, traditional statistics would be “centrist,” and machine learning would be “left-leaning.” The conservative-liberal scale reflects how orthodox the disciplines are with respect to inference, ranging from very to not at all.' D'Agostino's players have put their egos aside to make cohesive teams where the statistical whole exceeds the sum of the parts.

The Spark system for analytic cluster computing was ubiquitous at the conference. There were end-to-end tracks on each of the three days, in addition to presentations by companies, consultancies, and Spark-related vendor Databricks. Frenetic MIT professor and Databricks executive Matei Zaharia keynoted “New Directions for Spark in 2015”, observing that Spark continues its progression as a big data/analytics platform of choice and is now the leader in Apache project contributors. Spark scientists can look forward to Python and R-like dataframe support in soon-to-be-released version 1.3. Version 1.4, to appear this summer, will offer an API for R.

Kurt Brown provided the 2015 update to last year's talk on the Netflix data and analytics platform. Brown calibrated the Spark hype, opining that the 2013 version was unusable but that 2015 has brought big improvements and, with them, an enhanced role at Netflix. On the plus side, he tags Spark as a cohesive big data/analytics environment with multiple language support, substantial performance gains, a strong community, and a sanguine future. The current negatives include immaturity, multi-tenancy/concurrency concerns, problems with shuffling/cascading, tuning, and Netflix's existing investments in Pig/Python versus Scala. Overall, though, Brown's very bullish on the future of Spark at Netflix.

DS for public good was also a major theme at the conference. Luminary data scientist DJ Patil was recently appointed chief data scientist for the federal government. Introduced in a recording by President Obama, Patil extols the data-driven culture of the current administration, noting as a source of data sets and citing an executive order mandating that all go-forward government data will be open and machine readable. His priorities include precision medicine, building government data products, and promoting responsible data policy. Ultimately, “Data-driven government responsibly gathers, processes, leverages, and releases data in a timely fashion to enable transparency, create efficiencies, provide security, foster innovation.”

Berkeley Public Policy professor Solomon M. Hsiang regaled the audience with a fast-paced empirical analysis of the relationship between global climate change and economic/political instability. His take is that computation and data-driven policy design is a major tool of innovation in the technology of governance. The good news is that data-driven public policy appears to be an emerging field in academia.

By far the most entertaining vendor keynote was “Connected Cows?” by Microsoft's Joseph Sirosh. Dairy cows on farms in Japan wear pedometers, and their steps are monitored in a cloud-based system to predict disease and estrus. A service then analyzes the data and sends pertinent alerts to participating farmers.

It turns out that changes in female cow night pacing are a reliable indicator of optimal mating times in their very short cycle. Moreover, cows mated early in that identified period are more likely to bear female progeny, while those mated in the latter stage are more likely to bear males. Artificial intelligence meets artificial insemination.

Strata-Hadoop World Silicon Valley V is in the books. I look forward to next year.
