One would obviously expect Hadoop to dominate the discussions at the recent Strata & Hadoop World conference in San Jose, CA. But much of the buzz this year was around Apache Spark, and how Spark might fit into the data management strategies of many organizations.

Arno Candel, PhD, Chief Architect of H2O.ai, shared his observations with Information Management on what conference attendees were most interested in, and how those needs are influencing his company’s go-to-market strategies.


Information Management: What are the most common themes that you heard among conference attendees and how do those themes align with what you expected?

Arno Candel: Many of the people I spoke with were interested in how Spark can, or would, fit into their overall data management and analytics strategy. That surprised me a little. We at H2O.ai have been seeing increasing interest in Spark, which was one of the reasons that we built out Sparkling Water, our Spark API. But I’ve always thought of Strata as a Hadoop conference; it is, after all, merged with Hadoop World.

It’s now clear that data storage is essentially a solved problem, while in-memory analytics and machine learning are driving most of the ongoing work in the field. We see ourselves as very much aligned with this trend.


IM: What are the most common data challenges that attendees are facing?

AC: Turning data into actionable insights has been, and remains, a key challenge for many organizations. Everyone has been told that they need to store more and more information in data stores like Hadoop, but there is often a lack of a plan for the “day after.” What do organizations do once they’ve stored all their data in a data lake? They realize that they need some kind of analytics strategy, but aren’t sure exactly what that should look like.

In addition, there is a huge problem with regard to data cleansing; much of the data that organizations have stored is messy, has missing variables, etc., and organizations need to find a way to deal with that. Also, many existing machine learning solutions don’t scale well to large datasets, or don’t have enough features to be practical. That’s where we put a lot of emphasis at H2O.ai.


IM: What are the most surprising things that you heard at the conference?

AC: What was heartening to me was the extent to which organizations expressed a willingness to try new technologies and to support open source alternatives. There was a time when the enterprise viewed open source technologies with suspicion and felt that they would be a risk to their existing IT infrastructures. I think the growth of events like Strata, Hadoop World, Spark Summit, etc. proves that’s no longer the case. Open source is the new default.


IM: What does your company view as the top data issues or challenges in 2016?

AC: Operationalizing data science is our overwhelming focus. I think people realize that there just aren’t enough qualified data scientists to go around - we have to be more efficient with their time.

Between the DevOps and data cleansing work that data scientists have to do, they’re not spending most of their time doing actual data science. Data scientists aren’t happy because they have to do so much work on the backend and developers aren’t happy because they have to wait ages for data scientists to add machine learning algorithms to their applications.

We are working on bridging the gaps on both sides: making data scientists more productive and giving developers access to fully automated machine learning solutions.


IM: How do these themes and challenges relate to your company’s market strategy this year?

AC: We plan on addressing the challenges that data scientists and developers face head-on. Matt Dowle, the author of the popular R package ‘data.table’ for data munging, has joined H2O.ai and is helping us build out H2O’s munging capabilities. Leland Wilkinson, the author of The Grammar of Graphics, has also joined to help improve H2O’s visualization capabilities. In addition, we’re working on developing Steam, our data science hub that will take the speed, scalability and accuracy of H2O and combine it with new and improved functionality designed to automate the bulk of the DevOps work that data science requires.
