While there was strong interest in the Spark and Kafka data analytics platforms at this year’s Strata World conference in San Jose, CA, the central theme was, of course, Hadoop.

Many show attendees wanted help with real-time data management, a theme that several exhibitors at the event shared with Information Management. Arvind Prabhakar, CTO and cofounder of StreamSets, shared his thoughts on what data professionals are struggling with.


Information Management: What are the most common themes that you heard among conference attendees and how do those themes align with what you expected?

Arvind Prabhakar: There were two trends we picked up on at the show. The first is the move towards real-time use cases in Hadoop, with a specific need to pre-process the data so it is consumption-ready when it lands in Hadoop. The hot technologies for enabling this were, not surprisingly, Kafka and Spark.

Second, there was a lot of discussion about data architecture modernization. Specifically, once Hadoop use cases are in production there is recognition that too much time and energy are being spent on maintaining data ingestion pipelines that have been implemented using low-level frameworks like Flume, Kafka, Logstash, and others.


IM: What are the most common data management and data analytics challenges that attendees are facing?

AP: The variety and variability of big data sources. Each new source requires substantial effort to integrate into the data processing architecture and, once integrated, something inevitably changes upstream that creates a rework cycle.


IM: What are the most surprising things that you heard from attendees regarding their data management initiatives?

AP: It is surprising to see how data movement is not actively considered or planned for at the time of big data project inception, and enterprises assume that the low-level frameworks will suffice. Maybe this is a consequence of the business driving Hadoop without sufficient weigh-in from IT implementers. When they run into failure scenarios, they realize how brittle and ad hoc such data movement can be and have the knee-jerk reaction of questioning the entire project’s investment, rather than shoring up this one weak link in the architecture.


IM: What do you view as the top data issues or challenges in 2016?

AP: At the risk of being redundant, the inability to feed Hadoop continuously and efficiently is stifling a lot of projects. This is not just an issue of data availability and completeness, but also of data quality and integrity, since big data source systems are not as tightly governed as in the world of traditional data. At the same time, the need for real-time analytics is increasing, which raises the cost of availability or quality issues.


IM: How do these themes and challenges relate to your company’s market strategy this year?

AP: StreamSets was founded to tackle the challenge of data movement in a big data environment marked by constant change. Our key theme this year is the need for performance management of data flows.

We will continue to highlight the linchpin importance of ingest infrastructure to the end-to-end system and the dangers of relying on low-level pipelines that are difficult to maintain and offer no operational visibility into the quality of the data flow. This is an unsustainable situation that will stop big data projects in their tracks and damage the reputation of big data in the enterprise.
