Several top themes emerged from the recent Strata & Hadoop World conference in San Jose, including the use of data to improve lives, from healthcare to housing to a dozen social issues in between.

Prakash Nanduri, CEO at Paxata, shared with Information Management his observations on, and delight with, this theme. He also spoke about the growing maturity of Hadoop in the market and the wide-scale use of business intelligence and analytics among organizations today.


Information Management: What are the most common themes that you heard among conference attendees and how do those themes align with what you expected?

Prakash Nanduri: One of the most surprising themes I saw throughout the conference was “Data for good.” Very interesting work is being done: predicting Ebola outbreaks, ongoing research on diabetes, and mapping low-income housing to those in need.

It’s always expected that you will attend sessions about banks using data to reduce risk or retail companies using data to find more customers, but it was very encouraging to see so much work being done by data scientists to improve lives.


IM: What are the most common data challenges that attendees are facing?

PN: The market for data analytics continues to grow – there is a BI application or tool for every use case. The only challenge there may be understanding how these tools differ and whether the business needs them all.

On the data management side, it feels like those who have moved to Hadoop for a modern data collection and storage architecture now want to make all the rich data they have available to their business teams. Their usual approaches are not keeping up: taking data out of Hadoop and putting it into traditional data warehouses defeats the purpose, and even if that were economically feasible, the speed and volume of data break every traditional process.

People are looking for ways to understand the data without having to build schemas, and they want to shape it, clean it and work with it without creating month-long IT projects. Without a doubt, user-driven data preparation, big data discovery and real-time big data analytics were all hot topics, which signals to me that everyone is trying to figure out how to make better use of their data.

The other interesting thing I noticed is that everyone is struggling with data preparation and big data discovery. The perception is that these problems are faced only by business people and that data scientists have magic tools to make it easy, but clearly data scientists would rather not be doing this work if they could instead be doing high-value analysis.


IM: What are the most surprising things that you heard from attendees regarding their data management initiatives?

PN: Hadoop has grown up. I heard fewer discussions about Hadoop as a science project run by a small team, and more dialog about operational systems that are being hooked to business initiatives. For example, one forward-thinking team is doing a ton of security analytics on all the sensor data being collected in Hadoop. They are establishing baselines they never could before, because they now have two years’ worth of data to help them identify patterns, anomalies and outliers.

The other interesting thing I noted was the maturity of the ecosystem as a whole. For example, Cloudera is now positioning itself as the main curator of open standards in Hadoop, with a track record of bringing new open source solutions into its platform (such as Apache Spark, Apache HBase and Apache Parquet). Very exciting times when it comes to the entire data management lifecycle.


IM: What does your company view as the top data issues or challenges in 2016?

PN: Our customers ask questions like “How can we turn everyone in the company into an analyst?” and “How do we build an information-driven culture, from the systems to the processes to the people?” It’s easy to buy systems, but so often the hard questions are not tackled, like “Do users have data in the context of the questions they want to ask? If not, what does it take to get there? Do they have the skills or tools to do that?”

Well-intentioned companies say they want to be information-driven but ignore these big issues, which ends up frustrating their people, breaking processes and creating data chaos instead.


IM: How do these themes and challenges relate to your company’s market strategy this year?

PN: We are always looking for ways to move faster for our customers. In order to do that, we have to minimize the friction they have working with data, so they get to clean, contextual, complete and consumable information faster.

Those words sound simple but there is a lot of machine learning we build into our system to make these things happen:

Clean – there are two elements to address here: syntactically and semantically clean data. Syntactically clean data ensures that formatting rules are applied consistently to the data (e.g., two-letter abbreviations for states, middle initial versus full middle name), while semantically clean data ensures that information is accurately represented (e.g., a value labeled as a city in the data set is actually a city). Keeping data clean requires ongoing collaboration, not only among business users but between the business teams and IT.
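To make the syntactic/semantic distinction concrete, here is a minimal, hypothetical sketch (not Paxata's actual implementation; the lookup tables and function names are invented for illustration):

```python
# Hypothetical sketch of syntactic vs. semantic cleaning.
# The lookup tables below are made up for illustration.

STATE_ABBREVIATIONS = {"california": "CA", "ca": "CA", "new york": "NY", "ny": "NY"}
KNOWN_CITIES = {"San Jose", "New York", "Sacramento"}

def clean_state(value: str) -> str:
    """Syntactic cleaning: apply one consistent rule (two-letter state codes)."""
    return STATE_ABBREVIATIONS.get(value.strip().lower(), value.strip().upper())

def is_valid_city(value: str) -> bool:
    """Semantic cleaning: check that the value really denotes a city."""
    return value.strip().title() in KNOWN_CITIES

print(clean_state("california"))   # "CA"
print(is_valid_city("San Jose"))   # True
print(is_valid_city("94110"))      # False -- a ZIP code, not a city
```

The syntactic rule normalizes form without asking what the value means; the semantic check validates meaning against reference data, which is where the ongoing business/IT collaboration comes in.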

Contextual – this is the ability to drill into the data as you explore different ways to answer the question. For example, we may start with a customer segmentation exercise to identify which audience is spending the most on our products. Once we discover high net-worth individuals, we move into a customer targeting exercise, which requires us to bring in demographic data to give us additional properties for targeting that audience. Along the way, we don’t want delays to stop our exploration so we must have the ability to dynamically bring in data to drive richer context.
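The enrichment step in that example can be sketched as follows; this is a hypothetical illustration with made-up records and field names, not a real product workflow:

```python
# Hypothetical sketch: dynamically enriching a discovered customer segment
# with demographic attributes to add context for targeting. All data is
# invented for illustration.

high_net_worth = [{"customer_id": "cust-1", "spend": 40_000}]
demographics = {"cust-1": {"age_band": "45-54", "region": "West"}}

# Merge demographic properties onto each customer record as we drill in.
enriched = [
    {**customer, **demographics.get(customer["customer_id"], {})}
    for customer in high_net_worth
]
print(enriched[0]["region"])   # "West"
```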

Complete – this means you have all the data required to answer your question. Getting back to our example, we would need to combine our customer-spend data (how much do they buy from us) with third party data about what customers spend overall (what is their total wallet share). With this comprehensive view, we can determine how much of the market we are truly capturing.
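A minimal sketch of that wallet-share calculation, again with invented records and field names:

```python
# Hypothetical sketch: combining our customer-spend data with third-party
# total-spend data to estimate wallet share. All figures are made up.

our_spend = {"cust-1": 40_000.0, "cust-2": 15_000.0}      # what they buy from us
total_spend = {"cust-1": 100_000.0, "cust-2": 60_000.0}   # what they spend overall

def wallet_share(customer_id: str) -> float:
    """Fraction of a customer's total category spend that we capture."""
    return our_spend[customer_id] / total_spend[customer_id]

for cid in our_spend:
    print(cid, f"{wallet_share(cid):.0%}")   # cust-1 40%, cust-2 25%
```

The point of "complete" is that neither data set alone answers the question; only the combined view shows how much of the market is truly captured.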

Consumable – this means making the data available in whatever tool the business person wants to use. Data should be delivered in a flexible format that can be brought into everything from standard Excel spreadsheets to applications used for ad-hoc analysis and visualization. Again, we don’t want to cause friction with delays in getting answers because the data is not easily digestible by those asking the questions. As business analysts, data scientists, curators and IT teams start to embrace a unified approach to these issues, their next set of questions will be around scalability, security, governance and collaboration as key elements of the modern data preparation stack.
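As one hypothetical illustration of "consumable" delivery, prepared data can be emitted as CSV, a format that Excel and most ad-hoc analysis and visualization tools ingest directly (the records below are invented):

```python
# Hypothetical sketch: delivering prepared data as CSV so it is directly
# consumable by Excel and most analysis tools. Records are made up.
import csv
import io

rows = [
    {"customer_id": "cust-1", "city": "San Jose", "state": "CA", "spend": 40000},
    {"customer_id": "cust-2", "city": "New York", "state": "NY", "spend": 15000},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["customer_id", "city", "state", "spend"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```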
