The State of Hadoop for Big Data Scientists


I just landed at Strata+Hadoop World in New York -- a major gathering of data scientists and information managers within the big data ecosystem. My mission: Determine the state of Hadoop for data scientists.

I'll leverage a range of meetings to determine the state of Hadoop. I'll be sure to update this blog after each meeting, recapping the conversation for data scientists and IT executives who are mulling a range of big data platforms upon which to build new applications. I'll also be gathering channel partner insights for my daily blog on ChannelE2E. Here's a look at my expected schedule.

Meeting 1: MemSQL CEO Eric Frenkiel

The database-centric company just launched Streamliner, which allows customers to build real-time data pipelines for real-time analysis. It works with Spark, though MemSQL is focused on a range of technologies beyond that open source option. An example customer: An energy company in Oklahoma is using Streamliner and MemSQL to monitor very expensive drill bits during the fracking process. Drilling adjustments (based on sensors that monitor the bit's performance, temperature and other real-time analytics) can be made instantly.

Meeting 2: Trifacta Co-founder and CTO Sean Kandel, VP of Marketing Joe Scheuermann and Director of Product Marketing Will Davis

The company's focus is simple to explain: Before you can effectively analyze data, you need to prepare it, weed out or fix anomalous values (was a phone number field filled with a zip code?), and ultimately enrich it. You may also need to combine and cleanse data from multiple sources.
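To make the phone-number-versus-zip-code example concrete, here is a minimal Python sketch of that kind of check. It is not Trifacta's product or API; the function names, the `phone` field name and the assumption of 10-digit U.S. phone numbers are all hypothetical and chosen only for illustration.

```python
import re

# Assumption: U.S. data, where a zip code is 5 digits (optionally ZIP+4)
# and a phone number has 10 digits, optionally preceded by country code 1.
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")
PHONE_DIGITS = 10


def check_phone(value):
    """Classify a phone field as 'ok', 'looks_like_zip' or 'invalid'."""
    text = value.strip()
    if ZIP_RE.match(text):
        return "looks_like_zip"
    digits = re.sub(r"\D", "", text)  # drop spaces, dashes, parentheses
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # strip the leading country code
    return "ok" if len(digits) == PHONE_DIGITS else "invalid"


def flag_anomalies(records):
    """Return (row index, raw value, verdict) for every suspicious phone field."""
    flagged = []
    for index, record in enumerate(records):
        raw = record.get("phone", "")
        verdict = check_phone(raw)
        if verdict != "ok":
            flagged.append((index, raw, verdict))
    return flagged


if __name__ == "__main__":
    sample = [
        {"name": "A", "phone": "(212) 555-0100"},
        {"name": "B", "phone": "10001"},
        {"name": "C", "phone": "555"},
    ]
    for row in flag_anomalies(sample):
        print(row)
```

A real preparation tool would go further, suggesting fixes and profiling whole columns, but the basic idea is the same: test each field against the shape its values should have, and surface the rows that do not fit before analysis begins.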

Trifacta has about 40 customers, many of them big names (Orange Telecom, GoPro, Pfizer, PepsiCo, P&G and more). For now, most of its focus is on the Hadoop ecosystem.

Meeting 3: MapR VP of Marketing Jack Norris

MapR is one of the leading providers of Hadoop. The company’s latest move involves OJAI (the Open JSON Application Interface), which essentially allows developers to write and adjust Hadoop applications far more quickly. Norris showed me sample code written the old way, and then the equivalent code leveraging OJAI.

Basically, OJAI required fewer commands and fewer lines of code to deliver far more application capabilities. Developers can test OJAI now, with general availability expected later this year.

Meeting 4: Hortonworks VP of Product and Alliance Marketing Matt Morgan

Hortonworks is another leading provider of Hadoop. Much of the company’s focus at the conference involves the Internet of Things (actually, the Internet of Anything) and sensor networks. Hortonworks’ challenge: How to manage edge devices and sensors so (A) they gather the right information and (B) send only the required information over the wire back to the corporate data lake/Hadoop storage system?

The answer involves Hortonworks DataFlow — a new offering powered by Apache NiFi. “HDF is designed to make it easy to automate and secure all types of data flows and collect, conduct and curate real-time business insights and actions derived from any data, from anything, anywhere,” the company asserted.

Overall, it sounds like Global 2000 companies continue to increase their on-premises consumption of Hadoop.

Plus, public cloud providers like Google, Microsoft and Amazon are promoting a range of new Hadoop and big data tools. And on-premises equipment providers like Cisco Systems, EMC, IBM and others are striving to promote converged infrastructure (compute, storage, network) for big data Hadoop applications.
