Commercializing Enterprise-Grade Hadoop: Tools for Harnessing Petabyte Analytics
Hadoop is riding the hype wave right now. You’ll find many IT professionals who know just enough about Hadoop to be dangerous in a cocktail party setting, but not enough for their own comfort to respond to grilling from the chief technology officer or the geekier business executives.
If you’re slightly bewildered by all the buzz over this new technology with the funny-sounding moniker, you’re not alone. The official story is that Hadoop was the name of the inventor’s kid’s stuffed elephant. However, for most IT professionals, it could easily be an acronym for "Heck, Another Darn Obscure Open-source Project." The fact that Hadoop, managed by Apache, includes subprojects with similarly opaque names (such as Pig, Hive, Chukwa, and ZooKeeper) contributes to the queasy feeling that this is an untamed menagerie of squealing beasties.
And if you’ve pegged Hadoop as an advanced analytics initiative to mine petabytes of unstructured information, prepare for further bewilderment. The Apache Hadoop project states that it develops open-source software for “reliable, scalable, distributed computing.” Yes, that’s true, but the better-informed among you may be puzzling over the linkages that people often draw between Hadoop, in-database analytics, and MapReduce.
So what exactly is Hadoop, and, just as important, why should enterprise analytics professionals care? As I discussed in my Forrester report on in-database analytics late last year, Hadoop is primarily for advanced analytics in cloud environments. At heart, the Apache Hadoop project defines an analytic processing pushdown workflow model and a distributed analytic file store for analyzing unstructured information sets. But the range of subprojects goes well beyond that to encompass a distributed columnar database, data warehousing infrastructure, MapReduce interface, ad-hoc query language, data collection capability, and other utilities for development and management of Hadoop clusters.
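For readers who want a concrete picture of the MapReduce workflow mentioned above, here is a minimal sketch in plain Python. It is not the Hadoop API, and it runs in a single process rather than across a cluster. The function names (map_phase, shuffle, reduce_phase, run_job) and the word-count task are assumptions chosen purely for illustration. The idea it shows is the general one: a mapper emits key/value pairs from each record, the framework groups values by key, and a reducer collapses each group into a result.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Pair = Tuple[str, int]


def map_phase(records: Iterable[str],
              mapper: Callable[[str], Iterator[Pair]]) -> Iterator[Pair]:
    """Apply the mapper to every input record and stream out key/value pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs: Iterable[Pair]) -> Dict[str, List[int]]:
    """Group all emitted values by key (a cluster framework does this across nodes)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]],
                 reducer: Callable[[str, List[int]], int]) -> Dict[str, int]:
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line: str) -> Iterator[Pair]:
    """Hypothetical mapper: emit (word, 1) for each word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def count_reducer(key: str, values: List[int]) -> int:
    """Hypothetical reducer: total the counts emitted for one word."""
    return sum(values)


def run_job(records: Iterable[str]) -> Dict[str, int]:
    """Chain the three phases: map, then shuffle, then reduce."""
    return reduce_phase(shuffle(map_phase(records, word_mapper)), count_reducer)


if __name__ == "__main__":
    documents = ["Hadoop stores data", "Hadoop processes data"]
    print(run_job(documents))
```

In a real deployment the map and reduce steps would run in parallel on the nodes that hold the data, which is the "pushdown" part of the model; the sketch only shows the shape of the computation.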
Consider Hadoop the cornerstone of the sprawling collection of next-generation technologies known as “NoSQL.” Hadoop, like many NoSQL technologies, is still primarily an open-source community and has not yet made the critical transition into a mature enterprise-grade analytics market segment. However, in recent months Forrester has seen increasing incorporation of Hadoop technologies and interfaces into the solution portfolios of enterprise data warehousing (EDW) vendors (most notably Teradata, IBM, Aster Data, Greenplum, and Vertica) and business intelligence (BI) vendors such as Pentaho and MicroStrategy.
We see most startup activity in the Hadoop space, from companies such as Cloudera and Datameer, focusing on solutions that are completely open: open source, open standards, open architectures, and open to service-oriented integration into any SaaS/cloud-based analytics environment. If you want to check out the range and maturity of Hadoop adoptions built on open-source codebases, visit the Hadoop “powered by” wiki.
Note how many of these deployments are for inline analytics, natural language processing, sentiment analysis, social network analysis, social media monitoring, semantic search, and intelligent Web harvesting—in other words, all the core applications that Forrester includes under advanced analytics. Many of the first-generation Hadoop solution providers have tended to disparage traditional EDWs as unsuitable for the extremely complex content, rules, and models that characterize most Hadoop applications.
Though there is some kernel of historical truth to this viewpoint, many, if not most, of the leading EDW vendors have addressed these limitations in the past few years through a renewed emphasis on in-database analytics, complex content, predictive analysis, and MapReduce. Some EDW vendors (listed above) have even taken the baby step of incorporating interfaces from their platforms to external Hadoop clusters. However, no traditional EDW vendor has, as yet, built Hadoop’s distributed unstructured filestore (Hadoop Distributed File System) or structured columnar database (HBase) into their core architectures. Forrester expects that at least 2-3 of the leading EDW vendors will begin to go down that road in the coming year.
Is Hadoop a stand-alone market in the advanced analytics arena? At Forrester, we don’t believe that a stand-alone Hadoop platform/tool segment will emerge over the long term. This technology is a natural extension to vendors’ and their users’ EDW, BI, in-database analytics, petabyte scaling, and other advanced analytics strategies.
Nevertheless, we’ve seen recent launches of enterprise-grade Hadoop tools both from startups and from established analytics companies such as IBM, Pentaho, and MicroStrategy. We see other vendors of commercialized Hadoop tools coming along, most of them from the hugely innovative open-source and consulting community that has sprung up around the core cloud-oriented technologies pioneered by Yahoo and Google. Many of these pure-plays will undoubtedly be acquired by leading EDW, BI, and advanced analytics vendors in the next two to three years as they prove out their capabilities.
Most of the Hadoop startups are trying to reinvent the EDW for the new age of complex content and predictive analytics in the cloud. They are adding query, exploration, modeling, metadata, workload management, job scheduling, cluster administration, security, and other features to appeal to enterprises and service providers who need a robust open-source software platform. You had best believe that all the established EDW vendors are paying close attention to these efforts and tuning their development and acquisition strategies accordingly.
To date, no traditional enterprise has built its entire advanced analytics strategy on Hadoop or its kindred framework MapReduce, though increasing numbers of nouveau Web 2.0 service providers have oriented their business models around it. As commercial, robust Hadoop platforms emerge over the next several years and are integrated into users’ core EDW platforms, you can expect to see the Hadoop footprint in the average enterprise grow, especially among users who have outsourced much of their analytics to the cloud.
But Hadoop is still very much a futures discussion that the average EDW professional should monitor as it moves inexorably toward maturity. As unstructured content, social media, and petabyte scaling move into your core BI strategies, you’ll need to familiarize yourself with this important new platform for analytics-driven business.