Turning Hadoop Into an Analytics Platform for the Enterprise
In a previous article, I gave my view of why the Hadoop programming framework is here to stay, and how its momentum is driving acceptance in the enterprise as investment in modern data platforms continues to swell. In this piece, I’ll drill down into specifics of how Hadoop can be used as a valuable business intelligence tool for enterprise organizations.
Enterprise big data implementations are estimated to be a $38.4 billion market this year(1), but despite increased IT spending, the majority of Hadoop big data projects are still in the development lab. Even highly technical organizations frequently have to develop in-house solutions in order to make Hadoop valuable(2). Enterprise data is going into the Hadoop data lake but organizations are often not generating value from it. That’s a challenge that organizations need to get ahead of, as Gartner predicts that “by 2017 most business users and analysts in organizations will have access to self-service tools to prepare data for analysis” as power shifts from IT to business analysts(3). Companies that are able to create more widespread access to the power of Hadoop across their organizations stand to reap huge gains.
For example, at Actian we worked with a leading sports apparel retailer to leverage data analytics in order to better understand its customers and maximize sales opportunities. The retailer acquired a fitness-tracking platform with the goal of analyzing intelligence about the routes, activities, diets and social network graphs of its customers, but it wasn’t long before it accumulated more data than its CRM system could handle. The retailer turned to Hadoop, which became both the answer and the problem, given the company’s lack of data scientists trained to analyze Hadoop data. Similarly, a leading bank found that by leveraging Hadoop it was able to process twenty times more data than was possible with its previous database. This bank could now query 200 billion rows of data and invest $100 billion of float in just 28 seconds, as opposed to the 3 hours it once took, while instituting multi-level risk exposure analysis for controls and regulatory compliance across the organization. But in order to enjoy those benefits, both companies also had to make specific investments.
First, Hadoop needs to be made accessible through SQL, a simple but powerful language. According to Forrester, enterprise adoption of Hadoop is now mandatory, and “for those familiar with business intelligence, SQL on Hadoop will open the door to familiar access to their data.”(4) What the above cases illustrate is that, despite the expensive skillsets, long and error-prone implementation cycles, lack of native support for popular BI tools and inadequate execution times, Hadoop remains a powerful framework for big data and a vital technology for the data-driven business. Would Hadoop implementations be more successful if they had a more direct connection to the C-Suite? Could the technology be leveraged to drive even more value from existing BI and analytics tools? The more data, and the greater the diversity of data, used to feed those BI tools, the more intelligent they become. SQL must be at the center of this conversation.
The BI tools exist and most businesses already have a conduit for data and analytics: a business analyst. But in that conduit also lies the barrier, as most analysts rely on SQL, the lingua franca for managing relational database systems, to query data and perform analysis. Until now there has been little success in using SQL to access data in Hadoop, as most SQL in Hadoop products on the market have very limited SQL capabilities. The appetite for Hadoop-driven BI is there, but the gap between the data itself and actionable results must be bridged. It’s time to remove the barriers to Hadoop data, move projects into production and start realizing value from leveraging Hadoop within enterprise information architectures. Doing this successfully will require technology that is designed to overcome Hadoop’s immaturity and that delivers open data access, high performance and enterprise-grade capabilities.
Hadoop offers a huge opportunity for enterprises to capture and derive insights from big data and ultimately drive transformational business outcomes, given its ability to handle large data sets. To give just a few examples, Hadoop can provide enterprises with a competitive advantage through faster insights, bolster their bottom line by reducing risk and fraud and uncover customer desire for new products and services. Hadoop has already helped to address the issues of cost and size for data storage—all that’s left is to turn it into a data analytics platform that can deliver for the enterprise. To get there, Hadoop’s technology shortcomings need to be overcome. Fortunately, recent advances like YARN (Yet Another Resource Negotiator) make it possible to decouple MapReduce’s resource management and scheduling capabilities from the data processing component, enabling Hadoop to support more varied processing approaches and a broader array of applications.(5) With the MapReduce skills hurdle out of the way, the race is on to remove the complexity of analytics in Hadoop so that organizations can tap into it using standard BI tools.
SQL has long been the standard trade language for those who work with data and databases. Oracle SQL Developer alone receives over 3.5 million downloads annually(6), compared to an estimated 150,000 MapReduce programmers worldwide. Additionally, nearly all enterprise applications and business analyst tools use SQL to manage data, so why use anything else?
The benefits are simply too great to overlook. Enterprises can use existing SQL-trained users instead of spending time and money to train from the ground up or hunt down elusive and expensive talent. Using SQL also opens the door to existing business intelligence and visualization tools as well as existing dashboards and reports. Existing SQL applications and queries don’t have to be rewritten to access Hadoop data, and the data itself does not have to be brought out of Hadoop in order to use it. In the same vein, duplicating Hadoop data is no longer necessary, assuring the cost effectiveness of the technology.
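To make the skills gap concrete, here is a minimal sketch that computes the same aggregate twice: once with hand-written map and reduce steps, and once with a single SQL query. Everything in it is an assumption for illustration. The sample rows, the workouts table and the function names are hypothetical, and Python's built-in sqlite3 module stands in for a SQL engine; this is not Hadoop, MapReduce or any vendor's product.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (customer_id, activity, miles)
ROWS = [
    ("c1", "run", 3.0),
    ("c2", "ride", 12.5),
    ("c1", "run", 4.5),
    ("c3", "ride", 8.0),
    ("c2", "run", 2.0),
]


def map_phase(rows):
    """Emit (key, value) pairs, the way a mapper would."""
    for _customer, activity, miles in rows:
        yield activity, miles


def reduce_phase(pairs):
    """Group pairs by key and sum the values, the way shuffle plus a reducer would."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def total_miles_mapreduce(rows):
    """Total miles per activity using explicit map and reduce steps."""
    return reduce_phase(map_phase(rows))


def total_miles_sql(rows):
    """Total miles per activity using one declarative SQL query."""
    conn = sqlite3.connect(":memory:")  # in-memory stand-in for a SQL engine
    try:
        conn.execute(
            "CREATE TABLE workouts (customer_id TEXT, activity TEXT, miles REAL)"
        )
        conn.executemany("INSERT INTO workouts VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT activity, SUM(miles) FROM workouts GROUP BY activity"
        )
        return dict(cursor.fetchall())
    finally:
        conn.close()


if __name__ == "__main__":
    print(total_miles_mapreduce(ROWS))
    print(total_miles_sql(ROWS))
```

Both functions return the same totals per activity. The difference is that the second needs only a query an analyst could already write, while the first requires someone to program the grouping logic by hand.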
SQL on Hadoop Options
There are many solutions attempting to deliver SQL on Hadoop capabilities. These can be divided into three camps:
● Marketing Jobs (SQL Outside Hadoop): Employs both Hadoop and a DBMS cluster and uses a connector to pass data between the two systems.
● Wrapped Legacy (Mature But Non-Integrated): These solutions take existing SQL engines and modify them so that, when generating a query execution plan, they can determine which parts of the query should be executed through MapReduce and which through SQL operators.
● From Scratch (Integrated But Immature): This approach builds SQL engines from the ground up to enable native SQL processing of data in HDFS while avoiding MapReduce.
SQL in Hadoop
SQL in Hadoop is a relatively new category allowing for high-performance data enrichment and visual design capabilities without the need for MapReduce skills. Users can build and test models with data mining, data science and machine learning algorithms. Putting these into production is made simple with common SQL tools for business intelligence.
Turning Hadoop into an Analytics Platform
Regardless of the approach taken, with increased adoption of modern data platforms, the need to analyze large volumes of data in Hadoop and deliver scalable SQL access to that data will only grow. Modernization is critical in creating transformational value out of big data, and failing to keep up can put any enterprise’s long-term position in jeopardy. Organizations need to think about real, actionable methods they can use to make Hadoop work for them as a BI tool. To choose not to is to risk being left behind.
Actian CTO Michael Hoskins directs Actian’s technology innovation strategies and evangelizes game-changing trends in big data, analytics, Hadoop and cloud. Mike received the AITP Austin chapter's Information Technologist of the Year Award for his leadership in developing Actian DataFlow, a highly parallelized framework to leverage multicore. Follow Mike on Twitter: @MikeHSays
1 Big Data Vendor Revenue and Market Forecast 2013-2017, Jeff Kelly, 12 February 2014
2 BNY Mellon Finds Promise and Integration Challenges with Hadoop, Rachael King, The Wall Street Journal, 5 June 2014
3 Gartner Says Power Shift in Business Intelligence and Analytics Will Fuel Disruption, Gartner, 27 January 2015
4 Forrester’s Hadoop Predictions 2015, Mike Gualtieri, Principal Analyst at Forrester Research, 4 Nov 2014
5 Apache Hadoop YARN (Yet Another Resource Negotiator), Margaret Rouse, TechTarget, December 2013
6 Oracle SQL Developer: What People Are Saying…, Oracle Corporation, May 2014