After all the hype around Apache Hadoop, we are starting to see some successful implementations that highlight its benefits. However, we do not yet see the kind of enterprise adoption typically associated with a technology that promises to manage high volumes, reduce implementation costs and, more importantly, monetize data. The question is why?
The increased volume, variety and velocity of “big data” have strained the conventional data processing technologies used for the data management life cycle. In response, many solutions have evolved to address these challenges. One key technology is Apache Hadoop, open source software used for reliable and scalable distributed computing.
Despite Hadoop's potential, however, most companies continue to approach it cautiously, evaluating its suitability as an enterprise solution, even though the concepts behind Hadoop MapReduce have been around since the early 2000s.
To assess Hadoop’s suitability for enterprise-wide adoption, it should be judged against the characteristics that help a technology (such as extract, transform and load) gain acceptance as a mainstream enterprise technology.
Characteristics Required for Enterprise-Wide Adoption
To become widely accepted and achieve an established presence within the enterprise, a technology should:
1. Be proven: A proven solution addresses use cases effectively and efficiently. For example, in a reporting scenario, a focused data mart with a large number of standard, pre-built reports is the proven approach, as opposed to taxing the data warehouse or operational data store.
2. Be based on a design paradigm: IT solution architects more readily embrace solutions based on designs that are an extension of an existing approach. For example, in the ETL world, the combination of row-based processing and rich user interface made understanding the solution and developing expertise in the technology relatively easy.
3. Be a solution enabler: Solution enablers are tools or frameworks that support a design paradigm and drive a technology's evolution from theoretical concept to mainstream production. A stable integrated development environment with tools to debug, monitor performance and manage workflows, in addition to accelerating code development, is a great enabler for a technology. Such tools let developers focus on application logic rather than non-functional challenges.
4. Offer good value: A technology is much more likely to be adopted and receive further investment if it performs well against these measures:
- Break-even analysis
- Return on investment
- Addition/reduction to total cost of ownership
ETL as a Mainstream Technology
Before detailing Hadoop’s progress, let’s evaluate ETL based on the characteristics outlined above.
1. Is it proven? The migration of data from source systems to target systems has long been a common use case for enterprise systems. As the number of source systems, data types and required standardizations increased, a metadata-driven approach became necessary, and ETL toolsets filled that need. Along with this core functionality, these tools have extended their capabilities to workflow management, data quality and profiling. It's now hard to imagine a data integration solution without an ETL toolset.
2. Is it based on a design paradigm? Traditionally, developers wrote custom code to move data from source to target. ETL tools used a metadata-driven approach to simplify and extend this activity, which resulted in greater comprehension and acceptance of the technology.
3. Is it a solution enabler? ETL boasts a mature tool stack with productivity enhancement features such as user interface-based drag and drop development, reusable routines, prebuilt functions for common transformations and connectors to the most commonly used data sources. Additional functions like data quality checks, metadata capabilities and integration with master data management products have made compelling use cases for wider usage.
4. Does it offer good value? ETL solutions have reduced development and maintenance costs and made it easier to capture the tangible costs (development, licensing and infrastructure) to the organization. This helps technology groups better evaluate buy-versus-build options, recommend the optimal design approach, and justify the investment and implementation.
The above-mentioned features evolved and matured over the years, leading to the adoption of ETL as a mainstream enterprise technology.
Hadoop on the Road to Enterprise-Wide Adoption
Now let’s evaluate Hadoop against the same set of characteristics. As more enterprises evaluate the technology against big data use cases, the Hadoop ecosystem continues to evolve.
1. Proven solution: The increasing need for businesses to understand the “why” of data as opposed to the “what” has resulted in an explosion in the quantity of data that organizations gather and store. Conventional data management techniques have been unable to keep up with these volumes because of hard-disk I/O and CPU bottlenecks. Hadoop addresses this problem, demonstrating its potential in the process. The MapReduce paradigm originated at Google, and Hadoop is used at such ubiquitous sites as Facebook and Yahoo, giving Hadoop MapReduce tremendous credibility as an enterprise-level solution.
2. Novel design paradigm: Although Hadoop has had a head start in terms of its solution credentials, it adheres to a new programming paradigm (MapReduce) and technique (pushing processing to the data nodes) for solving the big data problem. These involve unfamiliar concepts such as mappers, partitioners, combiners and reducers, and require a different mindset from system architects and developers. Hadoop also eschews conventional approaches to moving large amounts of data, instead shipping computation to where the data resides to remove I/O and network latencies. These techniques require a unique skill set, one that isn’t readily available.
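To make those unfamiliar roles concrete, here is a minimal single-process sketch of the MapReduce phases — mapper, combiner, partitioner and reducer — using word count as the canonical example. This is an illustrative simulation, not Hadoop code: a real cluster distributes these steps across nodes.

```python
from collections import defaultdict
from itertools import groupby

def mapper(line):
    # Emit a (word, 1) pair for every word in an input line.
    for word in line.split():
        yield (word.lower(), 1)

def combiner(pairs):
    # Pre-aggregation on the map side, used to cut shuffle traffic.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return counts.items()

def partitioner(word, num_reducers):
    # Route each key to a reducer, as Hadoop's hash partitioner does.
    return hash(word) % num_reducers

def reducer(word, counts):
    # Sum all partial counts for one key.
    return (word, sum(counts))

def run_job(lines, num_reducers=2):
    # Map + combine phase.
    combined = combiner(pair for line in lines for pair in mapper(line))
    # Shuffle phase: group pairs by destination reducer.
    partitions = defaultdict(list)
    for word, n in combined:
        partitions[partitioner(word, num_reducers)].append((word, n))
    # Reduce phase: each partition is processed sorted by key.
    result = {}
    for part in partitions.values():
        for word, group in groupby(sorted(part), key=lambda kv: kv[0]):
            key, total = reducer(word, [n for _, n in group])
            result[key] = total
    return result
```

Even in this toy form, the division of labor is visible: the mapper and reducer carry the application logic, while partitioning and shuffling are framework plumbing — exactly the part that trips up developers new to the paradigm.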
3. An immature solution enabler: Hadoop continues to be time-consuming to set up, deploy and use. Tools are needed for abstracting code, reducing complexity and allowing developers to focus on the actual problem as opposed to syntactical nuances, job orchestration and the executing environment. There have been great strides in this direction, but more is needed to encourage widespread adoption.
4. Value in the eye of the beholder: The better a technology’s ROI, the greater the interest in adoption. Hadoop-based enterprise projects broadly fall into two categories:
- Projects that replace existing physical infrastructure with Hadoop
- Projects that monetize data
Calculating the ROI for physical infrastructure projects is relatively straightforward, since it involves tangible elements such as hardware, software licenses, development time and operational costs.
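As an illustration, a toy break-even calculation for such an infrastructure-replacement project might look like the following. Every figure here is a hypothetical placeholder, not a benchmark from the article.

```python
# Toy break-even comparison for replacing legacy infrastructure
# with Hadoop. All cost figures are hypothetical placeholders.

legacy_annual_cost = 500_000     # licenses + proprietary hardware + ops
hadoop_migration_cost = 600_000  # one-time: cluster build-out, rewrites
hadoop_annual_cost = 200_000     # commodity hardware + support + ops

annual_savings = legacy_annual_cost - hadoop_annual_cost
break_even_years = hadoop_migration_cost / annual_savings

print(f"Break-even after {break_even_years:.1f} years")
```

The tangible inputs (licenses, hardware, operations) are exactly what makes this category easier to justify than data-monetization projects, where the revenue side of the equation is speculative.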
Determining the ROI for the second category is trickier given the subjective nature of the calculation. This requires a close partnership with business users and analysts to evaluate potential Hadoop use cases. To get a true reading, potential scenarios should take into account other big data and cloud technologies that eliminate the need for large upfront investments.
What’s Needed for Wider Adoption
For Hadoop to become a mainstream enterprise technology, the following deficiencies need to be addressed:
Guiding principles and reference templates: The growing number of vendors promoting tools, utilities and packages for integration with Hadoop has resulted in a crowded market space. This has made selecting the right big data tool a difficult decision. To help developers and architects sort through the maze, Hadoop needs guiding principles, best practices and reference templates.
Greater diversity in operating systems and programming languages: Currently, most Hadoop frameworks are coded in Java and executed on a Linux OS to take advantage of the abundance of UNIX tools and utilities. But for wide-scale enterprise acceptance, there need to be implementations on non-Linux platforms such as Microsoft Windows, coupled with support for non-Java programming languages.
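Hadoop Streaming already points in this direction: it lets the map and reduce steps be written in any language that reads standard input and writes tab-separated key/value pairs to standard output. A word-count mapper usable with Streaming can be as small as this sketch:

```python
#!/usr/bin/env python
# A word-count mapper for Hadoop Streaming: it reads raw text on
# stdin and writes tab-separated (word, 1) pairs to stdout, which
# is the contract Streaming expects from any mapper executable.
import sys

def emit_pairs(lines):
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

if __name__ == "__main__":
    for pair in emit_pairs(sys.stdin):
        print(pair)
```

On a cluster this script would be passed via the Streaming jar's `-mapper` option; the point is that no Java is required on the developer's side, which lowers one of the adoption barriers discussed here.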
Platform stability: This is one of the major constraints for enterprise adoption of Apache Hadoop. There were more than 10 Hadoop releases between October 2011 and September 2012. Hadoop MapReduce and Linux benefit from the open source network with rapid advances in functions and code quality. On the flip side, this can also inhibit enterprises from taking the plunge, because there is so much feature flux that application programming interfaces developed for older releases may have to be rewritten to ensure compatibility and stability. Enterprise versions of Hadoop should be established and the capabilities benchmarked, so organizations can work with a stable environment and code set.
Hadoop is a core component of big data analytics and is here to stay, but a more effective toolkit is needed to abstract away the complexity of MapReduce syntax and help developers generate and execute Hadoop jobs. Tools that address these issues are coming to market and will evolve to provide a more developer-friendly and enterprise-stable Hadoop ecosystem.
It’s only a matter of time before Hadoop MapReduce goes mainstream. Addressing the factors described above can accelerate the process.