At the center of the big data movement is an open source software framework called Hadoop, created by Doug Cutting, formerly of Yahoo!. Hadoop has become the technology of choice for applications that run petabyte-scale analytics across large numbers of computing nodes.
The Hadoop system consists of three core projects. Hadoop Common is a utility layer that provides access to the Hadoop Distributed File System (HDFS) and to the Hadoop subprojects. HDFS acts as the data storage platform for the Hadoop framework and can scale to massive size when distributed over numerous computing nodes.
Hadoop MapReduce is a powerful framework for processing data sets across clusters of Hadoop nodes. The Map and Reduce process splits the work by first mapping the input across the nodes of the cluster, then splitting the workload into even smaller data sets and distributing them further throughout the computing cluster. This allows Hadoop to leverage massively parallel processing (MPP), and because it runs on inexpensive commodity servers, it dramatically reduces the upfront capital costs traditionally required to build out a massive system. As the nodes return their answers, the Reduce function collects and combines the information to deliver a final result. In the past, similar work called for highly specialized software and hardware combinations, a significant hurdle for companies that attempted to analyze very large data sets.
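The map-shuffle-reduce flow described above can be sketched in miniature. This is a single-process Python illustration of the pattern, not Hadoop's actual API; the word-count task and the function names are assumptions chosen for the example:

```python
from collections import defaultdict

# Map phase: each "node" turns its slice of the input into (key, value) pairs.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

# Shuffle: group all values emitted for the same key together.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: combine each key's values into a final answer.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big cluster", "big data"]
result = reduce_phase(shuffle(map_phase(lines)))
print(result)  # {'big': 3, 'data': 2, 'cluster': 1}
```

In a real Hadoop cluster the map and reduce functions run on many machines at once, with the framework handling the distribution, shuffling and fault tolerance that this toy version glosses over.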
Internet-age companies like Yahoo! and Facebook embraced Hadoop early on and have built out some of the largest Hadoop implementations in the world. Yahoo! continues its efforts around Hadoop in an independent spin-off called Hortonworks (as in Dr. Seuss's elephant). Yahoo! also laid a foundation for Hadoop's success by donating its Hadoop code to the Apache Software Foundation, jumpstarting the open source framework community.
Facebook is now reportedly powering the world's largest Hadoop analytic data warehouse, using HDFS to store more than 30 petabytes of data. That single project has grown by 10 petabytes since 2010 and supports critical business intelligence and analytic functions for the company.
Growing the Environment
To extend the Hadoop ecosystem capabilities, new open source projects have added functionality and enterprise-ready features to the environment. Many share the colorful naming conventions of Doug Cutting's Hadoop, named after his son's toy elephant.
- Avro is a data serialization system that converts data into a fast, compact binary data format. When Avro data is stored in a file, its schema is stored with it.
- Cassandra is a column-oriented, highly scalable distributed database. The Cassandra database features a higher level of fault tolerance than HDFS.
- Chukwa is a large-scale monitoring system that provides insights into the Hadoop distributed file system and MapReduce.
- HBase is a scalable, column-oriented distributed database modeled after Google's BigTable distributed storage system. HBase is well-suited for real-time data analysis.
- Hive is a data warehouse infrastructure that provides ad hoc query and data summarization for Hadoop-supported data. Hive utilizes a SQL-like query language called HiveQL. HiveQL can also be used by programmers to execute custom MapReduce jobs.
- Mahout is a data mining library designed to work on the Hadoop framework. Mahout delivers a core set of algorithms designed for clustering, classification and batch-based collaborative filtering.
- Pig is a high-level programming language and execution framework for parallel computation. Pig works within the Hadoop and MapReduce frameworks.
- ZooKeeper provides coordination, configuration and group services for distributed applications working over the Hadoop stack.
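Avro's core idea, storing the schema alongside the data, can be illustrated with a minimal Python sketch. This hand-rolled JSON container is only an analogy for the concept, not Avro's real binary file format, and the record and field names are assumptions for the example:

```python
import json

# An Avro-style schema: field names and types travel with the data.
schema = {
    "type": "record",
    "name": "User",
    "fields": [{"name": "id", "type": "int"},
               {"name": "name", "type": "string"}],
}
records = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]

# Write schema and records into one container, as Avro data files do.
blob = json.dumps({"schema": schema, "records": records})

# A reader needs no outside information: the schema is in the file itself.
container = json.loads(blob)
field_names = [f["name"] for f in container["schema"]["fields"]]
print(field_names)  # ['id', 'name']
```

Because every Avro file carries its own schema, a consumer written long after the data was produced can still decode it without coordinating with the writer.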
The combination of commodity servers and open source software makes Hadoop a compelling enterprise solution.
It's Not a Perfect World
Hadoop is doing an excellent job of fostering the early adoption of big data. The framework is uniquely designed to meet the challenges of big data through scalability and cost. The additional projects that have grown up around the original framework are helping to overcome some of the initial shortcomings with better features, smarter interfaces and more robust management tools.
In the end, no solution is perfect, and Hadoop suffers from shortcomings that are still being addressed. For example, each HDFS instance communicates with a single master node, the NameNode, creating a single point of failure if that node goes offline. When this occurs, the HDFS instance must restart and pick up where it left off, causing significant delays in the system's work processes.
HDFS is not an ACID (atomicity, consistency, isolation, durability) compliant database, and this eliminates HDFS as a primary data source for highly critical enterprise data.
The Hadoop stack is still an emerging platform, and it lacks the polished features that many companies expect. Before embarking on a Hadoop project, it is critical that you research your needs and understand existing systems and how they will work with Hadoop. It's prudent to start with a small, well-defined project prior to diving into the deep end of a new technology stack.
Big Data is Going Mainstream
Hundreds of companies are already working with big data to add value to their BI programs. According to the Apache Hadoop website, Quantcast, a Web metrics and measurement company, is running a 3,000 core, 3,500 terabyte deployment that processes more than a petabyte of raw data each day. The University of Nebraska-Lincoln is utilizing a Hadoop cluster to store 1.6 petabytes of physics data for the computing portion of an experiment.
One deployment I have recently seen up close is Yahoo!'s. The Yahoo! installations run Hadoop on 100,000 CPUs in 40,000 computers. Yahoo! uses these systems to support analytics for its advertising systems and Web search. Other big data projects at Yahoo! reach beyond Hadoop, including a partnership with vendor Tableau Software designed to optimize ad placement on Yahoo! media properties.