The explosive growth in the amount of data created in the world continues to accelerate and surprise us in terms of sheer volume, though experts could see the signposts along the way. Gordon Moore, co-founder of Intel and the namesake of Moore's law, first forecast that the number of transistors that could be placed on an integrated circuit would double every year. Since 1965, this "doubling principle" has been applied to many areas of computing and has more often than not proved correct.
When applied to data, not even Moore's law seems to keep pace with the exponential growth of the past several years. Recent IDC research on digital data indicates that in 2010, the amount of digital information in the world reached beyond a zettabyte in size. That's one trillion gigabytes of information. To put that in perspective, a blogger at Cisco Systems noted that a zettabyte is roughly the size of 125 billion 8GB iPods fully loaded.
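The unit conversions above are easy to check by hand. The short Python sketch below does so, assuming decimal (SI) units, where a gigabyte is 10^9 bytes and a zettabyte is 10^21 bytes; the variable names are illustrative only.

```python
# Decimal (SI) storage units, expressed in bytes (an assumption; the
# article does not say whether decimal or binary units are meant).
GIGABYTE = 10**9
ZETTABYTE = 10**21

# How many gigabytes fit in one zettabyte.
gigabytes_per_zettabyte = ZETTABYTE // GIGABYTE

# The iPod comparison: 125 billion devices at 8 GB each.
ipod_count = 125 * 10**9
total_ipod_bytes = ipod_count * 8 * GIGABYTE

print(f"{gigabytes_per_zettabyte:,} gigabytes per zettabyte")
print("125 billion 8GB iPods equal one zettabyte:", total_ipod_bytes == ZETTABYTE)
```

Under that assumption, both figures in the text agree with each other: one trillion gigabytes and 125 billion 8GB devices describe the same quantity.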
As the overall digital universe has expanded, so has the world of enterprise data. The good news for data management professionals is that our working data won't reach zettabyte scale for some time, but in the meantime we will deal with unprecedented data growth from a wide variety of sources and systems.
The term "big data" has emerged to describe this growth along with the systems and technology required to leverage it. As with many new technologies, the term has yet to be universally defined, but generally speaking, big data represents data sets that can no longer be easily managed or analyzed with traditional or common data management tools, methods and infrastructures.
At its core, big data carries certain characteristics that add to this challenge, including high velocity, high volume and in some cases a variety of data structures. These characteristics bring new challenges to data analysis, search, data integration, information discovery and exploration, reporting and system maintenance.
Big Data Sources
Early adopters of big data included scientific communities with access to expensive supercomputing environments designed for analyzing massive data sources. These projects attacked the volume side of the big data challenge but not necessarily the velocity and variety aspects - and they were expensive. Pioneering work focused on specialty projects like genomics research or pharmaceutical research data using advanced analytics to discover information that was too difficult to identify in traditional environments. Today, the scope of big data is growing beyond niche sources to include sensor and machine data, transactional data, metadata, social network data and consumer-authored information.
An example of sensor and machine data is found at the Large Hadron Collider at CERN, the European Organization for Nuclear Research. CERN scientists can generate 40 terabytes of data every second during experiments.
Similarly, Boeing jet engines can produce 10 terabytes of operational information for every 30 minutes they run. A four-engine jumbo jet can create 640 terabytes of data on just one Atlantic crossing; multiply that by the more than 25,000 flights flown each day, and you get an understanding of the impact that sensor and machine-produced data can make on a BI environment.
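The 640-terabyte figure follows from the per-engine rate if the crossing is taken to last about eight hours. That duration is an assumption made here to reproduce the number; the text does not state it. A rough back-of-the-envelope sketch in Python:

```python
# Figures quoted in the text.
TB_PER_ENGINE_PER_HALF_HOUR = 10
ENGINES = 4
FLIGHTS_PER_DAY = 25_000

# Assumption (not in the text): the Atlantic crossing takes eight hours.
CROSSING_HOURS = 8

half_hour_intervals = CROSSING_HOURS * 2
tb_per_crossing = TB_PER_ENGINE_PER_HALF_HOUR * ENGINES * half_hour_intervals

# Taking the article's multiplication at face value, i.e. treating every
# daily flight as a four-engine Atlantic crossing. This is an illustration
# of scale, not a measured total.
tb_per_day = tb_per_crossing * FLIGHTS_PER_DAY
exabytes_per_day = tb_per_day / 1_000_000  # 1 exabyte = 1,000,000 terabytes

print(f"{tb_per_crossing} TB per crossing")
print(f"{exabytes_per_day:g} EB per day at {FLIGHTS_PER_DAY:,} flights")
```

Changing `CROSSING_HOURS` or `ENGINES` shows how sensitive the total is to the aircraft and route being considered.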
Social network data is a new and exciting source of big data that companies would like to leverage. The microblogging site Twitter serves more than 200 million users who produce more than 90 million "tweets" per day, or 800 per second. Each of these posts is approximately 200 bytes in size. On an average day, this traffic equals more than 12 gigabytes and, throughout the Twitter ecosystem, the company produces a total of eight terabytes of data per day. In comparison, the New York Stock Exchange produces about one terabyte of data per day.
In July of this year, Facebook announced it had surpassed the 750 million active-user mark, making the social networking site the largest consumer-driven data source in the world. Facebook users spend more than 700 billion minutes per month on the service, and the average user creates 90 pieces of content every 30 days. Each month, the community creates more than 30 billion pieces of content ranging from Web links, news stories, blog posts and notes to videos and photos. Not all of this information is useful to enterprise companies, but Facebook represents a goldmine of consumer data that can be integrated into CRM systems, call center applications and various business intelligence programs.
Transactional data has grown in velocity and volume at many companies. As recently as 2005, the largest data warehouse in the world was estimated to be 100 terabytes in size. Today, Wal-Mart, the world's largest retailer, is logging one million customer transactions per hour and feeding information into databases estimated at 2.5 petabytes in size.
It is no surprise that, as all of this data enters our systems, organizing and managing it becomes an enormous task. Metadata management systems are being stretched to maintain a categorical view and to deliver functionality around big data. Metadata, which many call the information about our information, is growing as quickly as data itself in big data environments.
As we have seen, big data comes from a wide variety of sources. In some cases, it resembles traditional data sources; in others, it's highly unstructured and moving at a velocity that makes it difficult to analyze.
Data, Technology and Affordability
The convergence of new technologies, growing information stores and a reduction in the overall cost and time needed for analysis has helped big data cross the chasm from innovation to early adoption. Big data is still an early-stage technology, but expect that over the next 18 months it will break double digits on project adoption. Anecdotal evidence suggests that less than 10 percent of enterprises have deployed a big data project to date.
At the center of the big data movement is Hadoop, an open source software framework created by Doug Cutting, formerly of Yahoo!. Hadoop has become the technology of choice to support applications that in turn support petabyte-sized analytics utilizing large numbers of computing nodes.
The Hadoop system consists of three projects: Hadoop Common, the Hadoop Distributed File System (HDFS) and Hadoop MapReduce. Hadoop Common is a utility layer that provides access to HDFS and the Hadoop subprojects. HDFS acts as the data storage platform for the Hadoop framework and can scale to massive size when distributed over numerous computing nodes.
Hadoop MapReduce is a powerful framework for processing data sets across clusters of Hadoop nodes. The Map and Reduce process splits the work into two stages. The Map step divides the input into smaller data sets and distributes them across the nodes of the cluster, where each node processes its share independently. As the nodes "return" their answers, the Reduce function collects and combines the information to deliver a final result. This design lets Hadoop leverage massively parallel processing (MPP), a computing advantage that technology has introduced to modern system architectures, and run on inexpensive commodity servers, dramatically reducing the upfront capital costs traditionally required to build out a massive system. Similar work in the past called for highly specialized software and hardware combinations, a significant hurdle for companies that attempted to analyze very large data sets.
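The Map and Reduce flow can be illustrated with a toy word count. The sketch below is plain Python running in a single process; it does not use Hadoop or its API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are invented for this illustration. On a real cluster the framework would run the map step on many nodes in parallel and handle the grouping between the two phases itself.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(chunk: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    for word in chunk.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group intermediate values by key, the step that sits between Map and Reduce."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks: list[str]) -> dict[str, int]:
    # Each chunk stands in for the share of input handled by one node.
    # Here the chunks are processed one after another, not in parallel.
    intermediate = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample_chunks = ["big data is big", "data is everywhere"]
    for word, count in word_count(sample_chunks).items():
        print(word, count)
```

Because each call to `map_phase` depends only on its own chunk, the chunks could be handed to separate machines without any coordination until the grouping step, which is what makes the approach a fit for clusters of commodity servers.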
Internet-age companies like Yahoo! and Facebook embraced Hadoop early on and have built out some of the largest Hadoop implementations in the world. Yahoo! continues its efforts around Hadoop in an independent spin-off called Hortonworks (as in Dr. Seuss's elephant). Yahoo! also laid a foundation for Hadoop's success by donating its Hadoop code to the Apache Software Foundation, jumpstarting the open source framework community.
Facebook is now reportedly powering the world's largest Hadoop analytic data warehouse, using HDFS to store more than 30 petabytes of data. That single project has grown 10 petabytes since 2010 and supports critical business intelligence and analytic functions for the company.
Growing the Environment
To extend the Hadoop ecosystem capabilities, new open source projects have added functionality and enterprise-ready features to the environment. Many share the colorful naming conventions of Doug Cutting's Hadoop, named after his son's toy elephant.
- Avro is a data serialization system that converts data into a fast, compact binary data format. When Avro data is stored in a file, its schema is stored with it.
- Cassandra is a column-oriented, highly scalable distributed database. The Cassandra database features a higher level of fault tolerance than HDFS.
- Chukwa is a large-scale monitoring system that provides insights into the Hadoop distributed file system and MapReduce.
- HBase is a scalable, column-oriented distributed database modeled after Google's BigTable distributed storage system. HBase is well-suited for real-time data analysis.
- Hive is a data warehouse infrastructure that provides ad hoc query and data summarization for Hadoop-supported data. Hive utilizes a SQL-like query language called HiveQL. HiveQL can also be used by programmers to execute custom MapReduce jobs.
- Mahout is a data mining library designed to work on the Hadoop framework. Mahout delivers a core set of algorithms designed for clustering, classification and batch-based collaborative filtering.
- Pig is a high-level programming language and execution framework for parallel computation. Pig works within the Hadoop and MapReduce frameworks.
- ZooKeeper provides coordination, configuration and group services for distributed applications working over the Hadoop stack.
The combination of commodity servers and open source software makes Hadoop a compelling enterprise solution.
It's Not a Perfect World
Hadoop is doing an excellent job of fostering the early adoption of big data. The framework is uniquely designed to meet the challenges of big data through scalability and cost. The additional projects that have grown up around the original framework are helping to overcome some of the initial shortcomings with better features, smarter interfaces and more robust management tools.
In the end, no solution is perfect. Hadoop suffers from shortcomings still being addressed. For example, HDFS instances communicate with a single NameNode server, creating a single point of failure if the NameNode goes offline. When this occurs, the HDFS instance must restart and pick up where it left off, causing significant delays in the work processes of the system.
HDFS is a file system, not an ACID (atomicity, consistency, isolation, durability) compliant database, and this eliminates HDFS as a primary data source for highly critical enterprise data.
The Hadoop stack is still an emerging platform, and it lacks the polished features that many companies expect. Before embarking on a Hadoop project, it is critical that you research your needs and understand existing systems and how they will work with Hadoop. It's prudent to start with a small, well-defined project prior to diving into the deep end of a new technology stack.
Big Data Is Going Mainstream
Hundreds of companies are already working with big data to add value to their BI programs. According to the Apache Hadoop website, Quantcast, a Web metrics and measurement company, is running a 3,000-core, 3,500-terabyte deployment that processes more than a petabyte of raw data each day. The University of Nebraska-Lincoln is utilizing a Hadoop cluster to store 1.6 petabytes of physics data for the computing portion of an experiment.
A deployment I have recently seen up close is Yahoo!. The Yahoo! installations are running on 100,000 CPUs in 40,000 computers, all running Hadoop. Yahoo! uses these systems to support analytics from their advertising systems and Web search. Other big data projects at Yahoo! reach beyond Hadoop to include a partnership with vendor Tableau Software that's designed to optimize ad placement on Yahoo! media properties.
The project supports 400 Yahoo! employees worldwide. The Tableau platform sits over what Yahoo! claims is the largest multidimensional database in the world, a 12-terabyte MOLAP cube that loads four billion new records daily. The Yahoo! team uses analytics to optimize ad campaigns on behalf of its advertisers, selecting the optimal ads to present and the publishers to target with the ads.
The ability to include more data in their analysis has proven valuable for Yahoo!. Tableau is able to deliver ad size, audience segments, geography, age, gender and other dimensions that were missing in prior analysis. The system serves predesigned dashboards with full slice-and-dice and drilldown capabilities while letting team members roam and explore the data environment for new ideas and insights. Executing ad hoc style queries on big data is extremely valuable, because it lets these line-of-business users discover and explore big data in a way they could not in the past.
Another interesting deployment is at eBay. eBay is working closely with Teradata to deliver big data analytics to more than 7,500 users. They on-board more than 50 terabytes of new information each day while serving millions of queries. eBay maintains 99.98+ percent availability and manages more than 100,000 data elements. Their systems process more than 100 petabytes of data every day.
Because the analytic culture at eBay is driven by exploration and testing, 85 percent of the analytic workload is new and unknown queries. eBay has built out three separate analytic environments to serve their users. The first is a six-petabyte, 500+ concurrent-user data warehouse designed for structured data and SQL access. A second 40-petabyte, 150 concurrent-user data warehouse is designed for deep analytics. Finally, there is a 20-petabyte, five to 10 concurrent-user Hadoop system to support advanced analytic workloads on unstructured data. Confirming the exploratory mindset at eBay, Oliver Ratzesberger, senior director analytics platform at the company, said recently, "The metrics you don't know are expensive - but high in potential ROI."
In broad descriptive terms, Ratzesberger's statement helps us sum up the value proposition for big data. High volumes of information in systems that support analytics at high speed will be the next trend to deliver on the promise of BI and analytics.
Through inventive technologies and design approaches, computing workloads within the enterprise are finding better, and in some cases the best, platforms for executing their mission. We can be sure that big data technology will be part of the data ecosystem that supports analytics and BI moving forward.
Welcome to the age of big data.
Shawn Rogers is vice president of research, business intelligence and data warehousing, at Enterprise Management Associates, Inc.