The explosive growth in the amount of data created worldwide continues to accelerate and surprise us in sheer volume, though experts could see the signposts along the way. Gordon Moore, co-founder of Intel and namesake of Moore's law, forecast in 1965 that the number of transistors that could be placed on an integrated circuit would double year over year. Since then, this "doubling principle" has been applied to many areas of computing and has more often than not proved correct.
When it comes to data, however, not even Moore's law seems able to keep pace with the exponential growth of the past several years. Recent IDC research on digital data indicates that in 2010, the amount of digital information in the world surpassed one zettabyte. That's one trillion gigabytes of information. To put that in perspective, a blogger at Cisco Systems noted that a zettabyte is roughly the size of 125 billion 8GB iPods fully loaded.
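As a sanity check on those comparisons, the unit arithmetic can be sketched in a few lines (decimal, SI-style units are assumed throughout):

```python
# Unit arithmetic behind the zettabyte comparisons above (decimal SI units).
ZETTABYTE_BYTES = 10 ** 21  # 1 ZB
GIGABYTE_BYTES = 10 ** 9    # 1 GB

# One zettabyte expressed in gigabytes: one trillion GB.
gigabytes_per_zettabyte = ZETTABYTE_BYTES // GIGABYTE_BYTES

# Number of fully loaded 8GB iPods needed to hold one zettabyte.
ipods_8gb = gigabytes_per_zettabyte // 8

print(gigabytes_per_zettabyte)  # 1000000000000 (one trillion)
print(ipods_8gb)                # 125000000000 (125 billion)
```

Both figures in the text check out: a trillion gigabytes, or 125 billion 8GB devices.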
As the overall digital universe has expanded, so has the world of enterprise data. The good news for data management professionals is that our working data won't reach zettabyte scale for some time, but in the meantime we will deal with unprecedented data growth from a wide variety of sources and systems.
The term "big data" has emerged to describe this growth along with the systems and technology required to leverage it. As with many new technologies, the term has yet to be universally defined, but generally speaking, big data represents data sets that can no longer be easily managed or analyzed with traditional or common data management tools, methods and infrastructures.
At its core, big data is defined by a set of characteristics that compound this challenge: high volume, high velocity and, in many cases, a wide variety of data structures. These characteristics bring new challenges to data analysis, search, data integration, information discovery and exploration, reporting and system maintenance.
Big Data Sources
Early adopters of big data included scientific communities with access to expensive supercomputing environments designed for analyzing massive data sources. These projects attacked the volume side of the big data challenge but not necessarily the velocity and variety aspects - and they were expensive. Pioneering work focused on specialty projects like genomics research or pharmaceutical research data using advanced analytics to discover information that was too difficult to identify in traditional environments. Today, the scope of big data is growing beyond niche sources to include sensor and machine data, transactional data, metadata, social network data and consumer-authored information.
An example of sensor and machine data is found at the Large Hadron Collider at CERN, the European Organization for Nuclear Research. CERN scientists can generate 40 terabytes of data every second during experiments.
Similarly, Boeing jet engines can produce 10 terabytes of operational information for every 30 minutes they turn. A four-engine jumbo jet can create 640 terabytes of data on just one Atlantic crossing; multiply that by the more than 25,000 flights flown each day, and you get an understanding of the impact that sensor and machine-produced data can make on a BI environment.
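The crossing figure is consistent with the per-engine rate, assuming an eight-hour Atlantic flight (the flight duration is our assumption for illustration, not a figure from the article):

```python
# Back-of-the-envelope check of the jumbo-jet data figures above.
TB_PER_ENGINE_PER_30_MIN = 10  # from the article
ENGINES = 4                    # four-engine jumbo jet
CROSSING_HOURS = 8             # assumed Atlantic flight time

tb_per_hour = TB_PER_ENGINE_PER_30_MIN * 2 * ENGINES  # 80 TB per hour for all engines
tb_per_crossing = tb_per_hour * CROSSING_HOURS

print(tb_per_crossing)  # 640
```

At 80 terabytes per flight hour across four engines, an eight-hour crossing yields the 640 terabytes cited.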
Social network data is a new and exciting source of big data that companies would like to leverage. The microblogging site Twitter serves more than 200 million users who produce more than 90 million "tweets" per day, or 800 per second. Each of these posts is approximately 200 bytes in size. On an average day, this traffic equals more than 12 gigabytes and, throughout the Twitter ecosystem, the company produces a total of eight terabytes of data per day. In comparison, the New York Stock Exchange produces about one terabyte of data per day.
In July of this year, Facebook announced that it had surpassed the 750 million active-user mark, making the social networking site the largest consumer-driven data source in the world. Facebook users spend more than 700 billion minutes per month on the service, and the average user creates 90 pieces of content every 30 days. Each month, the community creates more than 30 billion pieces of content ranging from Web links, news, stories, blog posts and notes to videos and photos. Not all of this information is useful to enterprises, but Facebook represents a goldmine of consumer data that can be integrated into CRM systems, call center applications and various business intelligence programs.
Transactional data has grown in velocity and volume at many companies. As recently as 2005, the largest data warehouse in the world was estimated to be 100 terabytes in size. Today, Wal-Mart, the world's largest retailer, is logging one million customer transactions per hour and feeding information into databases estimated at 2.5 petabytes in size.
It is no surprise that, as all of this data enters our systems, organizing and managing it becomes an enormous task. Metadata management systems are being stretched to maintain a categorical view of big data and to deliver functionality around it. Metadata, often described as the information about our information, is growing as quickly as the data itself in big data environments.
As we have seen, big data comes from a wide variety of sources. In some cases, it resembles traditional data; in others, it is highly unstructured and moving at a velocity that makes it difficult to analyze.
Data, Technology and Affordability
The convergence of new technologies, growing information stores and reductions in the cost and time required for analysis has helped big data jump the chasm from innovation to early adoption. Big data is still an early-stage technology: anecdotal evidence suggests that fewer than 10 percent of enterprises have deployed a big data project to date, but expect that figure to break into double digits over the next 18 months.