"I have too much data to analyze" is a common complaint of business analysts. What counts as a very large data set today is dramatically different from what was considered large ten years ago. During my early days at Arbor Software, the company that pioneered the concept of online analytical processing (OLAP) and later became Hyperion Solutions (which was in turn acquired by Oracle), I was charged with extending our product's capabilities to support very large data sets. At the time, several gigabytes of data were considered large, and tens of gigabytes were considered huge. My work resulted in a set of product improvements that made it possible to analyze gigabytes of data at reasonable speeds.
But even then, it was already becoming apparent that analysis of very large data sets would continue to be a significant problem. With recent advances in storage technology and the decline in storage prices, a terabyte of data is not uncommon - and large data sets are now measured in petabytes.
While storage technology advances have been staggering, progress in analytical data processing has been marginal. It is possible to find a piece of data in a petabyte-size storage system (Google, Yahoo! and other search engine technology vendors are a testament to that), but analyzing this data to find correlations and trends meaningful to a business analyst remains a huge challenge.
"Why do we need to analyze this data - and what is the nature of this data that grew in size by a factor of millions over the last ten years?" The answer is simple: we live in a world that is getting more digitized and more connected every day. We use networks to talk, shop, read and work. We all have a digital life that is only growing bigger. Our music and photo libraries are stored and managed by network providers, our answering machines are virtual mailboxes, instant messenger IDs are on our business cards and our reference libraries are online. We spend hours every day in this digital world, so it's no wonder the amount of data we access online is growing at an exponential - and unstoppable - rate.
What's more, most businesses today use IT to support every conceivable business function. As a result, trillions of digital interactions and transactions are generated and carried across networks hour after hour, day after day. Where does all this data go? Some ends up in databases; most ends up in log files that are discarded on a regular basis, because even a petabyte-sized storage system is not large enough to hold all the transaction data gathered over an extended period of time.
The ability to analyze this largely untapped data is the holy grail for business intelligence (BI) practitioners and business owners alike. Imagine, for a moment, what business would be like if companies could analyze all the data flowing through their systems, instead of just a tiny fraction of it:
- A network security analyst could preempt an insider intrusion threat by quickly analyzing all network transactions along with the HR database transactions related to hiring, firing and compensation changes.
- A CFO could discover financial improprieties by analyzing financial transactions along with network system log transactions.
- A marketer could make real-time adjustments to a broadly executed marketing campaign by analyzing transactions from the Web site along with transactions from the enterprise resource planning (ERP) system and call detail records from the call center.
There is much insight to be gained by analyzing large volumes and all types of corporate data, and yet we refrain from asking such questions, because our existing BI technologies lack the analytical capabilities to answer them.
This article examines the challenges of analyzing large volumes of complex transactional data and proposes new approaches to data analysis that go beyond conventional BI technology.
Data and Information
Too often the words "data" and "information" are used interchangeably, when there is a very significant distinction between the two. An easy way to think about the difference is as follows:
- Data is the input for the analysis system, and
- Information is the output.
Analysis is the process of turning data into information.
While it may seem basic, this distinction is important, because it opens up a different way to approach the problem of analyzing large volumes of data, instead of relying on traditional database- and data warehouse-centric approaches.
Let's examine a simple scenario of network flow data analysis:
Network flow data is captured for every single chunk (packet) of data that moves on the network. The simple operation of a person looking at a page on a Web site will generate quite a few network flow transactions that capture both the request going from the user to the Web site, and the response going from the Web site to the user. A single network flow transaction consists of a source (a source IP address), a target (a destination IP address), and some notion of the size of the data moved. At this level of granularity, the network flow transaction is a good example of data - not information - with the analytical value of an individual transaction close to zero. But as soon as multiple network flow transactions are associated with a single Web page lookup and are aggregated, one gains access to some basic operational information, such as:
- How much data was transferred for a particular Web page?
- How long did it take?
- Were there any errors generated in the process?
And suddenly the Web site operator gains valuable insight into overall Web site performance.
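To make the step from data to information concrete, here is a minimal Python sketch of the aggregation described above. It is an illustration under stated assumptions, not a description of any particular product. The source address, destination address and byte count come from the scenario itself. The page_id, start_ms, end_ms and error fields are hypothetical additions, introduced only so that the three questions above can be answered, and the names FlowRecord and summarize_by_page are likewise invented for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    """One network flow transaction: raw data with little value on its own."""
    src_ip: str     # source IP address (from the scenario)
    dst_ip: str     # destination IP address (from the scenario)
    num_bytes: int  # size of the data moved (from the scenario)
    page_id: str    # ASSUMPTION: key tying the flow to one Web page lookup
    start_ms: int   # ASSUMPTION: flow start time in milliseconds
    end_ms: int     # ASSUMPTION: flow end time in milliseconds
    error: bool     # ASSUMPTION: True if the flow reported an error


def summarize_by_page(flows):
    """Aggregate raw flow records into one summary per page lookup."""
    grouped = defaultdict(list)
    for flow in flows:
        grouped[flow.page_id].append(flow)

    summaries = {}
    for page_id, group in grouped.items():
        summaries[page_id] = {
            # How much data was transferred for this page?
            "total_bytes": sum(f.num_bytes for f in group),
            # How long did it take, from first flow start to last flow end?
            "elapsed_ms": max(f.end_ms for f in group)
            - min(f.start_ms for f in group),
            # Were there any errors generated in the process?
            "error_count": sum(1 for f in group if f.error),
        }
    return summaries


# Made-up sample records: one request and two responses for "/home",
# plus a single request for "/pricing".
sample_flows = [
    FlowRecord("10.0.0.5", "192.0.2.10", 500, "/home", 0, 20, False),
    FlowRecord("192.0.2.10", "10.0.0.5", 42000, "/home", 20, 180, False),
    FlowRecord("192.0.2.10", "10.0.0.5", 8000, "/home", 30, 210, True),
    FlowRecord("10.0.0.7", "192.0.2.10", 450, "/pricing", 100, 115, False),
]
summary = summarize_by_page(sample_flows)
```

Each record in sample_flows says almost nothing by itself, while each entry in summary answers the operator's three questions for one page. In practice the hard part is the association step that the page_id field glosses over: real flow records do not carry a page identifier, so it would have to be derived, for example by joining the flows with Web server logs.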
But there is much more to be gleaned from these transactions than this simple information. If we continue with the data aggregation and summarization exercise in this example, we might even get to business information such as:








