"I have too much data to analyze" is a common complaint of business analysts. Today's definition of a very large data set has evolved dramatically from what was considered large ten years ago. During my early days at Arbor Software, the company that pioneered the concept of online analytical processing (OLAP) and later became Hyperion Solutions (which was acquired by Oracle), I was chartered with extending our product's capabilities to support very large data sets. At the time, several gigabytes of data were considered large... and tens of gigabytes were considered huge. My work resulted in a set of product improvements that helped to analyze gigabytes of data at reasonable speeds.

But even then, it was already becoming apparent that analysis of very large data sets would continue to be a significant problem. With recent advances in storage technology and the steady decline in storage prices, a terabyte of data is not uncommon - and large data sets are now measured in petabytes.

While storage technology advances have been staggering, progress in analytical data processing has been marginal. It is possible to find a piece of data in a petabyte-size storage system (Google, Yahoo! and other search engine technology vendors are a testament to that), but analyzing this data to find correlations and trends meaningful to a business analyst remains a huge challenge.

"Why do we need to analyze this data - and what is the nature of this data that grew in size by a factor of millions over the last ten years?" The answer is simple: we live in a world that is getting more digitized and more connected every day. We use networks to talk, shop, read and work. We all have a digital life that is only growing bigger. Our music and photo libraries are stored and managed by network providers, our answering machines are virtual mailboxes, instant messenger IDs are on our business cards and our reference libraries are online. We spend hours every day in this digital world, so it's no wonder the amount of data we access online is growing at an exponential - and unstoppable - rate.

What's more, most businesses today use IT to support every conceivable business function. What this means is that trillions of digital interactions and transactions are generated and carried out by various networks hourly and daily. Where does all this data go? Some ends up in databases; most ends up in log files discarded on a regular basis, because even a petabyte-sized storage system is not large enough to keep all this transaction data gathered over an extended period of time.

The ability to analyze this largely untapped data is the holy grail for business intelligence (BI) practitioners and business owners alike. Imagine, for a moment, what business would be like if companies could analyze all the data flowing through their systems, instead of just a tiny fraction of it:

  • A network security analyst could preempt an insider intrusion threat if he could quickly analyze all network transactions along with the HR database transactions related to hiring, firing and compensation changes.
  • A CFO could discover financial improprieties if he could analyze financial transactions along with network system log transactions.
  • A marketer could make real-time adjustments to a broadly executed marketing campaign if he could analyze transactions from the Web site along with transactions from the enterprise resource planning (ERP) system and call detail records from the call center.

There is much insight to be gained by analyzing large volumes and all types of corporate data, and yet we rarely ask those questions, because our existing BI technologies lack the analytical capabilities to answer them.
This article examines the challenges of analyzing large volumes of complex transactional data and proposes new approaches to data analysis that go beyond conventional BI technology.

Data and Information

Too often the words "data" and "information" are used interchangeably, when there is a very significant distinction between the two. An easy way to think about the difference is as follows:

  • Data is the input for the analysis system, and
  • Information is the output.

Analysis is the process of turning data into information.
While it may seem basic, this distinction is important, because it opens up a different way to approach the problem of analyzing large volumes of data, instead of relying on traditional database- and data warehouse-centric approaches.

Let's examine a simple scenario of network flow data analysis:

Network flow data is captured for every single chunk (packet) of data that moves on the network. The simple operation of a person looking at a page on a Web site will generate quite a few network flow transactions that capture both the request going from the user to the Web site, and the response going from the Web site to the user. A single network flow transaction comprises a source (a source IP address), a target (a destination IP address), and some notion of the size of the data moved. At this level of granularity, the network flow transaction is a good example of data - not information - with the analytical value of an individual transaction close to zero. But as soon as multiple network flow transactions are associated with a single Web page lookup and are aggregated, one could get access to some basic operational information, such as:

  • How much data was transferred for a particular Web page?
  • How long did it take?
  • Were there any errors generated in the process?

And suddenly the Web site operator gains valuable insight into overall Web site performance.
But there is much more to be gleaned from these transactions than this simple information. If we continue with the data aggregation and summarization exercise in this example, we might even get to business information such as:

  • Quality of service - how much traffic does a particular user generate on the network over a fixed period of time, and how many network errors occur in the process?
  • IT chargeback - how much network traffic does a business application generate over a month?
  • Compliance and intrusion detection - which users have the highest traffic volume on the network?

These are the questions a business analyst would be interested in asking. But traditional BI tools have not been able to deliver the answers.
This example not only illustrates the differences between data and information, it also explains what needs to happen to improve the process of creating operational and business information from data.
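
To make the aggregation exercise above concrete, the following is a minimal Python sketch of rolling raw flow records up into per-page operational information and per-user business information. It is an illustration only: the field names (page, user, bytes, ms, error) and the tiny in-memory list are assumptions, and real flow records would arrive from a network collector in far larger volumes.

    # Sketch only: aggregate raw flow records into operational and business metrics.
    # Field names and sample values are hypothetical.
    from collections import defaultdict

    flows = [
        {"page": "/home", "user": "10.0.0.5", "bytes": 5120, "ms": 40, "error": False},
        {"page": "/home", "user": "10.0.0.5", "bytes": 2048, "ms": 25, "error": False},
        {"page": "/login", "user": "10.0.0.7", "bytes": 1024, "ms": 90, "error": True},
    ]

    # Operational information: per-page totals (bytes moved, time taken, errors seen).
    per_page = defaultdict(lambda: {"bytes": 0, "ms": 0, "errors": 0})
    for f in flows:
        agg = per_page[f["page"]]
        agg["bytes"] += f["bytes"]
        agg["ms"] += f["ms"]
        agg["errors"] += f["error"]

    # Business information: per-user traffic and error counts over the period -
    # the raw material for quality-of-service and chargeback questions.
    per_user = defaultdict(lambda: {"bytes": 0, "errors": 0})
    for f in flows:
        per_user[f["user"]]["bytes"] += f["bytes"]
        per_user[f["user"]]["errors"] += f["error"]

    print(dict(per_page))
    print(dict(per_user))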

Requirements for the Process of Converting Data into Information

The network flow data example touches on every single requirement for an effective process of converting data into information, including:

  • A large data set must be associated with other data (for example, network flow data associated with business application data) to produce meaningful information.
  • Several processing steps (aggregation and summarization) may be necessary, first to turn data into operational information and then into business information.
  • The entire process has to be reasonably fast (it doesn't help anyone to identify a security breach a month after it occurs).

Unfortunately, these three requirements conflict with each other. The more data that needs to be processed and the more intelligence we want to gain from it, the longer the process is going to take. It's not surprising, then, that most analytical applications designed for large data sets such as Web and network traffic data are focused on:

  • Event correlation, because it can be done on a smaller data set; or
  • Operational information, because it requires the least amount of processing.

As we try to extract more valuable information from data, the analysis process takes longer and longer, and as we try to apply this process to large data sets, we begin to hit the performance brick wall.
We know that conventional analytical solutions are severely limited in their capability to get information out of large data sets; so let's explore the alternatives.

Extracting Information from Large Data Volumes Remains Challenging

There are several known approaches to information extraction when dealing with very large data volumes:

  • Search - a technique that is often confused with analysis. While a search process is efficient when applied to large data volumes, it merely finds what is already inside the data instead of producing information; extracting information is a process of transforming data.
  • BI or business analytics (BA) - an approach relying on database technologies. While this is the most common approach to the problem, it is fundamentally flawed and breaks down when dealing with very large data sets.

With the BI/BA approach, the main roadblock to analyzing very large data sets is latency:

  • If terabytes of data are generated hourly and daily, a highly scalable database is necessary just to keep up with this data volume and get the new data into the database. We saw a major bank trying to analyze its daily Web traffic data using a database. It required about 23 hours to add 24 hours' worth of data to the database, and then another two hours to run analytical queries against this data - roughly 25 hours of processing for every 24-hour day. It was only natural that the bank was falling behind every day and was forced to start sampling data, which created credibility problems for its business analysts.
  • If terabytes of data need to be perused in order to run an analytical query, and if the database is growing daily, the latency of analytical queries will increase exponentially and eventually will render the entire system unusable. This is the main reason why business information is rarely available for very large data sets.

New Technologies Are Emerging to Address the Data Volume Challenge

Three problems need to be solved when dealing with analysis of very large data sets:

  • First, it is necessary to create associations between multiple data sources that can be quite complex. For instance, mapping a Web page to network traffic to understand the data flow across different URLs on the page is a nontrivial problem, if addressed in a generic way.
  • Second, there must be a way to do data analysis without first putting data into a database. Data insertion and subsequent data indexing are among the slowest of database operations.
  • Third, there must be a way to analyze only new data without sacrificing the quality of information. The ability to create information from new data, as well as from information already produced from old data, is the only way to deal with the exponential complexity of running analytical queries against very large data sets.
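
The third point amounts to incremental aggregation: each new batch of data is summarized and then merged into the aggregates already produced from older data, so the raw history never has to be re-scanned. A minimal sketch of the idea follows; the batches and field names are hypothetical, and this is not a description of any particular product.

    # Sketch of incremental aggregation: fold summaries of new data into the
    # information already computed from old data. Names are illustrative only.
    from collections import Counter

    def summarize(batch):
        """Turn a batch of raw records into per-user byte counts."""
        totals = Counter()
        for rec in batch:
            totals[rec["user"]] += rec["bytes"]
        return totals

    running_totals = Counter()  # information retained from earlier batches

    def absorb(new_batch):
        """Merge only the new data into the running aggregates."""
        running_totals.update(summarize(new_batch))

    absorb([{"user": "10.0.0.5", "bytes": 4096}])
    absorb([{"user": "10.0.0.5", "bytes": 1024}, {"user": "10.0.0.7", "bytes": 512}])
    print(running_totals)  # Counter({'10.0.0.5': 5120, '10.0.0.7': 512})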

Some technologies attempt to solve parts of this puzzle. For example, streaming databases address the third problem by focusing on data analysis within a relatively small time window and thus analyzing only new data.
Specialized analytical applications, such as Web analytics and network traffic analytics, chip away at the first two problems by streamlining the database insertion process. Through Web site or network instrumentation and exact mapping of this instrumentation to the database, these applications can gain some performance improvements. But only a few companies are addressing the problem as a whole.
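
The streaming approach can be pictured as a moving time window: only records that fall inside the window contribute to the running answer, and older records are expired as new ones arrive. A small illustrative sketch, with the 60-second window and the record fields chosen arbitrarily:

    # Sketch of windowed stream analysis: keep a running total over the most
    # recent 60 seconds of traffic only. Window size and fields are assumptions.
    from collections import deque

    WINDOW_SECONDS = 60
    window = deque()       # (timestamp, bytes) pairs currently inside the window
    bytes_in_window = 0

    def observe(timestamp, nbytes):
        """Add a new record, expire anything older than the window, return the total."""
        global bytes_in_window
        window.append((timestamp, nbytes))
        bytes_in_window += nbytes
        while window and window[0][0] < timestamp - WINDOW_SECONDS:
            _, old_bytes = window.popleft()
            bytes_in_window -= old_bytes
        return bytes_in_window

    observe(0, 1000)
    observe(30, 500)
    print(observe(70, 200))  # 700: the record from t=0 has aged out of the window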

Only a few years ago, a couple dozen gigabytes of data was considered very large. But with advances in storage technology and lower costs for storage, it's not unusual for companies today to deal with terabytes - or even petabytes - of data. What has eluded most companies is the ability to convert this data into meaningful business information.

However, new alternatives to traditional BI are coming onto the market, and not a moment too soon. I anticipate that as high-volume data analysis solutions become more pervasive, the bar will be raised dramatically for what people expect from their business information systems. Just as the basic reporting capabilities of a decade ago forever changed how we manage and measure our businesses, so too will high-volume data analysis become a powerful and necessary requirement for doing business in the coming years.

Case Study: One of the Largest Content Delivery Networks in the Country

Objective: A reporting and analytics solution to provide timely visibility into very large volumes of data.

Challenges: Data collected is semi-structured and complex, and customers demand near real-time analysis and reporting.

Solution: XML-based BI solution

Results

  • A rapidly deployed, easy-to-use reporting and analytics solution;
  • Built to handle huge volumes of complex semi-structured data; and
  • A complete set of BI features and functionality at a fraction of the cost and time of a traditional BI deployment.

The subject of this case study is the fastest-growing global service provider for accelerating applications and content over the Internet. The company provides network infrastructure on demand, optimizing application and content delivery while shifting bandwidth, computing and storage requirements to its own infrastructure. Large multinational corporations such as Verizon Business and Hewlett Packard use the company's services to ensure that their customers receive LAN-like response times when accessing their Web applications from anywhere in the world.

Business Challenge

As part of its core solution offering, the company collects large quantities of data, including Web site response times, throughput and other pieces of information related to overall application performance and availability. Its customers were increasingly asking for a reporting and analytics solution integrated directly into the company's global overlay network that could be used to sift through, analyze and provide visibility into overall application performance. Over time, as the company continued to broaden its reach into larger accounts, providing this capability became absolutely essential. The operations team was also looking for tools that would increase its ability to measure the operational efficiency of the network, manage SLAs and provide "low latency" technical support by being able to pinpoint issues through the use of advanced analytics.

These requirements posed a significant data analysis challenge for the product management team. The data that the company collects for its customers is semi-structured and quite complex, and data volumes are massive. The company's network manages hundreds of millions of transactions daily, a number expected to reach the billions in less than a year. This level of activity generates 500 gigabytes of log files daily, a figure expected to quadruple within a year and eventually reach two terabytes per day, or nearly one petabyte of data annually.

At the same time, the company needed to be able to provide near real-time analysis and reporting to meet customer and internal demands and differentiate itself from its competitors. The company knew that processing such large volumes of nontabular data using a traditional BI system would be prohibitively expensive and slow. It needed an entirely different approach to BI in order to achieve its goal of providing comprehensive, fast reporting and analytics to its customer base and internal operations.

The Solution

The company implemented an XML-based analytical solution to provide its customers with immediate visibility into the performance of the managed Web infrastructure, while giving its internal operations team fast insight into network performance and the ability to perform customer value analysis and pinpoint problems quickly. The technology was designed from the ground up to concurrently address large volumes of traditional and nontraditional data sources, such as the activity log files collected in the company's system.

The solution uses XML as a common layer to significantly reduce system complexity while offering advanced functionality that cannot be achieved by traditional BI technology. By using XML to tie together different pieces of the BI stack into an integrated, "virtual" technology stack, the solution operates on data where it resides, with no movement or restructuring of data required.

The company was able to analyze large amounts of data coming directly out of activity log files and various network applications without the need to transform and store this data in a data warehouse. This results in extremely high data throughput and near real-time analysis and reporting.
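
The details of the vendor's XML layer are not spelled out here, but the general pattern of analyzing log data in place can be illustrated with a short, hypothetical sketch: stream over a semi-structured activity log, parse each record as it is read and compute an aggregate on the fly, with no staging database in between. The record format (one small XML element per line) and its attributes are invented for illustration.

    # Hypothetical illustration of analyzing a semi-structured log in place,
    # without loading it into a database or data warehouse first.
    import xml.etree.ElementTree as ET

    def avg_response_time_by_customer(path):
        """Stream over a log of one XML record per line; average response time per customer."""
        totals, counts = {}, {}
        with open(path) as log:
            for line in log:
                if not line.strip():
                    continue
                rec = ET.fromstring(line)   # e.g. <hit customer="acme" response_ms="120"/>
                cust = rec.get("customer", "unknown")
                totals[cust] = totals.get(cust, 0.0) + float(rec.get("response_ms", 0))
                counts[cust] = counts.get(cust, 0) + 1
        return {cust: totals[cust] / counts[cust] for cust in totals}

    # Hypothetical usage: print(avg_response_time_by_customer("activity.log"))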

The company can now offer its customer base rapidly deployed reporting and analytics services that are easy to use and built to handle the complexity of the data volumes and structures that are typical in today's highly interactive Web environment. At the same time, it can now also provide internal users with immediate insight into network performance while enabling fast customer value analysis and the ability to quickly identify and handle problems as they arise. The solution was built to scale as the business grows, both in the number of customers and in data volumes. Scalability is increasingly valuable as the company pursues larger customers who use analytics as a baseline for making technology investment decisions.
