Unstructured Data: Reading Between the Lines
Today's information environment has evolved in ways that would have been hard for the early pioneers of business intelligence (BI) to envision.
The data landscape now encompasses a dizzying array of new information channels, new sources of data, and new analysis and reporting imperatives. According to analyst groups, nearly 80 percent of today's data is unstructured, and new information channels such as Web, email, voice over IP, instant messaging (IM) and text messaging are rapidly creating huge stores of nontraditional data.
Never has there been a better opportunity to gain real insight from data - or a bigger challenge for BI practitioners and technologists. While most businesses are eager to turn this new data into useful information, many find that their current BI technology, designed for a simpler data landscape, cannot deliver robust and thorough analysis of mission-critical unstructured data.
This article examines some of the challenges BI practitioners face in addressing unstructured data and proposes a new set of requirements for the next generation of BI technology designed to overcome these challenges.
The Evolution of Unstructured Data
All data is "born" unstructured and then acquires structure through human intervention. Unfortunately, in the process of adding structure to data, we generally lose significant information along the way.
The loss of information in the name of structure goes back to the early origins of data. Some of the earliest recorded business data consists of sales transactions found on clay tablets from Mesopotamia. These sales records represent the first example of structured business data: while the final transaction was recorded, the steps the buyer and seller went through to arrive at the price - the "haggling" data - are lost forever.
Not much has changed since then. Buyer and seller still need to agree on a deal and settle by means of a contract, followed by a sales record that captures some very basic information such as a buyer, a seller, a price, a product and the like. No matter how extensive business negotiations might have been and no matter how much data was generated in the process, only a small and repeatable portion of this data gets captured in a highly structured form, such as a sales transaction record.
Today, however, haggling data can be captured in a semistructured way using XML and HTML, among other formats, which involves tagging the data and then managing the data using the tags. This auxiliary business data can be extremely important to understand because it holds the key to improving business conversion rates, to achieving higher levels of customer satisfaction and to increasing the competitiveness of the business. Although we can now, for the first time, begin to record this haggling data for the ages, analysis of this data still remains a computational challenge.
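To make the idea concrete, here is a minimal sketch of how haggling data might be captured in tagged XML and read back with the negotiation steps intact. The element and attribute names are invented for illustration and do not come from any real schema:

```python
# Hypothetical sketch: recording "haggling" data (offers and counteroffers)
# as tagged XML rather than as a single structured transaction row.
import xml.etree.ElementTree as ET

negotiation = """
<negotiation buyer="acme" seller="globex" product="widget">
  <offer party="seller" price="120"/>
  <offer party="buyer" price="90"/>
  <offer party="seller" price="105"/>
  <accepted price="105"/>
</negotiation>
"""

root = ET.fromstring(negotiation)
# The final transaction is all a structured system would normally keep...
final_price = int(root.find("accepted").get("price"))
# ...while the tags preserve the haggling steps that would otherwise be lost.
offers = [(o.get("party"), int(o.get("price"))) for o in root.findall("offer")]

print(final_price)  # 105
print(offers)       # [('seller', 120), ('buyer', 90), ('seller', 105)]
```

The tags are what make the data manageable: the same document yields both the structured fact (the accepted price) and the semistructured history around it.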
Structured data has become synonymous with relational data, while unstructured data is commonly associated with file servers and document management systems. But what about data that falls right in between: unstructured data that has started on its evolutionary path but has not made it all the way to completely structured?
Data in this state of flux is often referred to as "extensible data." Extensible data is unstructured data in the state of transition to a structured form. XML data, HTML pages, PDF documents and email messages, HTTP traffic and clickstream data, search results and application log files are all examples of extensible data. This is the haggling data discussed earlier.
There is significant value in tapping into extensible data, with many critical application areas that stand to benefit. For example, being able to link call center log files with other sources of data could enable a call center manager to understand what is driving an increase in interaction handling times; having insight into call log data generated by self-service speech applications could potentially decrease transfers to live agents.
Figure 1 offers some examples of possible application areas for extensible data analysis.
Figure 1: Potential Application Areas for Extensible Data Analysis
The Extensible Data Challenge
While the benefits of analysis may be significant, extensible data has certain characteristics that make it hard to address with traditional BI technologies.
Extensible data is rich in content and is often rich in structure, and that means structural complexity. Extensible data is characterized by dimensional, hierarchical and containment relationships going above and beyond traditional data models used for structured data (clickstream data, for example, is hierarchical because it is generated as a result of visiting the hierarchical environment of a Web site).
Extensible data is quite variable. Every instance of extensible data can potentially be different from another instance, not just regarding data values but also data structures (e.g., a collection of Web pages on a single Web site). Data management techniques applied to one instance of extensible data may not work on the next instance; for example, some customers might use IM to obtain technical support and others may use a self-service Web site for the same purpose.
Extensible data is dynamic. Extensible data is prone to change because structure is constantly being imposed upon it. For example, an application log is always subject to change following any changes in the application logic and/or application usage patterns. With constant change also comes an increasing number of versions of the extensible data structure.
Extensible data volumes are huge. Nonrelational data is often produced in volumes that massively exceed the typical output of a transactional system, creating a data management nightmare even before this data becomes an analytical challenge. Extensible data encompasses much of the data captured during the course of a business interaction. For example, hundreds of thousands of insurance quotes generate mountains of extensible data, yet only a fraction of those quotes become policies that are eventually stored in a structured format in the system.
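The variability and versioning problems described above can be sketched in a few lines. In this illustrative example, a later release of an application appends a field to its log format, so a parser over the log must absorb both shapes; the field names and log lines are made up for demonstration:

```python
# Two "versions" of the same application log: a later release appended a
# channel field. A parser over extensible data must tolerate both shapes.
v1_line = "2006-01-02 10:00:00 QUOTE 500"
v2_line = "2006-03-01 10:00:00 QUOTE 650 web"

def parse(line):
    parts = line.split()
    record = {
        "date": parts[0],
        "time": parts[1],
        "event": parts[2],
        "amount": int(parts[3]),
    }
    # Newer versions add fields; default them so older records still parse.
    record["channel"] = parts[4] if len(parts) > 4 else "unknown"
    return record

print(parse(v1_line)["channel"])  # unknown
print(parse(v2_line)["channel"])  # web
```

Multiply this by dozens of structural versions and many distinct channels, and the fragility of any fixed relational mapping becomes apparent.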
Limitations of Traditional BI Technology
Many businesses are eager to turn extensible data into useful information, but they have found that their current BI technology cannot deliver thorough analysis of this data. Traditional BI infrastructure has inherent technological constraints that limit its ability to address this data.
The first issue is that extensible data must undergo multiple transformations in a traditional BI system before it can be analyzed. Data from transactional systems is in a form that is best suited for data collection, but data for analysis needs to be in a very different form. Data must therefore first be moved from the place where it is being collected (transactional systems) to the place where it is being analyzed (data warehouse). Another data transformation occurs when data from the data warehouse is brought into an analytic model for aggregation, consolidation and rollup. Only then can the actual analysis and reporting take place. The result is an excessively slow time to analysis.
Data integration is a huge technical challenge. New information channels have overlapping business objectives, and the data from these channels must be analyzed together and correlated in order to understand their impact on a particular aspect of the business. With traditional BI, data from different channels must first be brought together into a common model in order to analyze and report on it. This adds another layer of complexity to the picture and further delays analysis.
Vendors of traditional BI technologies often recommend and apply workaround solutions to deal with these challenges.
- Reduce the amount of data pulled into the BI system. The approach favored by some OLAP and ROLAP vendors entails employing a number of extract, transform and load techniques to force a subset of nonrelational data into a relational format. This approach not only causes significant loss of information, but it also strips data of useful metadata for data analysis and reporting.
- Introduce unconventional data at the final reporting stage. While this approach does not cause any information loss, it does not allow for analysis of very significant data segments, such as XML, HTML, HTTP, email and text.
- Build a fixed analytic model with fixed data mapping. The main drawback of this approach is lack of flexibility. Any additions or changes to analysis or data sources require a redo of both the model and the mapping. There are hundreds of vendors using this approach. Most of them offer analytic applications as point solutions and do not scale beyond a single solution.
- Simplify the data transformation processes by keeping the entire BI stack on a single hardware appliance. While this approach achieves performance gains by streamlining data movement through the BI stack, it still shares the drawbacks of a traditional BI architecture.
Companies are sitting on a goldmine of data they cannot address using yesterday's technologies. In order to address this high-value but complex category of data, a new approach is needed. The next generation of analytical platforms must be able to:
- Operate on both relational and nonrelational data at the same time;
- Operate on data in place, with no need to transform and move data into a common relational format;
- Generate aggregates, summaries and rollups without a data warehouse;
- Automate building and changing of analytic models; and
- Quickly analyze large volumes of data.
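The second and third requirements - operating on data in place and generating aggregates without a data warehouse - can be sketched as aggregation run directly against XML records, with no relational transform in between. The `<call>` schema here is hypothetical:

```python
# A minimal sketch of "operate on data in place": compute an aggregate
# directly over XML call records, with no ETL step and no warehouse.
import xml.etree.ElementTree as ET

log = """
<calls>
  <call agent="a1" seconds="240" outcome="resolved"/>
  <call agent="a2" seconds="600" outcome="transferred"/>
  <call agent="a1" seconds="300" outcome="resolved"/>
</calls>
"""

root = ET.fromstring(log)
durations = [int(c.get("seconds")) for c in root.findall("call")]
avg_handling_time = sum(durations) / len(durations)

print(avg_handling_time)  # 380.0
```

The point is architectural rather than algorithmic: the aggregate is computed where the data already lives, so new fields or new record versions never force a remodeling of a warehouse schema first.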
New analytic platforms are being introduced to the market that use XML as a common layer to dramatically reduce system complexity while offering functionality that cannot be achieved by traditional BI technology. These open, XML-based architectures can combine data from nonrelational sources with traditional transactional systems or data warehouses to provide an unprecedented view into what is driving business performance.
XML-based analytics technology has already been embraced by early adopters, particularly in areas that generate large amounts of extensible data, such as contact centers. For example, a large contact center that was dealing with an increase in its average call handling time (AHT) initially suspected that flawed call routing was the cause. The center needed a way to analyze call logs in conjunction with transactional data to determine the root cause of the increase. Using an XML-based analytic platform operating directly on unfiltered log data, the center was eventually able to determine that AHT was increasing because agents were spending more time up-selling customers as part of an ongoing promotion. This type of insight would not have been possible using existing BI or reporting technology.
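The kind of correlation this contact-center example describes can be illustrated with a toy join between raw call-log entries and transactional up-sell records. All field names and data values below are invented for illustration:

```python
# Illustrative correlation of call-log data with transactional records:
# do calls in which an agent pitched the promotion run longer on average?
calls = [
    {"call_id": 1, "seconds": 300},
    {"call_id": 2, "seconds": 540},
    {"call_id": 3, "seconds": 320},
    {"call_id": 4, "seconds": 560},
]
# call_ids appearing in the transactional up-sell records (hypothetical)
upsell_ids = {2, 4}

with_upsell = [c["seconds"] for c in calls if c["call_id"] in upsell_ids]
without_upsell = [c["seconds"] for c in calls if c["call_id"] not in upsell_ids]

def avg(values):
    return sum(values) / len(values)

print(avg(with_upsell), avg(without_upsell))  # 550.0 310.0
```

Neither data set answers the question on its own; only the correlation across the log and the transactional system reveals that up-selling, not routing, is driving the longer calls.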
Being able to analyze extensible data allows companies to go beyond measurement of key performance metrics to understand the root causes behind changes in these indicators. By correlating data across different sources of information, business analysts can understand how decisions will impact interdependent business metrics, such as Web site traffic and call volume. These insights enable companies to reduce exposure to hidden risks, ensure compliance, optimize business processes and increase profitability.