The data landscape now encompasses a dizzying array of new information channels, new sources of data, and new analysis and reporting imperatives. According to analyst groups, nearly 80 percent of today's data is unstructured, and new information channels such as Web, email, voice over IP, instant messaging (IM) and text messaging are rapidly creating huge stores of nontraditional data.
Never has there been a better opportunity to gain real insight from data - or a bigger challenge for BI practitioners and technologists. While most businesses are eager to turn this new data into useful information, many find that their current BI technology, designed for a simpler data landscape, cannot deliver robust and thorough analysis of mission-critical unstructured data.
This article examines some of the challenges BI practitioners face in addressing unstructured data and proposes a new set of requirements for the next generation of BI technology designed to overcome these challenges.
The Evolution of Unstructured Data
All data is "born" unstructured and then acquires structure through human intervention. Unfortunately, in the process of adding structure to data, we generally lose significant information along the way.
The loss of information in the name of structure goes back to the origins of data. The first recorded business-related data consists of sales transactions found on clay tablets from Mesopotamia. These sales records represent the first example of structured business data: while the final transaction has been recorded, the steps the buyer and seller went through to arrive at the price - the "haggling" data - are lost forever.
Not much has changed since then. Buyer and seller still need to agree on a deal and settle by means of a contract, followed by a sales record that captures some very basic information such as a buyer, a seller, a price, a product and the like. No matter how extensive business negotiations might have been and no matter how much data was generated in the process, only a small and repeatable portion of this data gets captured in a highly structured form, such as a sales transaction record.
Today, however, haggling data can be captured in a semistructured way using XML and HTML, among other formats: the data is tagged and then managed by means of those tags. This auxiliary business data can be extremely important to understand because it holds the key to improving business conversion rates, to achieving higher levels of customer satisfaction and to increasing the competitiveness of the business. Although we can now, for the first time, begin to record this haggling data for the ages, analysis of this data remains a computational challenge.
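To make the idea of tagging haggling data concrete, here is a minimal sketch in Python. It assumes a hypothetical XML layout for a price negotiation; the tag and attribute names (negotiation, offer, agreement, party, price) are invented for illustration and do not come from any standard or product. The sketch shows that once the back-and-forth is tagged, it can be summarized alongside the final sale.

```python
import xml.etree.ElementTree as ET

# Hypothetical negotiation record; all tag and attribute names are illustrative.
NEGOTIATION_XML = """
<negotiation product="widget">
  <offer party="seller" price="120.00"/>
  <offer party="buyer" price="90.00"/>
  <offer party="seller" price="110.00"/>
  <offer party="buyer" price="100.00"/>
  <agreement price="100.00"/>
</negotiation>
"""

def summarize_negotiation(xml_text):
    """Summarize the tagged haggling steps that precede the final sale."""
    root = ET.fromstring(xml_text.strip())
    # Each <offer> is one step of the haggling that a sales record would drop.
    offers = [(o.get("party"), float(o.get("price"))) for o in root.findall("offer")]
    final_price = float(root.find("agreement").get("price"))
    # The seller's first offer is treated as the opening ask.
    opening_ask = next(price for party, price in offers if party == "seller")
    return {
        "product": root.get("product"),
        "rounds": len(offers),
        "opening_ask": opening_ask,
        "final_price": final_price,
        "discount": opening_ask - final_price,
    }

summary = summarize_negotiation(NEGOTIATION_XML)
```

A traditional sales record would keep only the product and the final price; the tagged version also preserves how many rounds the negotiation took and how far the price moved.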
Extensible Data
Structured data has become synonymous with relational data, while unstructured data is commonly associated with file servers and document management systems. But what about data that falls right in between: unstructured data that has started on its evolutionary path but has not made it all the way to completely structured?
Data in this state of flux is often referred to as "extensible data." Extensible data is unstructured data in transition to a structured form. XML data, HTML pages, PDF documents, email messages, HTTP traffic, clickstream data, search results and application log files are all examples of extensible data. This is the haggling data discussed earlier.
There is significant value in tapping into extensible data, with many critical application areas that stand to benefit. For example, being able to link call center log files with other sources of data could enable a call center manager to understand what is driving an increase in interaction handling times; having insight into call log data generated by self-service speech applications could potentially decrease transfers to live agents.
Figure 1 offers some examples of possible application areas for extensible data analysis.

Figure 1: Potential Application Areas for Extensible Data Analysis
The Extensible Data Challenge
While the benefits of analysis may be significant, extensible data has certain characteristics that make it hard to address with traditional BI technologies.
Extensible data is structurally complex. It is rich in content and often rich in structure, and is characterized by dimensional, hierarchical and containment relationships that go above and beyond the traditional data models used for structured data (clickstream data, for example, is hierarchical because it is generated as a result of visiting the hierarchical environment of a Web site).
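The hierarchical nature of clickstream data can be illustrated with a small Python sketch. It assumes that each click is available as a URL path and that the site hierarchy mirrors the path segments; both are simplifying assumptions, and the function name and sample paths are hypothetical. Each page view is counted for the page itself and for every ancestor section of the site.

```python
from collections import Counter

def rollup_page_views(paths):
    """Count each page view at every level of the site hierarchy."""
    counts = Counter()
    for path in paths:
        parts = [p for p in path.split("/") if p]
        counts["/"] += 1  # every view also counts toward the site root
        for depth in range(1, len(parts) + 1):
            counts["/" + "/".join(parts[:depth])] += 1
    return counts

# Hypothetical clickstream: one URL path per page view.
clicks = [
    "/products/widgets/blue",
    "/products/widgets/red",
    "/products/gadgets",
    "/support",
]
views = rollup_page_views(clicks)
```

A flat table of page hits would lose the containment relationship between a section and its pages; keeping the path lets the same data answer questions at any level of the hierarchy.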
Extensible data is quite variable. Every instance of extensible data can potentially be different from another instance, not just regarding data values but also data structures (e.g., a collection of Web pages on a single Web site). Data management techniques applied to one instance of extensible data may not work on the next instance; for example, some customers might use IM to obtain technical support and others may use a self-service Web site for the same purpose.
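As a rough illustration of structural variability, the following Python sketch takes support interactions whose fields differ by channel and measures how often each field appears before aligning the records to a common set of columns. The record layouts and field names are hypothetical, and filling absent fields with None is only one of several possible design choices.

```python
from collections import Counter

def field_coverage(records):
    """Count how many records carry each field."""
    coverage = Counter()
    for record in records:
        coverage.update(record.keys())
    return coverage

def normalize(records):
    """Align every record to the union of all fields, using None for gaps."""
    columns = sorted(field_coverage(records))
    return [{col: rec.get(col) for col in columns} for rec in records]

# Hypothetical support interactions captured through different channels.
interactions = [
    {"customer": "A17", "channel": "im", "transcript": "printer offline"},
    {"customer": "B42", "channel": "web", "pages_viewed": 5},
    {"customer": "C03", "channel": "web", "pages_viewed": 2, "search_terms": "reset"},
]
coverage = field_coverage(interactions)
rows = normalize(interactions)
```

The coverage counts show which fields are shared across instances and which are specific to one channel, which is the kind of structural difference that a fixed schema tends to hide.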
Extensible data is dynamic. Extensible data is prone to change because structure is constantly being imposed upon it. For example, an application log is always subject to change following any changes in the application logic and/or application usage patterns. With constant change also comes an increasing number of versions of the extensible data structure.
Extensible data volumes are huge. Nonrelational data is often produced in volumes that massively exceed the typical output of a transactional system, creating a data management nightmare even before this data becomes an analytical challenge. Extensible data encompasses much of the data captured during the course of a business interaction. For example, hundreds of thousands of insurance quotes generate mountains of extensible data, yet only a fraction of those quotes become policies that are eventually stored in a structured format in the system.








