DEC 1, 2006 1:00am ET

Combining Structured, Semistructured and Unstructured Data in Business Applications

There is a growing consensus that both semistructured and unstructured data sources contain business-critical information and therefore must be made accessible for both business intelligence (BI) and operational needs [1, 2]. It is also clear that the amount of relevant unstructured business data is not only growing but will continue to grow in the foreseeable future. This trend is converging with the opening of business data through standardized XML formats and industry-specific XML data standards (e.g., ACORD in insurance, HL7 in healthcare). Together, these two trends are expanding the types of data that BI and integration tools need to handle and are consequently straining their transformation capabilities. This mismatch between existing transformation capabilities and emerging needs is opening the door for a new class of universal data transformation products that allow transformations to be defined for all classes of data (structured, semistructured and unstructured) without writing code, and then deployed to any software application or platform architecture.

The Problem with Unstructured Data

The terms semistructured data and unstructured data can mean different things depending on the context. In this article, I will stick to a simple definition for both. When I use the terms unstructured or semistructured data, I am referring to text-based information (not video or sound) that lacks complete explicit metadata but does have implicit metadata that a human can understand. For example, a purchase order sent by fax has no explicit metadata, but a human can extract the relevant data items from the document. The difference between semistructured and unstructured data lies in whether portions of the data have associated metadata or there is no metadata at all. For the rest of this article, I will use the term unstructured data to designate both.

Given the formats that unstructured data can take, such as PDFs, Excel files and messaging formats such as EDI, SWIFT and HL7, it is clear that this data is here to stay and, in many cases, is growing exponentially. Case in point: EDI is still the data format used by the vast majority of electronic commerce transactions in the world. A key problem, however, is that neither unstructured data nor XML is handled naturally by the current generation of BI and integration tools, especially extract, transform and load (ETL) technologies. ETL grew out of the need to create data warehouses from production databases, which means that it is geared toward handling large amounts of relational data and very simple data hierarchies. In a world that is moving toward XML, tools can no longer assume that the source and target data are well structured with little or no hierarchy. The data is deeply hierarchical, and the source and target hierarchies can be quite different from each other. It is clear that the next generation of integration tools will need to do a much better job of inherently supporting both unstructured and XML data in order to continue to deliver on the promise of business integration.

XML as a Common Denominator

By first extracting the information from unstructured data sources into XML format, it is possible to treat integration of unstructured data similarly to XML integration. In addition, structured data has a natural XML structure that can be used to describe it (i.e., a simple reflection of the source structure), so using XML as the common denominator for describing both unstructured data and structured data makes integration simpler to manage.

Using XML as the syntax for the different data types allows a simple, logical flow for combining structured, XML and unstructured data (see Figure 1):

  1. Extract data from structured sources into a "natural" XML stream,
  2. Extract data from unstructured sources into an XML stream,
  3. Transform the two streams as needed (cleansing, lookup, etc.),
  4. Map the XMLs into the target XML.

Figure 1: Standard Flow for Combining Structured, Unstructured and XML Information
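
As a rough illustration of the four steps, here is a minimal Python sketch that uses only the standard library. The sample fax text, the field names, the regular expressions, the customer lookup table and the target Order layout are all assumptions invented for this example. Real unstructured sources need far more robust parsing than a handful of patterns.

```python
import re
import xml.etree.ElementTree as ET

# Step 1: reflect a structured record (e.g., a database row) as "natural" XML.
def structured_to_xml(row):
    customer = ET.Element("customer")
    for column, value in row.items():
        ET.SubElement(customer, column).text = str(value)
    return customer

# Step 2: parse free text (e.g., an OCR'd fax) into XML with simple patterns.
# These patterns are hypothetical and fit only the sample text below.
PO_PATTERNS = {
    "po_number": r"PO\s*(?:Number|No\.?|#)\s*:?\s*(\S+)",
    "customer_id": r"Customer\s*ID\s*:?\s*(\S+)",
    "amount": r"Total\s*:?\s*\$?\s*([\d,]+\.\d{2})",
}

def unstructured_to_xml(text):
    po = ET.Element("purchase_order")
    for field, pattern in PO_PATTERNS.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            ET.SubElement(po, field).text = match.group(1)
    return po

# Step 3: cleanse values in place (here: drop thousands separators).
def cleanse(po):
    amount = po.find("amount")
    if amount is not None:
        amount.text = amount.text.replace(",", "")
    return po

# Step 4: map both XML streams into the target XML structure.
def map_to_target(customer, po):
    order = ET.Element("Order", {"id": po.findtext("po_number", default="")})
    buyer = ET.SubElement(order, "Buyer")
    ET.SubElement(buyer, "Name").text = customer.findtext("name")
    ET.SubElement(order, "Total").text = po.findtext("amount")
    return order

# Stand-in for a production database table keyed by customer ID.
CUSTOMERS = {"C042": {"id": "C042", "name": "Acme Corp"}}

fax_text = """PURCHASE ORDER
PO Number: 88-1432
Customer ID: C042
Total: $1,250.00"""

po = cleanse(unstructured_to_xml(fax_text))
customer = structured_to_xml(CUSTOMERS[po.findtext("customer_id")])
target = map_to_target(customer, po)
print(ET.tostring(target, encoding="unicode"))
```

Once both sources are expressed as XML, the transform and mapping steps no longer need to know which stream came from a database and which came from a document.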

With the expansion of XML and unstructured data use cases, this flow is becoming more and more pervasive in large integration projects. These use cases fall outside the sweet spot of current ETL and enterprise application integration (EAI) architectures, the two standard integration platforms in use today. The reason is that both ETL and EAI have difficulty with steps 2 and 4. Step 2 is problematic because there are very few tools on the market that can easily "parse" unstructured data into XML and allow it to be combined with structured data. Step 4 is challenging because current integration tools have underpowered mapping capabilities that break down when the hierarchy changes between source and target or when other complex mappings are needed. All of today's ETL and EAI tools require hand coding to meet these challenges, which adds both time and expense to the integration project.
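
To make the hierarchy problem in the mapping step concrete, consider a source that nests items under orders while the target nests them under customers. The following Python sketch shows the regrouping such a mapping has to perform. The element names, attribute names and sample data are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical order-centric source: customers are attributes of orders.
SOURCE = """
<orders>
  <order id="1" customer="Acme"><item sku="A-1" qty="2"/></order>
  <order id="2" customer="Globex"><item sku="B-7" qty="1"/></order>
  <order id="3" customer="Acme"><item sku="C-3" qty="5"/></order>
</orders>
""".strip()

def regroup_by_customer(orders_root):
    """Map an order-centric tree into a customer-centric tree."""
    target = ET.Element("customers")
    by_name = {}  # customer name -> its <customer> element in the target
    for order in orders_root.findall("order"):
        name = order.get("customer")
        customer = by_name.get(name)
        if customer is None:
            customer = ET.SubElement(target, "customer", {"name": name})
            by_name[name] = customer
        for item in order.findall("item"):
            # Keep the originating order ID so no information is lost.
            attributes = {
                "sku": item.get("sku"),
                "qty": item.get("qty"),
                "order": order.get("id"),
            }
            ET.SubElement(customer, "item", attributes)
    return target

customers = regroup_by_customer(ET.fromstring(SOURCE))
print(ET.tostring(customers, encoding="unicode"))
```

A field-to-field mapping cannot express this, because the parent of each item changes. The mapping has to group, deduplicate and re-parent, and that is the kind of logic that tends to end up as hand-written code.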

The Importance of Parsing

When working with unstructured data, parsing the data to extract the relevant information is clearly a basic requirement. Hand-coding a parser is difficult, error-prone and tedious work, which is why parsing support must be a basic part of any integration tool (ETL or EAI). Given its importance, it is surprising that integration tool vendors have only just started to address this requirement. Example-driven parsing has proven to be the paradigm of choice for creating dynamic parsers for unstructured data formats commonly found in enterprise applications. PDF files and spreadsheets, for example, are pervasive in such diverse business processes as claims processing and patient record management. Think of example-driven parsing as the equivalent of a mapping approach for unstructured data: you define, test and debug a parser using a visual mark-and-map process directly on a sample of the data source.
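
The idea behind example-driven parsing can be sketched in a few lines of Python. This is a toy under strong assumptions, not a description of how any commercial tool works: it assumes every value of interest sits on its own line after a fixed text label, and the names learn_parser and parse, the sample claim form and the field names are all hypothetical. The user "marks" values in one sample document, the code derives a pattern from the label that precedes each marked value, and the resulting parser is applied to a new document with the same layout.

```python
import re
import xml.etree.ElementTree as ET

def learn_parser(sample, marks):
    """Derive one pattern per field from the label preceding each marked value."""
    rules = {}
    for field, value in marks.items():
        for line in sample.splitlines():
            # The text before the marked value on its line serves as the anchor.
            label = line.partition(value)[0].strip() if value in line else ""
            if label:
                rules[field] = re.compile(
                    r"^\s*" + re.escape(label) + r"\s*(.+?)\s*$", re.MULTILINE
                )
                break
        else:
            raise ValueError(f"no labelled line found for {field!r}")
    return rules

def parse(document, rules, root_tag="record"):
    """Apply the learned patterns to a document and emit XML."""
    root = ET.Element(root_tag)
    for field, pattern in rules.items():
        match = pattern.search(document)
        if match:
            ET.SubElement(root, field).text = match.group(1)
    return root

# One sample document plus the values a user marked in it.
sample = """CLAIM FORM
Claim No: CL-1001
Patient: Jane Doe
Amount Due: 312.40"""
marks = {"claim_no": "CL-1001", "patient": "Jane Doe", "amount_due": "312.40"}
rules = learn_parser(sample, marks)

# A new document with the same layout but different values.
new_doc = """CLAIM FORM
Claim No: CL-2057
Patient: Raj Patel
Amount Due: 87.15"""
record = parse(new_doc, rules)
print(ET.tostring(record, encoding="unicode"))
```

The point of the sketch is the workflow: the parser is defined by pointing at data in an example rather than by writing format-specific code, so it can be tested and corrected against further samples.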

The key criterion in selecting tools for parsing the full spectrum of data is the flexibility to represent sources and targets in the presentation that is most efficient to work with for users who are knowledgeable about the data. Tools with powerful data visualization are preferred in cases where data is unstructured. A comprehensive tool must represent hierarchies as easily as it does documents and native XML, allowing parsers (and their reverse, often referred to as serializers) to be created dynamically and entirely visually. Forcing the wrong tool or visual metaphor on data for parsing (and mapping, for that matter) forces the development of custom code, which is highly undesirable for integration for a variety of reasons, not the least of which are maintenance and portability.

The Importance of Mapping
