The Problem with Unstructured Data
The terms semistructured data and unstructured data can mean different things depending on the context. In this article, I will stick to a simple definition for both. When I use either term, I am referring to text-based information (not video or sound) that has no explicit metadata associations but does have implicit metadata that a human can understand. For example, a purchase order sent by fax has no explicit metadata, but a human can extract the relevant data items from the document. The difference between the two lies in whether portions of the data have associated metadata (semistructured) or none of it does (unstructured). For the remainder of this article, I will use the term unstructured data to designate both.
Given the formats that unstructured data can take, such as PDFs, Excel files and messaging formats such as EDI, SWIFT and HL7, it is clear that this data is here to stay and, in many cases, is growing exponentially. Case in point: EDI is still the data format used by the vast majority of electronic commerce transactions in the world. A key problem, however, is that neither unstructured data nor XML is handled naturally by the current generation of BI and integration tools, especially extract, transform and load (ETL) technologies. ETL grew out of the need to create data warehouses from production databases, which means that it is geared toward handling large amounts of relational data and very simple data hierarchies. In a world that is moving toward XML, tools can no longer assume that the data in both the source and the target is well structured with little or no hierarchy; the data is often deeply hierarchical, and the source and target hierarchies can be quite different from each other. The next generation of integration tools will need to do a much better job of inherently supporting both unstructured and XML data in order to continue to deliver on the promise of business integration.
XML as a Common Denominator
By first extracting the information from unstructured data sources into XML format, it is possible to treat integration of unstructured data similarly to XML integration. In addition, structured data has a natural XML structure that can be used to describe it (i.e., a simple reflection of the source structure), so using XML as the common denominator for describing both unstructured data and structured data makes integration simpler to manage.
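To make the idea of a "natural" XML structure concrete, here is a minimal sketch in Python. The table name, the column names and the `rows_to_xml` helper are hypothetical and chosen only for illustration; the point is that the XML simply mirrors the rows and columns of the source.

```python
import xml.etree.ElementTree as ET

def rows_to_xml(table_name, rows):
    """Reflect relational rows as XML: one <row> per record, one child per column."""
    root = ET.Element(table_name)
    for row in rows:
        row_el = ET.SubElement(root, "row")
        for column, value in row.items():
            ET.SubElement(row_el, column).text = str(value)
    return root

# Hypothetical rows, as they might come back from a database query.
orders = [
    {"order_id": 1001, "customer": "ACME", "total": "250.00"},
    {"order_id": 1002, "customer": "Globex", "total": "99.50"},
]

xml_text = ET.tostring(rows_to_xml("orders", orders), encoding="unicode")
print(xml_text)
```

No design decisions are needed for this step, which is why the structured side of the flow is usually the easy part.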
Using XML as the syntax for the different data types allows a simple, logical flow for combining structured, XML and unstructured data (see Figure 1):
1. Extract data from structured sources into a "natural" XML stream.
2. Extract data from unstructured sources into an XML stream.
3. Transform the two streams as needed (cleansing, lookup, etc.).
4. Map the XMLs into the target XML.
Figure 1: Standard Flow for Combining Structured, Unstructured and XML Information
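The flow can be sketched end to end in a few lines of Python. This is an illustration under stated assumptions, not a description of any particular product: the sample purchase order text, the field labels, the customer rows and the target `<Order>` layout are all hypothetical, and a real unstructured source would need a far more capable parser than the regular expressions used here.

```python
import re
import xml.etree.ElementTree as ET

def extract_structured(rows):
    """Extract relational rows into a 'natural' XML stream."""
    root = ET.Element("customers")
    for row in rows:
        el = ET.SubElement(root, "customer")
        for col, val in row.items():
            ET.SubElement(el, col).text = str(val)
    return root

def extract_unstructured(text):
    """Extract labelled values from free text into an XML stream."""
    root = ET.Element("purchase_order")
    patterns = {  # hypothetical labels for this sample document
        "po_number": r"PO Number:\s*(\S+)",
        "customer_id": r"Customer:\s*(\S+)",
        "amount": r"Amount:\s*\$?([\d,]+\.\d{2})",
    }
    for tag, pattern in patterns.items():
        match = re.search(pattern, text)
        if match:
            ET.SubElement(root, tag).text = match.group(1)
    return root

def transform(po, customers):
    """Cleanse the amount and look up the customer name."""
    amount = po.find("amount")
    if amount is not None:
        amount.text = amount.text.replace(",", "")
    names = {c.findtext("id"): c.findtext("name") for c in customers.findall("customer")}
    ET.SubElement(po, "customer_name").text = names.get(po.findtext("customer_id"), "UNKNOWN")
    return po

def map_to_target(po):
    """Map the combined data into a (hypothetical) target XML layout."""
    order = ET.Element("Order", {"number": po.findtext("po_number", default="")})
    buyer = ET.SubElement(order, "Buyer", {"id": po.findtext("customer_id", default="")})
    buyer.text = po.findtext("customer_name")
    ET.SubElement(order, "Total").text = po.findtext("amount")
    return order

fax_text = "PURCHASE ORDER\nPO Number: PO-7781\nCustomer: C042\nAmount: $1,250.00\n"
customer_rows = [{"id": "C042", "name": "ACME Corp"}, {"id": "C043", "name": "Globex"}]

target = map_to_target(transform(extract_unstructured(fax_text), extract_structured(customer_rows)))
result = ET.tostring(target, encoding="unicode")
print(result)
# <Order number="PO-7781"><Buyer id="C042">ACME Corp</Buyer><Total>1250.00</Total></Order>
```

Once both sources are XML, the transform and map stages no longer need to know which one started life as a fax and which as a database table.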
With the expansion of XML and unstructured data use cases, this flow is becoming more and more pervasive in large integration projects. These use cases fall outside the sweet spot of current ETL and enterprise application integration (EAI) architectures, the two standard integration platforms in use today. The reason is that both ETL and EAI have difficulty with steps 2 and 4. Step 2 is problematic because very few tools on the market can easily "parse" unstructured data into XML and allow it to be combined with structured data. Step 4 is challenging because the mapping tools in current integration products are underpowered and fall apart when the hierarchy changes or when other complex mappings are needed. All of today's ETL and EAI tools require hand coding to meet these challenges, which adds both time and expense to the integration project.
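To see why a change of hierarchy is harder than a field-to-field mapping, consider the following Python sketch. The element and attribute names are hypothetical. The source groups items under orders, while the target groups orders under products, so the mapping has to invert the nesting rather than copy values across.

```python
import xml.etree.ElementTree as ET

# Hypothetical source: order -> item
SOURCE = """
<orders>
  <order id="1" customer="ACME"><item sku="A1" qty="2"/><item sku="B7" qty="1"/></order>
  <order id="2" customer="Globex"><item sku="A1" qty="5"/></order>
</orders>
"""

def regroup_by_sku(source_root):
    """Invert the hierarchy into product -> ordered_by."""
    target = ET.Element("products")
    by_sku = {}  # sku -> <product> element already created in the target
    for order in source_root.findall("order"):
        for item in order.findall("item"):
            sku = item.get("sku")
            product = by_sku.get(sku)
            if product is None:
                product = ET.SubElement(target, "product", {"sku": sku})
                by_sku[sku] = product
            ET.SubElement(product, "ordered_by", {
                "customer": order.get("customer"),
                "order": order.get("id"),
                "qty": item.get("qty"),
            })
    return target

products = regroup_by_sku(ET.fromstring(SOURCE))
print(ET.tostring(products, encoding="unicode"))
```

Even this small inversion needs grouping state (`by_sku`) that a simple column-to-column mapping has no place for, which is the kind of logic that ends up hand coded.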
The Importance of Parsing
When working with unstructured data, parsing the data to extract the relevant information is clearly a basic requirement. Hand-coding a parser is difficult, error-prone and tedious work, which is why parsing support must be a basic part of any integration tool (ETL or EAI). Given its importance, it is surprising that integration tool vendors have only just started to address this requirement. Example-driven parsing has proven to be the paradigm of choice for creating dynamic parsers for unstructured data formats commonly found in enterprise applications. PDF files and spreadsheets, for example, are pervasive in such diverse business processes as claims processing and patient record management. Think of example-driven parsing as the equivalent of visual mapping for unstructured data: you define, test and debug a parser using a visual mark-and-map process directly on a sample of the data source.
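The following Python sketch is a deliberately simplified, non-visual stand-in for the mark-and-map idea. It assumes that the user has marked a few values in a sample document and that each value follows a label on the same line; the functions `learn_parser` and `parse`, the sample text and the field names are all hypothetical. Commercial tools infer far richer rules than this.

```python
import re
import xml.etree.ElementTree as ET

def learn_parser(sample, marks):
    """Derive one rule per marked field from a sample document.

    `marks` maps a target element name to the value the user marked in the
    sample. The label preceding that value on its line becomes the anchor.
    """
    rules = {}
    for field, value in marks.items():
        for line in sample.splitlines():
            pos = line.find(value)
            if pos > 0 and line[:pos].strip():
                anchor = line[:pos].strip()
                rules[field] = re.compile(re.escape(anchor) + r"[ \t]*(.+)")
                break
        else:
            raise ValueError(f"no label found before the marked value for {field!r}")
    return rules

def parse(text, rules, root_tag="claim"):
    """Apply the learned rules to a new document and emit XML."""
    root = ET.Element(root_tag)
    for field, rule in rules.items():
        match = rule.search(text)
        if match:
            ET.SubElement(root, field).text = match.group(1).strip()
    return root

sample = "Claim No: CL-1001\nPatient: Jane Roe\nDate of Service: 2024-03-14\n"
marks = {"claim_no": "CL-1001", "patient": "Jane Roe", "service_date": "2024-03-14"}
rules = learn_parser(sample, marks)

new_doc = "Claim No: CL-2044\nPatient: John Q. Public\nDate of Service: 2024-04-02\n"
parsed = ET.tostring(parse(new_doc, rules), encoding="unicode")
print(parsed)
```

The parser is defined by pointing at data in an example rather than by writing extraction code, so it can be tested immediately against other samples of the same document type.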
The key criterion in selecting tools for parsing the full spectrum of data is the flexibility to present sources and targets in the form that is most efficient to work with for users who are knowledgeable about the data. Tools with powerful data visualization are preferred in cases where data is unstructured. A comprehensive tool must represent hierarchies as easily as it does documents and native XML, allowing parsers (and their reverse, often referred to as serializers) to be created dynamically and entirely visually. Forcing the wrong tool or visual metaphor on data for parsing (and mapping, for that matter) leads to custom code, which is highly undesirable in integration work for a variety of reasons, not the least of which are maintenance and portability.
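The relationship between a parser and its serializer can be illustrated with a small round trip in Python. The pipe-delimited message below is a made-up format that only loosely resembles segment-based messaging formats; it is not actual EDI, SWIFT or HL7, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

def parse_message(message):
    """Parser: 'SEG|f1|f2' lines -> <message><SEG><field>f1</field>...</SEG></message>."""
    root = ET.Element("message")
    for line in message.strip().splitlines():
        name, *fields = line.split("|")
        segment = ET.SubElement(root, name)
        for value in fields:
            ET.SubElement(segment, "field").text = value
    return root

def serialize_message(root):
    """Serializer: the reverse mapping, from XML back to the delimited text."""
    lines = []
    for segment in root:
        values = [field.text or "" for field in segment]
        lines.append("|".join([segment.tag] + values))
    return "\n".join(lines)

raw = "HDR|PO-7781|2024-03-14\nLIN|A1|2\nLIN|B7|1"
tree = parse_message(raw)

# Parsing and then serializing should reproduce the original message.
assert serialize_message(tree) == raw
```

When a tool generates both directions from one visual definition, the two stay consistent by construction; when they are hand coded separately, keeping them in step becomes part of the maintenance burden.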









