Primer for Unstructured Data and Semantics
Beyond the Data Warehouse
Information Management Online, October 27, 2005
The long-predicted barrage of unstructured content is here. In the last two years, regulatory changes (Sarbanes-Oxley, etc.) and some technology advances have made this type of content a relevant target for companies and organizations that want to manage risk more effectively and accrue business value.
Now that the traditional gurus (Inmon, Kimball, etc.) are addressing this issue, there is a sense of legitimacy to the area. Prior to this, companies felt it was too esoteric to address, although the problems and opportunities certainly existed. (Hey, Ralph and Bill are smart guys, but they didn't invent data, for gosh sakes.) However, pain is a great motivator, and many organizations are now in pain due to risk and their failure to fully leverage the BI technologies of the past few years.
Therefore, in this information architecture column we are going to address the architectural aspects of the amorphous "stuff" known as unstructured data. We will briefly review the classification of this complicated type of content, but primarily we will focus on the architectural aspects of processing and using unstructured data. Remember, there is no value in data/information unless it is used.
Why Unstructured Data/Information (UD/UI) is Complicated
In his column entitled "The Integration of Unstructured Data into a Business Intelligence System" of December 21, 2004, William McCrosky presents a nice view of the unstructured data spectrum. I repeat it here for reference.
1. "The source document is paper, not electronic. Insurance, medical and human resource forms are often paper-based." Note that the data of interest, in this case, is reasonably structured. Its position on the source document can be spatially located.
2. The source document is structured - as in example 1 - but it is already in an electronic form. A Web technology such as XML may be used.
3. The source document is electronic, but the data of interest is not structured. E-mail and word processing documents fall into this category.
4. The source document is paper, and the data is unstructured. There is no electronic representation of the document - perhaps a historical document prepared before the advent of word processing.
5. The source is a "blob," not a document - such as pictures, voice or video.
Bill Inmon expressed it this way recently: "Reading unstructured data is merely the first step in starting to filter it out. After the unstructured data is read, it needs to be edited and prioritized. The problem is that the unstructured data is exactly that - unstructured. There is no structure or format for the data; therefore, getting a handle on what is important and what is not important is no small feat." (DM Review Magazine, December 2004.)
These two excerpts point out the difficulty and complexity of getting a handle on what is important within the UI. Fortunately, enough work has been done with this "stuff" to present a few options for developing an architecture and technology to manage UI/UD.
Fundamental Components of UD/UI Architecture
There are many generic components to consider when developing UD/UI solutions for your organization. These are:
- Taxonomy - A taxonomy is a hierarchical classification structure, such that it descends from broad to specific or from parent to child. The UI/UD architecture demands an effective taxonomy. This is a step that cannot be avoided. In our structured data warehouse-focused decades, most of us were able to sneak up on delivering a product without substantial (if any) meta data. UI/UD demands organized meta data.
- Ontology - An ontology describes the rules and views of the taxonomy. Think of taxonomy as a hierarchical logical model, and ontology more as logical tables or networks, i.e., views with triggers - crude but gets you there. Alternatively, an ontology is a way to organize taxonomies (and other expressions of data relationships): "An ontology is a formal way to organize knowledge and terms. Typically ontologies are represented as graphical relationships or networks, as opposed to taxonomies which are usually represented hierarchically." An example would be a query for Cabernet: the ontology would know that Cabernet is also a type of wine.
- Content acquisition - regardless of how it is viewed and arranged, content is the "stuff" that is read, the real instances of UI/UD.
- Parse - no matter what tool or approach, at some point, UI/UD needs to be chopped up into bits to be presented, summarized or analyzed.
- Tag or ascribe - Semantics is a growing science, but content still needs to have meaning and context. Semantic engines combine taxonomy and ontology into an expression of context. Or, to wax philosophical, meaning vs. definition. Whether the meaning is extracted out of the data and stored as structure or the content is tagged in some meaningful way is irrelevant. All UI/UD needs to be examined and have some context assigned.
- Management of UI/UD - Like any other content to be managed, UI/UD needs some basic functions that are made quirky due to the nature of the content:
  - Memory and storage - Most likely UI/UD will occupy lots more disk and require lots more memory to deal with. Some of the products on the market are calling for many gigabytes of memory to manage the ontology schemes.
  - Community management - UI/UD is useless unless it can be moved about and shared. Determining who can collaborate, view and share is a function of business needs, cross-functional process design and regulatory governance.
  - Content management - Loading, tracking and storing with a good address, index and viewing platform is mandatory. UI/UD may go beyond the capability of some content management packages, however, so be prepared to look into less common software such as that used in the film and news industries.
- View and use - Invariably, you might say, "There is a fact in this document," and another party will say, "Fine, show me the document and the fact." Therefore, query tools and reporting take on the job of combining UI/UD with traditional "rows and columns."
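To make the taxonomy, ontology, parse and tag components a little more concrete, here is a minimal Python sketch built around the Cabernet example. Everything in it is an illustrative assumption rather than any vendor's approach: the term lists, the relation names (`pairs_with`, `grown_in`) and the function names are invented for the example, and a real semantic engine would be far more sophisticated.

```python
import re

# Hypothetical taxonomy: each term points to its parent (specific -> broad).
TAXONOMY = {
    "wine": "beverage",
    "red wine": "wine",
    "cabernet": "red wine",
    "merlot": "red wine",
}

# Hypothetical ontology: non-hierarchical relationships between terms,
# keyed by (term, relation name).
ONTOLOGY = {
    ("cabernet", "pairs_with"): ["steak"],
    ("cabernet", "grown_in"): ["napa valley"],
}


def ancestors(term):
    """Walk the taxonomy upward from a term to its broadest parent."""
    chain = []
    while term in TAXONOMY:
        term = TAXONOMY[term]
        chain.append(term)
    return chain


def related(term, relation):
    """Look up a network-style relationship in the ontology."""
    return ONTOLOGY.get((term, relation), [])


def parse(text):
    """Parse step: chop raw content into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tag(text):
    """Tag step: assign context by matching tokens against the taxonomy.

    Only single-word terms are matched directly; broader terms such as
    "red wine" are reached by walking up from the matched term.
    """
    tokens = set(parse(text))
    tags = set()
    for term in TAXONOMY:
        if term in tokens:
            tags.add(term)
            tags.update(ancestors(term))
    return tags


if __name__ == "__main__":
    note = "We ordered a Cabernet with dinner."
    print(sorted(tag(note)))
    print(related("cabernet", "pairs_with"))
```

With tags like these stored alongside the document, a query for "wine" can find the note even though the word never appears in it, which is the kind of context the tagging step is meant to supply.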
Technology Approaches