Two Worlds of Data – Unstructured and Structured
Information Management Magazine, September 2004
Google, one of the premier free-form search engines on the planet, may be getting a little skittish about the Microsoft Longhorn project. Google wants to unleash its search technology on the enterprise and on the desktop, but with Longhorn, Microsoft plans to have that capability built into its operating system and its new file system. Is free-form searching really the next battleground? Or is bringing together two needles in a haystack of information really the holy grail of search technology?
In the new category of enterprise software called business performance management (BPM), bringing together the worlds of structured and unstructured data can add significant value to the enterprise. BPM fosters new levels of corporate accountability, financial rigor and tangible value creation across the distributed global organization. It is driven by the imperative to align internal and external constituencies with business objectives through real-time availability and continuous exchange of financial, transactional and operational information. Effectively implemented, BPM enables enterprises to better shape and influence business outcomes by improving the caliber and speed of decision making. With it, executives can anticipate and respond to shifting market dynamics, intelligently allocate and utilize critical resources and consistently meet management and shareholder expectations.
Data in BPM
People use unstructured data every day. Although they may not be aware of it, they use it whenever they create, store or retrieve reports, e-mails, spreadsheets and other types of documents. Unstructured data is any data stored in an unstructured format at an atomic level. That is, unstructured content carries no conceptual definition and no data type definition - in textual documents, a word is simply a word. Some current technologies for content searches on unstructured data require tagging entities such as names, or applying keywords and meta tags. Human intervention is therefore required to help make the unstructured data machine-readable.
People also use structured data every day. Structured data is any data whose atomic values conform to an enforced composition of data types. It is managed by technology that allows querying and reporting against predetermined data types and understood relationships.
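The contrast between the two kinds of data can be made concrete with a small sketch. The Python below is illustrative only: the `sales` table, the sample memo and the tag dictionary are all invented for this example and do not come from any particular product. It shows a typed query against structured rows, a keyword search that treats a word as simply a word, and human-supplied tags that give a few words a meaning a program can use.

```python
import re
import sqlite3

# Structured data: the schema enforces a type and a meaning for every value.
# (Hypothetical table and rows, invented for illustration.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("East", "Q1", 120.0), ("West", "Q1", 95.5), ("East", "Q2", 130.0)],
)
east_total = conn.execute(
    "SELECT SUM(revenue) FROM sales WHERE region = ?", ("East",)
).fetchone()[0]

# Unstructured data: free text with no data types and no defined concepts.
memo = "Revenue in the East region slipped after Jane Smith left for Acme."


def keyword_hits(text, keyword):
    """Count whole-word matches; the search knows nothing about meaning."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Tags supplied by a person are what make the text machine-readable here.
# (Hypothetical tag set, invented for illustration.)
tags = {"person": ["Jane Smith"], "company": ["Acme"], "region": ["East"]}


def tagged_entities(text, tag_map):
    """Return, per label, the tagged values that actually occur in the text."""
    return {
        label: [value for value in values if value in text]
        for label, values in tag_map.items()
    }


print(east_total)
print(keyword_hits(memo, "east"))
print(tagged_entities(memo, tags))
```

The query can sum revenue because the schema says which column holds a number. The keyword search can only report that a string occurs. The tags help only because someone decided in advance that "Jane Smith" is a person and "Acme" is a company.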
Two Categories of Unstructured Data
Unstructured data consists of two basic categories:
- Bitmap Objects: Inherently non-language based, such as image, video or audio files.
- Textual Objects: Based on a written or printed language, such as Microsoft Word documents, e-mails or Microsoft Excel spreadsheets.
Both of these object types may be classified as data, but the technology and methodology for harnessing relevant information from bitmap objects are still in their infancy. Most of today's technology addresses textual objects. Enterprise content management (ECM) technologies, for example, can help contain unstructured data. Textual data mining and analysis vendors provide analysis tools for unstructured textual objects, and business intelligence vendors supply solutions for querying and analyzing structured data. However, bringing them together - querying both the unstructured and structured worlds and then associating the two at relevant points - is where the most value is gained and also where the challenge is greatest.
Comparing these categories with structured data raises three distinct challenges:
- Even if unstructured data is in a format such as a Microsoft Word template, the data is still not consumable at a semantic level without a compatible interface or application.
- Even with a compatible technology, we cannot necessarily gain insight into the context of the information unless we can actually read it.
- Lastly, the way we interpret what we read is largely subjective.
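As a rough sketch of what associating the two worlds "at relevant points" could look like, the Python below links structured account records to free-text documents that mention the same customer. Everything in it is an assumption made for illustration: the `Account` and `Document` types, the sample data and the naive substring match are hypothetical, and real products would use far more sophisticated matching.

```python
from dataclasses import dataclass


@dataclass
class Account:
    """A structured record: each field has a fixed name and type."""
    customer: str
    overdue_balance: float


@dataclass
class Document:
    """An unstructured object: the body is just text."""
    title: str
    body: str


def associate(accounts, documents):
    """Link each account to the documents that mention its customer by name.

    The customer name is the only point of association, and the match is a
    plain case-insensitive substring test (an assumption for illustration).
    """
    links = {}
    for account in accounts:
        needle = account.customer.lower()
        links[account.customer] = [
            doc.title for doc in documents if needle in doc.body.lower()
        ]
    return links


# Hypothetical sample data, invented for this example.
accounts = [Account("Northwind", 1200.0), Account("Contoso", 0.0)]
documents = [
    Document("Call notes", "Northwind disputes the March invoice."),
    Document("Newsletter", "The office is closed on Friday."),
]

links = associate(accounts, documents)
print(links)
```

The sketch finds that the call notes mention Northwind, but it cannot tell whether the dispute explains the overdue balance. That judgment still depends on someone reading the document, which is the second and third challenge in the list above.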
"A Picture is Worth..."
One challenge in dealing with unstructured data is that the written word often fails to communicate the exact meaning intended. There is a stark division between the written word and the spoken word. The phonetically written word sacrifices worlds of meaning and perception that were once secured in hieroglyphics and still are in the Chinese ideogram. Writing systems such as these provide gestalt - an understanding of the whole within the picture. The Western alphabet cannot convey context and concept through its symbols alone.
A recent Wall Street Journal article provides a good example of why it is not necessarily appropriate to assign a quantitative value to unstructured data. The Wall Street Journal analyzed a collection of high schools, both public and private, and calculated the percentage of graduating seniors who were accepted to a group of highly selective colleges. At the top of the list was a private school in Brooklyn, New York, called Saint Ann's. Saint Ann's came in first, with a whopping 41 percent of graduates gaining admission to 10 of the nation's most exclusive schools, such as Yale, Harvard, Brown, Duke and Cornell. Saint Ann's even beat the Hopkins School in New Haven, Connecticut, where 51 percent of the students in the senior class this year were National Merit Scholars.1
The interesting point about Saint Ann's and its success rate is the way teachers assign grades to their students. They don't. While at the school, students get written reports about their achievements and the areas that need improvement, but no number or letter grade is ever assigned to their work. Upon graduation, students receive personal essays about their work from the school's headmaster. Therefore, Saint Ann's college applicants cannot be placed in a "GPA bucket" with other applicants. Each application from Saint Ann's must be read to acquire a full picture of the student. Students from other high schools may well be disqualified at the gate because of a poor GPA - a number meant to represent the quality of knowledge or learning the student has achieved. Perhaps some unstructured information is simply not meant to have a number assigned to it as a measure of its worth.
Two Approaches to the Problem