The foundation of any successful business intelligence (BI) project rests on accurate, clean data. In fact, data quality tools should be a precursor to any BI exercise. The challenges that companies continue to face regarding data quality are mostly around free-form or unstructured data. Unstructured data - from such sources as forms, email or documents - contains a great deal of information that can be usefully employed in a BI system, but first it must be accurately captured and cleansed.

Unstructured data is what people like to read. I like to read magazines backward, but maybe that’s just the wrong structure. To a computer, the magazine itself is unstructured data - the messy data form that is natural to people, but hard for machines. Unstructured data often contains important information because it is easy for people to read and produce.

Everybody sets up a spreadsheet to get a handle on complicated number relationships that affect their jobs, or jots down notes to self in an email. Interestingly, a scanned letter is often considered structured data because it is a digital object that is neatly indexed and tucked away in a relational database. But the real information is in the text, in which a human being makes a statement, urges action, commends a result, etc.

BI people are big on charts. Bar charts, graphs and pies in various forms and colors. If you can count it, you can display it in a line graph or as a series of percentages. A dashboard is another big favorite. That’s a paper chart that moves a lot - hopefully in the direction you want.

This is great when your data is neat and always the same, no matter how often you look at it. BI systems are typically connected to one or more databases from which they get their input. The assumption is that the data is good, it isn’t corrupted, and it’s in a form that the BI system can digest.

But how do you chart the number of people who have emailed complaints about a common product feature in a given geographic area? How do you get a grip on the data in spreadsheets that are sitting on people’s workstations? Here’s a really tough one: Try displaying the number of Web pages from dreamyvacations.com, across everybody’s laptops, that were recently viewed to check into trips to St. Maarten.

This is the problem with unstructured data - it is in a format that you can’t use to understand its relationship to other data. Moreover, two pieces of data may be “coded” completely differently yet mean the same thing to the casual observer. One way to help is to standardize the data (i.e., interpret its meaning into a common form that can then be understood by a BI system). This is called normalization. For example, a name such as “3m Corp” might be normalized to “3M Corporation” and identified as a business name. “3M Corporation” might even have a bunch of additional outside data associated with it, such as a DUNS ID, a main address, telephone number and Web page, along with the names of corporate officers and a number of business locations. This is called augmentation - adding immediately associated information after a key is normalized.
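As a rough illustration of normalization followed by augmentation, here is a minimal Python sketch. The reference table, the suffix list and the function names (`normalize`, `augment`) are assumptions made for this example, and the attribute values are placeholders rather than real company data; a production data quality tool would use far richer matching rules.

```python
import re
from typing import Optional

# Hypothetical reference table: canonical name -> associated outside data.
# The attribute values are placeholders, not real company data.
REFERENCE = {
    "3M Corporation": {
        "entity_type": "business",
        "duns_id": "<DUNS ID>",
        "web_page": "<Web page>",
    },
}

# Hypothetical suffix expansions applied before matching.
SUFFIXES = {"corp": "corporation", "inc": "incorporated", "co": "company"}


def _key(name: str) -> str:
    """Build a comparison key: drop punctuation, lowercase, expand suffixes."""
    words = re.sub(r"[^\w\s]", " ", name).lower().split()
    return " ".join(SUFFIXES.get(word, word) for word in words)


# Index the canonical names by their comparison key.
INDEX = {_key(name): name for name in REFERENCE}


def normalize(raw: str) -> Optional[str]:
    """Return the canonical form of a raw name, or None if it is unknown."""
    return INDEX.get(_key(raw))


def augment(canonical: str) -> dict:
    """Attach the outside data associated with a normalized key."""
    return {"name": canonical, **REFERENCE[canonical]}


canonical = normalize("3m Corp")
if canonical is not None:
    print(augment(canonical))
```

Returning `None` for an unknown name, rather than guessing, keeps the failure visible so that it can later be counted as part of the quality picture.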

Once you normalize data, you can augment it and you can analyze or display it. If that’s what you want to do, then normalization is a big deal.

By some estimates, the U.S. public wrote 2.5 trillion emails in 2006 [1]. Most of them were spam, but even adjusting for that, you’re left with 800 to 900 million emails between actual people. This is a remarkable number, and you probably feel that on some days a lot of them land in your inbox.

Identity Resolution

Somewhere in this sea of information are questions, or answers, from consumers to their insurance company, phone company, etc. If you visit the call center of a health insurer, you will find workers hand-sorting thousands of emails a day from customers, containing all kinds of valuable information.

An email is a combination of structured data, in the form of a header, and unstructured data, in the form of the body. The header and routing information is formal because it is processed by machines. The body often contains free text with embedded information about the transaction. This can be names, product names or numbers, places, etc.

If these identities could be gleaned from the text body, normalized and associated with the formal header information, the email could be partially processed by programs and provided to a customer relationship management system. That would significantly boost customer service productivity, not to mention improve service quality. Let’s face it, how often do you actually fill out an online form? What you really want to do is just send an email from your BlackBerry and be done with it.

The process of picking out useful information from email text is called identity resolution and belongs in the area of data quality processing. A data quality system can parse the free text and identify names, places, codes, Social Security numbers, phone numbers, etc. with surprising accuracy. Of course, it’s still not 100 percent accurate, but when you are faced with 5,000 or 6,000 emails per day, 60 to 70 percent identity resolution accuracy is a real boon.
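To make the idea concrete, the sketch below reads the structured header with Python's standard `email` module and scans the free-text body with a few regular expressions. The patterns, the `POL-` policy number format and the function name `resolve_identities` are assumptions for illustration only; real data quality systems rely on much more than pattern matching, which is part of why accuracy falls short of 100 percent.

```python
import re
from email import message_from_string

# Simplified, hypothetical patterns for US-style identifiers.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "policy_number": re.compile(r"\bPOL-\d{6}\b"),  # invented format
}


def resolve_identities(raw_email: str) -> dict:
    """Pair the formal header fields with identities found in the body."""
    msg = message_from_string(raw_email)
    body = msg.get_payload()  # a plain string for a non-multipart message
    found = {label: pattern.findall(body) for label, pattern in PATTERNS.items()}
    return {
        "from": msg["From"],
        "subject": msg["Subject"],
        # Keep only the identity types that actually appeared in the text.
        "identities": {label: hits for label, hits in found.items() if hits},
    }


SAMPLE = """\
From: pat@example.com
Subject: Claim question

Hello, my policy is POL-123456 and my claim was denied.
Please call me at 555-867-5309.
"""

print(resolve_identities(SAMPLE))
```

The result ties the machine-readable sender and subject to the policy number and phone number pulled from the body, which is the shape of record a customer relationship management system could accept.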

The parallel benefit for BI is huge. Most customer service managers have no quantified information about what all these people are emailing about. The only reason the operation works is that workers and supervisors are familiar with what’s coming in, so managers receive anecdotal information. But anecdotes are not BI, and only quick and accurate BI enables the required degree of agility and customer responsiveness.

Metadata Quality and BI

As important as BI output is in enabling agile understanding of business dynamics, it is equally important to have confidence in its meaning.

For example, an online display in a transportation business might show how many trucks are underway, and how many cubic feet of space are productive (i.e., paid for). If you see that 60 percent of all available space is booked, how accurate do you suppose that is? Could there be late reports, broken-down trucks, shipments incorrectly coded? The confirmation or mere suspicion that BI data isn’t accurate is a major data quality issue. If you want a classic example, recall the 1999 Mars Climate Orbiter mishap, in which Lockheed Martin supplied operational parameters in English units while NASA was using the metric system. Same floating point numbers, different metadata, $125 million loss.

Information displayed in BI systems should be associated with information that describes its accuracy. Because most BI information is expressed in ratios of one sort or another, the number of measures or the completeness of these measures can dramatically influence the resulting ratio. This is metadata that describes BI usability.
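A small Python sketch shows how a ratio can carry its own completeness metadata, using the truck example. The `TruckReport` type, its field names and the fleet figures are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TruckReport:
    capacity_cuft: float
    booked_cuft: Optional[float]  # None when the truck has not reported yet


def booked_ratio_with_metadata(reports: List[TruckReport]) -> dict:
    """Compute the booked-space ratio together with how complete the input was."""
    reported = [r for r in reports if r.booked_cuft is not None]
    capacity = sum(r.capacity_cuft for r in reported)
    booked = sum(r.booked_cuft for r in reported)
    return {
        "booked_ratio": booked / capacity if capacity else None,
        "trucks_reporting": len(reported),
        "trucks_total": len(reports),
        "completeness": len(reported) / len(reports) if reports else None,
    }


# Hypothetical fleet: two trucks have reported, two have not.
fleet = [
    TruckReport(1000, 600),
    TruckReport(1000, 600),
    TruckReport(1000, None),
    TruckReport(1000, None),
]
print(booked_ratio_with_metadata(fleet))
```

Here the display would still show 60 percent booked, but the accompanying completeness of 0.5 tells the viewer that only half the fleet stands behind that figure.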

Data quality systems are the appropriate locus for the generation of BI metadata. The process of normalization and augmentation creates accuracy information that should be captured, stored and presented to the BI utility as metadata. Further, continuous monitoring of data with continuous quality monitors yields conclusions about its accuracy, and these conclusions are vital for BI accuracy.

What does this mean for data modelers and database developers? The needs of BI systems must be built in at design time: specifically, indicators and metadata on the quality of data, and descriptors of the quality process itself. Normalization and augmentation subsystems need to be constituent parts of the data update stream, and their process outcomes should be recorded.
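One possible way to record process outcomes inside the update stream is sketched below in Python. The step functions, the `REGIONS` lookup and the layout of the returned record are hypothetical; the point is only that each quality step leaves a trace next to the data it touched.

```python
from datetime import datetime, timezone

# Hypothetical lookup used by the augmentation step.
REGIONS = {"MN": "Midwest", "NY": "Northeast"}


def normalize_state(record: dict) -> dict:
    """Normalization step: bring the state code into a common form."""
    return {**record, "state": record["state"].strip().upper()}


def augment_region(record: dict) -> dict:
    """Augmentation step: add the region associated with the normalized key."""
    state = record["state"]
    if state not in REGIONS:
        raise ValueError(f"unknown state {state!r}")
    return {**record, "region": REGIONS[state]}


STEPS = [("normalize_state", normalize_state), ("augment_region", augment_region)]


def run_quality_steps(record: dict, steps=STEPS) -> dict:
    """Apply each step in order and store its outcome alongside the data."""
    outcomes = []
    for name, step in steps:
        try:
            record = step(record)
            status = "ok"
        except ValueError as exc:
            status = f"failed: {exc}"  # the record continues unchanged
        outcomes.append(
            {"step": name, "status": status, "at": datetime.now(timezone.utc).isoformat()}
        )
    return {"data": record, "quality": outcomes}


print(run_quality_steps({"state": " mn "}))
print(run_quality_steps({"state": "zz"}))
```

Because the outcome list travels with the record, a BI system downstream can count how many rows failed augmentation without rerunning the quality process.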

Unstructured Comments

Identity resolution and metadata quality form the two prongs of the fork with which to skewer the unstructured data BI problem. First, identity resolution is a data quality process that yields key data from blocks of free-form text. Second, identity resolution reports its success criteria in the form of metadata, indicating what exactly is contained and how reliable the data might be. BI systems can then provide confidence measures about quantified information derived from unstructured data sources.

People feel most comfortable with unstructured data. It’s the music they hear and the text they read. So it isn’t going away; it is the world we live in. Data quality systems play a crucial role in bridging the facile understanding that BI systems deliver and the critical information contained in unstructured sources.
