Toxic Data

Published June 1, 1998

When you first load data into your new data warehouse or data mart, you should expect it to be of dubious quality. In fact, it may not be usable without a toxic data warning label. Examples of data quality problems, found after warehouses are loaded and knowledge workers query them, could provide enough laughs for a stand-up comedy routine. Consider, for example, the insurance company that found an unexpected number of hemorrhoid claims in a particular region. When asked why, clerks said they had no place to record "pain in the a--" customers, so they chose an appropriate claim code. The claims were paid as long as there was a valid claim code. "We had no idea anyone else was using that information," they said.

The solution most often prescribed for less-than-hoped-for data quality is a data cleanup effort, sometimes called data scrubbing. This approach involves a massive and ongoing effort to validate and correct data after you extract it from source systems and before you load it into your warehouse.

A Different Approach

Data cleanup efforts are like building a pollution treatment plant downstream on a river to clean up toxins dumped upstream. Wouldn't it be better to go to the pollution sources and prevent the poisons from entering the river in the first place? In the same way, rather than expending resources on a never-ending cleanup effort, it is more effective to selectively renovate your source systems so they produce clean, rather than contaminated, data.

But modifying source systems using traditional methodologies will be too time-consuming, too costly and, if history is any measure, too risky. To succeed at source system renovation, what you need before you start is information--hard facts--on the inconsistent data definitions, data validations and other source system incompatibilities. In other words, you need to know what the business rules are in each of the source systems and how, when viewed from a cross-functional perspective, the current rules produce inconsistent results.

A very powerful tactic for discovering dysfunctional business rules is mining the warehouse/mart itself. You can do this by extracting and loading the business transactions which define the event life history of your data mart's subjects. Business transactions are recordings of system responses to business events. Every potential data mart subject (customers, products, orders, accounts, etc.) has an event life history. For example, an order can begin its life as a proposal and then become an entered order, an accepted order, a fulfilled order and finally a billed-and-paid order. An insurance policy is first proposed, then rated and bound; premiums are paid, claims are made and paid, and eventually the policy is renewed or canceled.

As you store each business transaction into your mart, you apply to it a new set of cross-functional business rules. These validations, called life history business rules, are based on the expected sequence of events. For example, when you record the installation of a piece of equipment at a customer location, you should check that a shipment transaction has preceded it. If it has not, you have a miracle: you are getting paid for equipment you never built. Of course, later on, a "shipped not installed" data mart report will tell you about products that left your plant and apparently vaporized. When you do life history business rules checks, you will discover real business problems; there will be many, and some will be downright bizarre.
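
To make the idea concrete, here is a minimal sketch of a life history rule check in Python. It is illustrative only: the event names, the predecessor table and the record layout are assumptions for this example, not the rules any of the systems described in this article actually used.

# A minimal sketch of a life history business rule check.
# Event names and the rule table are illustrative assumptions.

# The expected predecessor for each event type in an equipment life history.
EXPECTED_PREDECESSOR = {
    "installed": "shipped",      # equipment must be shipped before it is installed
    "returned": "deinstalled",   # equipment must be deinstalled before it is returned
}

def check_life_history(events):
    """Given (date, event_type) tuples for one serial number, return the
    events that violate the expected life history sequence."""
    seen = set()
    exceptions = []
    for date, event_type in sorted(events):
        required = EXPECTED_PREDECESSOR.get(event_type)
        if required is not None and required not in seen:
            exceptions.append((date, event_type, f"missing prior '{required}' event"))
        seen.add(event_type)
    return exceptions

# Example: an installation with no preceding shipment is flagged.
history = [("1998-03-02", "installed"), ("1998-04-10", "returned")]
print(check_life_history(history))

The point is not the code but the rule: every subject's events must arrive in a believable order, and anything else is an exception worth investigating.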

So go ahead and load your data mart/data warehouse, but don't make it immediately available to users, at least not without caveats about suspect data quality. This can take some political will and savvy. What you have to sell is that the immediate value of the warehouse is not information for decision making, but for discovering the fundamental business rule issues which led to poor data quality in the first place.

A Real Example

Here's how one company applied the life history business rule approach.

A well-known computer manufacturer needed solid data about its installed base of equipment. Installed-base information could help people do a better job of managing current operations in many different areas of the business and also support opportunities for marketing new products and services.

The person most annoyed by the lack of good installed-base information was the vice president of customer engineering--Fred (not his real name). As only one example, when an engineering change (EC) needed to be made, Fred wanted to know how many pieces of equipment were involved, where they were located and the current EC status of each so he could plan and track the change effort. He couldn't do it. It was impossible! Some equipment wasn't found at the customer site where Fred's systems said it was located, some serial numbers were simply invalid and engineering change levels were often incorrect.

To improve the amount and accuracy of the information available, Fred funded and championed (this is important!) a data mart project. Fred acted as a good corporate soldier because he knew that others--marketing, for example--would also benefit greatly from the availability of good installed equipment information. The subject of Fred's data mart was the individual piece of equipment identified by a unique serial number. You will see later that, like Social Security numbers, serial numbers did not always turn out to be unique.

The Approach

First, Fred's project team identified the source systems that contained the business transactions which define the equipment life history. See Figure 1.


Figure 1

Then they interviewed representatives of the functional IT organizations owning each source system to document current business rules, including data definitions and data validation rules. This can be a very "interesting" exercise. At this point, if there were any hopes about data quality, those hopes were dashed. For instance, they found that many fields contained data values unrelated to the field name, and some code fields carried more than one fact. For example, the SHIPPED-COMPLETE code field, originally holding a Y or N value, also contained codes describing the reasons for late shipments.
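
A simple way to surface overloaded code fields like SHIPPED-COMPLETE is to profile each field's actual values against its documented domain. The Python sketch below shows the idea; the records and the undocumented values are invented for illustration.

# Sketch: profile a code field against its documented domain to spot overloading.
from collections import Counter

DOCUMENTED_DOMAIN = {"Y", "N"}  # what the data definition says the field may hold

# Invented sample of values actually found in the field.
field_values = ["Y", "N", "Y", "LATE-CARRIER", "LATE-CREDIT-HOLD", "N", "Y"]

counts = Counter(field_values)
undocumented = {value: n for value, n in counts.items() if value not in DOCUMENTED_DOMAIN}

print("value frequencies:", dict(counts))
print("values outside documented domain:", undocumented)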

Next, analysts interviewed potential data consumers in different departments, such as marketing, to determine the kinds of questions which might be asked of the installed-base data mart. They built a simple data model for the data mart which included a "transaction stack" for storing transactions by date. They also established a meta data repository. This repository contained source system, transformation and data mart business rules.
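
The "transaction stack" itself can be pictured as nothing more exotic than a dated list of transactions kept per subject and never overwritten. A rough Python sketch follows; the field names and source system labels are invented assumptions, not the project's actual design.

# Sketch of a transaction stack: every transaction for a subject is kept in
# date order rather than overwritten.
from collections import defaultdict

# One stack of dated transactions per serial number (the data mart's subject).
transaction_stack = defaultdict(list)

def post_transaction(serial_number, event_date, event_type, source_system):
    """Append a transaction to the subject's stack and keep it in date order."""
    transaction_stack[serial_number].append(
        {"date": event_date, "event": event_type, "source": source_system}
    )
    transaction_stack[serial_number].sort(key=lambda t: t["date"])

post_transaction("12345", "1998-01-15", "shipped", "distribution")
post_transaction("12345", "1998-02-03", "installed", "customer_engineering")
print(transaction_stack["12345"])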

The project team wrote programs which they inserted into each source system's nightly batch run. These programs extracted the day's transactions and transmitted them to the data mart system for processing.
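
The extract step was conventional batch work. The real extract programs ran in the source systems' own environments; the Python sketch below only shows the shape of such a daily step, and the field and file names are invented.

# Sketch of a nightly extract: pull the day's transactions from a source
# system and write them to a flat file for transmission to the data mart.
import csv
import datetime

def extract_days_transactions(source_rows, business_date):
    """Filter one day's transactions from the source rows."""
    return [row for row in source_rows if row["entry_date"] == business_date]

def write_extract_file(rows, path):
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["serial_number", "event", "entry_date"])
        writer.writeheader()
        writer.writerows(rows)

today = datetime.date(1998, 6, 1).isoformat()
source_rows = [
    {"serial_number": "12345", "event": "shipped", "entry_date": today},
    {"serial_number": "67890", "event": "installed", "entry_date": "1998-05-31"},
]
write_extract_file(extract_days_transactions(source_rows, today), "extract_19980601.csv")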

At the same time, a dedicated data integrity team, nicknamed NORAD, was established. NORAD's responsibilities included operational control over all source system extracts, all data mart posting and, most importantly, mining and analyzing results for data quality problems.

At the data mart, NORAD posted individual transactions sorted by date. Then they applied the life history business rules to each event, checking for the expected life history sequence. For example, each "return from customer" transaction should be preceded by a "deinstallation at customer site" transaction. Each exception found was posted, together with the transaction itself and the related events, to an exception database. There were so many exceptions that the exception database became larger than the mart itself! See Figure 2.


Figure 2
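
The posting step can be sketched the same way: apply the life history rules to each incoming transaction and, when a rule fails, write the violation, the transaction itself and the subject's related events to the exception store. Again, the names below are illustrative assumptions in Python, not the project's actual design.

# Sketch of posting with life history rules and an exception database.

exception_db = []  # stand-in for the exception database

def post_with_rules(serial_number, new_event, stack, expected_predecessor):
    """Post one transaction, recording a violation if its required predecessor
    has never been seen for this serial number."""
    history_events = {t["event"] for t in stack.get(serial_number, [])}
    required = expected_predecessor.get(new_event["event"])
    if required is not None and required not in history_events:
        exception_db.append({
            "serial_number": serial_number,
            "violation": f"'{new_event['event']}' without prior '{required}'",
            "transaction": new_event,
            "related_events": list(stack.get(serial_number, [])),
        })
    stack.setdefault(serial_number, []).append(new_event)

rules = {"installed": "shipped", "returned": "deinstalled"}
stack = {}
post_with_rules("67890", {"event": "installed", "date": "1998-06-01"}, stack, rules)
print(exception_db)  # the install is flagged: no shipment was ever recorded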

NORAD mined the exception database and analyzed the results. They established data quality measures, set baseline values and tracked progress.
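
A data quality measure need not be elaborate. Something as simple as the exception rate per rule per posting cycle, tracked against a baseline, is enough to show whether renovation is working. The figures in the Python sketch below are invented for illustration.

# Sketch: exception rate per rule for one posting cycle.
from collections import Counter

def exception_rate_by_rule(exceptions, transactions_posted):
    """Return the fraction of posted transactions that violated each rule."""
    counts = Counter(e["violation_rule"] for e in exceptions)
    return {rule: n / transactions_posted for rule, n in counts.items()}

day_exceptions = [
    {"violation_rule": "install-without-shipment"},
    {"violation_rule": "install-without-shipment"},
    {"violation_rule": "duplicate-serial-number"},
]
print(exception_rate_by_rule(day_exceptions, transactions_posted=200))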

Here are some examples of problems NORAD identified:

  1. Equipment with duplicate serial numbers was leaving manufacturing because of poor employee training.
  2. Inaccurate serial number recording was rampant in all systems. For example, these real serial numbers were reported as shipped: I, WANNA, RAISE, THIS, JOB and STINKS. Now think about this! These serial numbers were transmitted each night among departments trying to do customer billing, maintenance contract administration, asset management and the like. Many customer invoices were not being paid because of inaccurate equipment serial numbers. "Defensive" work-arounds also existed to creatively deal with these data feeds. The cost in non-productive clerical effort was enormous, and roughly forty percent of computer programs in nightly batch runs were designed solely to handle suspect data. For example, in one system, when a duplicate serial number arrived, the system automatically appended the characters DUP to the serial number, giving, for instance, 12345DUP.
  3. Some extracted transactions did not constitute a business event. For example, an equipment installation event was recorded as an add posting to the database, and an equipment move event was recorded as two transactions--first a delete and then later an add. As a result, when an add transaction was posted, it could not be determined if it was an installation or the second half of a move.

  4. The dates on which business events occurred were not recorded. Source systems used database update dates as transaction dates, so data entry schedules distorted any life history analysis based on business event dates.

After several months, detail and summary exception database results were shown to top management. Provided with hard information (including real examples) about poor data quality and its costs (not to mention its impact on building effective data marts), top management took action. "Took action" is a euphemism for stronger language. They made resources available to selectively renovate source systems to enforce new, more stringent and, most importantly, consistent cross-functional business rules. For example, no equipment left the shipping dock until the new serial number and model number edits were passed, even if shipping boxes had to be opened. Note that these procedures did not benefit the distribution function; in fact, they often slowed its operations. The benefit to the rest of the company, however, was enormous. The same serial number validation rules were reused in all systems.
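
The shipping-dock edit illustrates the payoff of a single, shared validation rule. A Python sketch of such a shared edit follows; the eight-digit format rule is an invented assumption, not the manufacturer's real numbering scheme.

# Sketch of a shared serial number edit reused by every system.
import re

SERIAL_NUMBER_PATTERN = re.compile(r"^\d{8}$")  # invented format rule

def is_valid_serial_number(serial_number, already_issued):
    """Reject malformed or duplicate serial numbers before shipment."""
    if not SERIAL_NUMBER_PATTERN.match(serial_number):
        return False
    if serial_number in already_issued:
        return False  # never "fix" a duplicate by appending DUP; reject it
    return True

issued = {"12345678"}
print(is_valid_serial_number("WANNA", issued))     # False: malformed
print(is_valid_serial_number("12345678", issued))  # False: duplicate
print(is_valid_serial_number("87654321", issued))  # True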

Almost a year after the initial load, the installed-base data mart was clean enough to open to data consumers. Success! The data mart effort was successful because its customers could trust the quality and integrity of the information they extracted.

Lessons Learned

These things are NOT optional when building a successful data mart/data warehouse:

  • The political support of someone with clout. Find a champion with a real business need for quality information. Find your Fred.
  • Fred must understand that, after spending a lot of money, the initial value of the data mart will not be good information; it will be the opportunity to have excellent information in the not-too-distant future, after spending more money.
  • Focus on business requirements for information. A "build it and they will come" approach won't work.
  • A simple data model. Do keep it simple. Don't over-model. All you can be sure of is that the mart will evolve.
  • A meta data repository which includes all the business rules. This repository holds enterprise business rules for impact analysis and later reuse.
  • A data integrity team staffed and dedicated to data warehouse/mart data quality.
  • Data quality criteria, constant measurement, feedback and corrective action.
  • Estimates of the costs of poor data quality in key business activities. For example, unpaid invoices and installed equipment replaced by competitive products.
  • When your data mart is finally open for business, proclaim success!

Benefits of the Life History Business Rules Approach

Focusing on life history business rules will help you greatly reduce your risk of data warehouse/mart failure. Business rules shift the focus from technical issues to the terms and measures business people understand. You can give solid information on the state of the organization's data resource and the costs of bad data. For the first time, management will see the impact of poor data quality, not just on building successful data warehouses and marts, but in degraded performance throughout the organization. With the business rules approach you can engage business people in your efforts and get the management support you need for success.

This approach brings the knowledge of business people and the expertise of IT maintenance staff to bear. You need no new technical skills, products or training. Your current systems investment is used and also enhanced as a byproduct.

Invest in Data Quality

The surest way to ensure data warehouse/data mart failure is to ignore data quality issues. You can have a technical success and a business failure. Flawless execution of warehouse building, illuminated by brilliant data models and using the latest servers, middleware and OLAP tools, cannot compensate for inaccurate, incomplete and inconsistent data. Prepare to invest prodigious amounts of time and effort in data quality. Using the life history business rules approach can greatly improve your chances of success. An initial success will generate demand for further warehouse/mart development.
