Editor’s Note: This is the first article in a series focusing on assessing data quality issues. The next installment will be published on Friday, September 22, 2000, in DMReview.com’s online columnist section (http://www.dmreview.com/onlinecolumnists/).

"One accurate measurement is worth a thousand expert opinions"
(Grace Hopper, Admiral, U.S. Navy)

About every two years, I pay a visit to my doctor for a routine checkup; my last one was just a few months ago. I’m sure you know the routine – weight, blood pressure, pulse rate, EKG, a little probing here and there. But, the face-to-face encounter with my doctor was not the end of this saga. I was sent down the street to a lab where two vials of blood were drawn and sent for routine testing. The analysis of that blood gave the doctor key measurements that would tip him off about the health of my bodily processes. For example, numerical measurements of my glucose, electrolytes, enzymes, proteins, blood fats and hormones that fall outside the normal range may indicate a problem with a specific part of my body. So, too, an examination of a corporation’s data can uncover flaws that indicate problems with the business processes that produce it.

Beware of the App with No Data Quality Problems

Ever notice that some of the most important principles we learn in business are also some of the simplest? One of the first such principles I learned was "(As a manager,) you must be able to measure that which you seek to manage." It was so simple that I didn’t pay much attention when I first heard it 15 years ago. But, that principle cuts to the very core of what business intelligence and information systems are. How profitable is product XYZ? How profitable is your customer, company ABC? How profitable is the enterprise? You can bet that jobs depend on these measures – product manager, sales manager, CEO, etc. Better yet, how about the question: How clean is your data? Assuming someone in the enterprise is accountable for data quality (and that’s a big assumption), what measure can be used to answer the question?

Does your data management organization know the level of your data quality? In many of the organizations I’ve seen, the answer is a qualified yes. That is, they know the data is flawed because users are complaining about their reports, files and applications. But, even these organizations don’t know the full spectrum of their data’s level of quality. Worse yet is the enterprise that thinks its data is good because its users are not complaining – unfortunately, this is often a result of users who have given up hope of having data quality improved.

Thomas Redman, in his book Data Quality for the Information Age, indicates that companies typically choose one of five methods to find out about the quality of their data. His list includes:

  • Customer complaints
  • User interviews
  • Customer satisfaction surveys
  • Data quality requirements gathering
  • Data quality assessments

To that list, I can add complaints from the analysis and/or development staff responsible for a data migration effort. Projects attempting to integrate data are especially vulnerable to data quality issues. A recent study by the Standish Group states that 83 percent of data migration projects overrun their budget (or fail) primarily as a result of misunderstandings about the source data and meta data. Similar surveys conducted by the GartnerGroup point to data quality as a leading reason for overruns and failed projects.

But, My Apps are Running Just Fine

You might be thinking, "My apps are running just fine. Our mission-critical, back-office or front-office system runs with little or no problem." You might be right. Problems with mission-critical data would have been reported and cleansed in the initial months of the system’s life. However, these systems often collect lots of data that are not used in the mission-critical process. For example, telecommunications companies often build their data warehouses from data collected from their billing system(s). Telco billing systems are nearly flawless at getting the bills out to the customer. The data you see on your statement has a high level of quality. However, when other data is needed by a downstream system, such as with a data warehouse, it may show signs of neglect. I once had this problem while trying to roll up data from the billed-to account data to its parent level, the customer. There was no problem with the account data that was used on the bill. However, the data collected for the ultimate "customer" was riddled with problems.

Data in Context

If you have accepted that a data quality assessment may be a worthy cause, I’d like to share some good news and some bad news. The good news is that, when you’ve completed the assessment, it will yield good meta data. The bad news is that you will have to assess (or create) meta data as part of the process. Let me show you why. Can you determine if the following table contains any data quality issues? If yes, what are they?

44321
31122
32121
93422
54432
.….

Figure 1

When I ask this question, someone usually responds, "You can’t tell" or "There isn’t enough information." This is exactly the point. Without some criteria to make a judgment about the data, you can’t say how good or bad the quality of the data is. Now, what if I supply the following supplemental data:

- Metadata Dictionary -
The table contains a single field taken from an open order list; the field is described below:
Field: ord_no     Long Name: Order Number
Type: Integer
Range: 00000 – 89999
Description: Order Number is generated by the SOP system (Strategic Order Processing System).
An order number uniquely identifies an order placed by a customer.
Order numbers are assigned sequentially. Line Items roll up to a single order number.

Figure 2

Now can you determine the data quality for this field? You sure know a lot more about the table. It contains the data of a single field called ord_no. The field contains order numbers, an integer field, ranging from 00000 to 89999. We were provided the name of the source system. We were told that values are to be unique and that, somewhere in the enterprise, line-item data has a dependency on the order. We can, in fact, begin to make an assessment of the data’s level of quality. Careful analysis shows that the fourth value in the table (93422) falls outside the valid range and does not meet our criteria for valid data – so our assessment is that the data quality for ord_no is something less than 100 percent. This example illustrates a critical point – a data quality assessment is a reconciliation of data to its meta data.
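To make that reconciliation concrete, here is a minimal Python sketch that checks the Figure 1 values against the rules in the Figure 2 meta data dictionary. The function name and the simple percent-valid score are my own illustrative assumptions, not part of the SOP system or of any particular tool.

```python
ord_no_values = [44321, 31122, 32121, 93422, 54432]  # the Figure 1 sample

def assess_ord_no(values):
    """Reconcile ord_no values with the Figure 2 meta data; return failures and a percent-valid score."""
    failures = []
    seen = set()
    for value in values:
        if not isinstance(value, int):        # Type: Integer
            failures.append((value, "not an integer"))
        elif not 0 <= value <= 89999:         # Range: 00000 - 89999
            failures.append((value, "outside the range 00000-89999"))
        elif value in seen:                   # uniquely identifies an order
            failures.append((value, "duplicate order number"))
        seen.add(value)
    pct_valid = 100.0 * (len(values) - len(failures)) / len(values)
    return failures, pct_valid

failures, pct_valid = assess_ord_no(ord_no_values)
print(failures)     # [(93422, 'outside the range 00000-89999')]
print(pct_valid)    # 80.0 -- something less than 100 percent
```

Run against the five sample values, the sketch flags only the fourth one, which is exactly the reconciliation described above.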

It is not possible to know your data’s quality without a field-by-field assessment. Furthermore, simply having the data is not enough – you must have the context in which the data is meant to exist. To put it in other terms, you must also have the meta data.

The More Plentiful the Meta Data, the More Extensive the Assessment

Of course, finding or creating meta data is an issue unto itself. Generally speaking, one can find meta data in data models, data dictionaries, repositories, COBOL copybooks, specifications, etc. If current meta data does not exist, then a subject matter expert will be needed. In many efforts I have been involved in, meta data did not exist. In those cases, statistical results generated from the data quality assessment provided much-needed intelligence about the data. A much-desired byproduct of a data quality assessment is meta data; the greater the participation of the subject matter experts in validating our findings, the better the quality of the meta data and of the assessment.
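As an example of the kind of statistical results I mean, the sketch below profiles a single field from a flat file so that subject matter experts have candidate meta data to validate. It is an illustrative Python sketch under stated assumptions: the CSV layout, file name and column name are hypothetical, and a real assessment would add pattern and frequency analysis.

```python
import csv
from collections import Counter

def profile_column(path, column):
    """Summarize one field: row count, nulls, distinct values, min/max and length frequencies."""
    with open(path, newline="") as f:
        values = [row[column].strip() for row in csv.DictReader(f)]
    non_null = [v for v in values if v != ""]
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "length_frequencies": dict(Counter(len(v) for v in non_null)),
    }

# Example usage (file and column names are hypothetical):
# print(profile_column("open_orders.csv", "ord_no"))
```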

The assessments I have conducted focus on one or more of the following types of quality criteria:

  1. Data type integrity
  2. Business rule integrity
  3. Name and address integrity

If my team knows nothing more than field names, types and sizes (for example, if only a COBOL copybook is available), then our focus will be on testing each field’s integrity based on its type (numeric, alphanumeric, date, etc.). If we are provided additional characteristics of the field (domain, relationship with other fields, etc.), then business rule integrity testing should also be performed. Finally, if name and address data is critical (particularly if it will be consolidated with other data), then name and address integrity testing should be performed. (We will discuss more about these tests in next month’s column.)
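To give a rough feel for the first two levels, here is a hedged Python sketch of a data type integrity check and a single business rule check. The field layout, the YYYYMMDD date format and the ship-date rule are illustrative assumptions, not rules taken from any particular system.

```python
from datetime import datetime

def has_type_integrity(raw_value, field_type, size):
    """Data type integrity: does the raw value fit its declared type and size?"""
    value = raw_value.strip()
    if len(value) > size:
        return False
    if field_type == "numeric":
        return value.isdigit()
    if field_type == "date":
        try:
            datetime.strptime(value, "%Y%m%d")  # assumed date layout
            return True
        except ValueError:
            return False
    return True  # alphanumeric: any value that fits the size passes this level

def violates_business_rule(record):
    """Business rule integrity: a cross-field rule drawn from richer meta data."""
    # Hypothetical rule: a ship date may not precede its order date.
    return record["ship_date"] < record["order_date"]

# Example usage with hypothetical fields:
# has_type_integrity("20000922", "date", 8)                              -> True
# violates_business_rule({"order_date": "20000901",
#                         "ship_date": "20000825"})                      -> True
```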

Bugs and Viruses

How many of us were waiting in corporate emergency rooms last New Year’s Eve awaiting the results of the corporation’s Y2K corrective surgery? For the last few years, all eyes were focused on avoiding the fatality that would result from ignoring the Y2K bug. Estimates of how much was spent worldwide to eliminate this anomaly ran to $5 billion or more. Y2K was an easy diagnosis to make; there was a worldwide alert to inform us about this electronic plague. How many more problems remain hidden?

Yes, we can wait around, perhaps with a false sense of security, believing that we are in tip-top condition. But are we fooling ourselves? On a day-to-day basis, I don’t think too much about my health – other than to try to exercise and eat healthily. But, I have to admit, after being told by my doctor that I had a clean bill of health, I had peace of mind. It’s a sign of the times that we can talk to each other about our levels of "good" and "bad" cholesterol – a measurement created from our last blood test. But, how about those of us professionals responsible for data management – do you know the quality of your data?

In next month’s column, we will discuss how to conduct a data quality assessment.
