When Poor Data Quality Strikes Home
It’s easy for us information systems professionals to listen to a story about some other company’s data quality issue and how it cost them thousands (or millions) of dollars and feel relieved because we believe "that particular affliction cannot happen here." In fact, we probably believe that it is even more remote that we can be the cause of a data quality problem. However, I took an informal survey at a recent conference which revealed a different story.
There is an information system which many of us use of on a month-to-month basis. In using this system, each of us participates in one or more of the following functions: accounts receivable, accounts payable, auditing and tax appropriation. That’s right, it’s your checkbook. In my family, my wife Linda and I share a single checking account. Linda usually carries the checkbook and writes most of the checks. Occasionally, I’ll take it with me to the supermarket. Like me, most of the folks in my informal survey acknowledged that we might be a little "loose and fast" when it comes to entering data into our checkbook ledgers. In my case, when I’m at the supermarket, I hastily enter an amount into the ledger, but not the check number or the party the check was written to. Some people acknowledged that they forget to enter checks at all and, in some cases, the wrong amount might be entered. (As I write this column, my college-bound son, Michael, reminds me that he forgot to enter an ATM withdrawal into his checkbook.)
In any event, if it’s like my household, terror strikes at the end of the month when its time to balance the checkbook. Each data anomaly contributes to the time and frustration it takes to finally complete the monthly reconciliation. Fortunately, our suffering has led us to pay more attention to data quality. As a first step to balancing the checkbook, we make sure that there is a ledger entry for each check and that it was correctly entered in the checkbook.
This manual, row-by-row examination is not practical in our databases that may hold thousands or millions of rows. However, a data quality assessment is an effective means to point us at fields where problems exist so that we can detect where data anomalies are, fix the flawed process that has created the data and fix the flawed data itself.
The Data Quality Assessment
Last month we discussed why the data quality assessment is important. Given the alternatives, there really is no other reliable and timely way to measure and manage data quality. This month we’ll explore how a data quality assessment should be conducted.
What to Assess
First of all, it is important to tie the assessment to a project that will benefit from the quick results provided by the assessment. In the business intelligence practice where I work, the data quality assessment works best when it is performed during the data survey as soon as the data warehouse source data has been identified. The most effective assessments are those that are performed closely with the users. Users should be encouraged to prioritize the source fields, to provide the appropriate meta data definition and validate the results of the survey when it is completed.
As I stated in last month’s column, as information systems professionals, we must be able to measure what we manage. As managers of data, we need measurements that will tell us how good or bad our data is. We need a statistical report showing how closely our data complies with the definition and business rules we have established for it. To perform the data quality assessment we will need two things: a definition and business rules for each field of the assessment and a certification or a grading system.
The definition and business rules (a.k.a., meta data) will determine the type and size of the field, its allowable set of values, whether the field is required, if it can be blank or null and the relationship it has with other fields. The certification is implemented with a grading system that is understood by all parties to quickly explain how closely the data corresponds to its meta data. On projects I have run, a letter-grading system proved adequate (i.e., A, B, C, D, F). A percentage of rows that pass the respective test were associated with a corresponding grade. For example, an A+ meant 100 percent compliance with the meta data; a B- might mean that the field had 98 percent compliance, etc.
At the field level, there are three levels of certification tests that can be run:
- Field type integrity
- Business rule integrity
- Customer data integrity
Field type integrity is exactly what it says it applies measures based on the type of field it is. First, it measures for existence (the percentage that’s not null, blank or defaulted) scoring for required fields can thus be calculated. Further tests are applied based on its field type a date field will be tested for real dates (measuring occurrences of bogus dates such as 02/30/2001) and numeric fields for the occurrence of nonnumeric data.
Business rule integrity requires more meta data than field type integrity. Field size and legal domain (or range of values) will allow us to measure the percentage of rows having an allowable value. Often, fields have relationships with other fields that further define its allowable domain set. For example, if AGE is 15 or less and the STATE field is "NJ," then the DRIVERS LICENSE field must be blank.
Customer data integrity pertains to those checks that are performed on data fields that identify and describe people, households or businesses (name and address fields, phone numbers, e-mail addresses, etc.) associated with your customers (see Figure 1). Data quality assessment is particularly critical when multiple data sources of customer data must be consolidated or integrated. It is also critical when you are not sure if and how many customer records have been duplicated in your database.
Figure 1: Is this the same person?
Customer data integrity usually requires a specialized tool and associated tables to perform adequate testing (i.e., Firstlogic, Trillium, etc.)
Ultimately, our statistical reports will explain how closely the data complies with the meta data you have been using. However, those in the IS department are not the best suited to make the final judgment as to where the problem is. If the field shows a problem, IS wants to know if the problem with the data or the meta data. If the field Customer_type is scored a C because three percent of its rows equal "P," IS wants to know if "P" really is an allowable value or if these rows are flawed. Only the subject matter expert (SME) can make this final determination. Each certification report needs to be reviewed by an SME. In my experience, more often than not, the meta data will need to be enhanced and the score adjusted.
Byproducts of the Data Quality Assessment
There are several beneficial byproducts of the data quality assessment. First and most importantly, the assessment team is creating a new and enhanced set of meta data. During the assessment process, the team has been gathering meta data, interviewing SMEs, validating reports and enhancing meta data. This enhanced meta data must be stored in a repository and made available to the entire enterprise.
Secondly, the certification reports serve as an important piece of newly created meta data. In projects I have led, we consolidated the certification reports with the other available meta data. In the data warehouse environment, a portal served as an ideal place to give users access to both data and meta data including the data quality certification reports. The posting of meta data became part of the knowledge management function where methods of posting and searching for field meta data were provided to the enterprise.
Finally, the testing and analysis of data quality often leads to a discovery process where the team discovers new field relationships. For example, performing a data quality assessment was the first step of a data warehouse project my company performed for a life insurance company. As we compiled certification reports, we decided we could quickly run a series of tests to determine if there was a relationship in the sequence of some of the significant events of a policy. In one case, we tested to see if there was a consistent sequence between POLICY_ISSUE_DATE and POLICY_EFFECTIVE_DATE. We found that there was no such sequence that both fields were usually the same date and, when they differed, there was no consistent pattern. However, we also observed that in a small number of policies, the issue date and the effective date differed by as much as 15 years. Bringing this peculiarity to the SME’s attention confirmed that we uncovered an unusual problem that no one had seen before.
In next month’s column we will examine an organization that for the better part of the 20th century maintained perfect data quality.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access