After 20 years as a data warehousing practitioner, I've seen my share of data quality issues. Most of them surface unexpectedly and cost extra development time and budget. Many harm the business directly. For many business users, bad data quality has become an unfortunate fact of life: they learn the reasons for the bad data, accept them and do the best they can.

In other situations, bad data quality lurks in the system and neither the business nor IT knows it is there. For example, sales transactions may have missing product category codes. When reports join the sales transactions to the product category table, the transactions with missing codes are dropped from the result and sales figures are underreported.
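As a rough illustration of how that underreporting happens (not the actual system), here is a minimal sketch in Python with pandas, using hypothetical table and column names. An inner join silently drops the transaction with the missing code; a left join keeps it visible in an "Unknown" bucket.

    import pandas as pd

    # Hypothetical sales transactions; one row has a missing product category code.
    sales = pd.DataFrame({
        "txn_id":   [1, 2, 3],
        "category": ["A", None, "B"],
        "amount":   [100.0, 250.0, 75.0],
    })

    # Hypothetical product category reference table.
    categories = pd.DataFrame({
        "category": ["A", "B"],
        "name":     ["Apparel", "Books"],
    })

    # Inner join drops the transaction with the missing code:
    # total reported sales become 175.0 instead of 425.0.
    inner = sales.merge(categories, on="category", how="inner")
    print("Inner join total:", inner["amount"].sum())

    # Left join keeps every transaction; unmatched rows are bucketed as "Unknown",
    # so the shortfall is visible instead of silently hidden.
    left = sales.merge(categories, on="category", how="left")
    left["name"] = left["name"].fillna("Unknown")
    print("Left join total:", left["amount"].sum())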

Recently, I was working on a project for a company that had the classic set of data quality issues. The company did not standardize its name and address data, so there were many duplicate customer records. The immediate result was duplicate mailings and multiple agent contacts; the downstream impact was a damaged company image, as customers lost respect for a company that couldn't keep its contacts straight. Another problem was poor quality data in external feeds from business partners, which meant extra phone calls and delays to resolve daily problems. The company also suffered from:

  • Data that wasn't synchronized: daily transaction files didn't match the customer data.
  • Data that was extracted by an individual user, then passed around to other users who didn't understand the context of the data or its relationship to other applications.
  • Operational applications that failed to enforce rigorous screen input checks, allowing bad data to flow into the data warehouse.

Formalizing a Data Quality Program

We recommended the establishment of a formal and sustained data quality program, which we developed in a series of phases. The first order of business was to convey the urgency of the poor data quality to management.

Proof-of-Concept Phase

When it comes to data quality, Pareto's rule usually applies: 20 percent of the data causes 80 percent of the problems. We started with the areas that traditionally have data quality problems: customer name and address data, legacy data sources and external feeds from business partners. We interviewed users who accessed customer, legacy and external data, and they immediately pinpointed dozens of data quality problems they saw on a frequent basis. We extracted the data sources and profiled the problem data in a variety of ways: duplicate names and addresses, numeric fields stuffed with comments and flags, codes that do not exist in code tables, and transactions that do not match customer and account data.
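The profiling was done with dedicated tooling, but the individual checks reduce to a few simple set operations. Here is a hedged sketch in Python with pandas, using invented frame and column names rather than anything from the actual project:

    import pandas as pd

    def profile(customers: pd.DataFrame,
                transactions: pd.DataFrame,
                code_table: pd.DataFrame) -> dict:
        """Count a few common data quality problems (illustrative sketch only)."""
        findings = {}

        # Duplicate customers: the same name and address appearing more than once.
        findings["duplicate_customers"] = int(
            customers.duplicated(subset=["name", "address"], keep=False).sum()
        )

        # Numeric fields stuffed with comments/flags: values that are missing
        # or fail numeric parsing.
        amounts = pd.to_numeric(transactions["amount"], errors="coerce")
        findings["non_numeric_amounts"] = int(amounts.isna().sum())

        # Codes that do not exist in the code table.
        valid_codes = set(code_table["code"])
        findings["invalid_codes"] = int(
            (~transactions["product_code"].isin(valid_codes)).sum()
        )

        # Orphan transactions: a customer_id with no matching customer record.
        findings["orphan_transactions"] = int(
            (~transactions["customer_id"].isin(customers["customer_id"])).sum()
        )

        return findings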

We documented these problems in a business context to illustrate how poor data quality costs the business in terms of wrong decisions and reconciliation efforts. We were able to quantify how multiple mailings and customer contacts cost money, and how bad codes and orphan transactions distorted report results.

Adoption Phase

Once we proved to executives that correcting data quality problems would help the business, we needed an action plan. We created data quality implementation standards describing common design approaches: how to structure the data, how to detect and report errors, and how source systems should be corrected. Once these standards were in place, we applied them to each subject area in turn to profile, report and correct the data problems we found. Interestingly, most tables had no quality problems at all; the trouble was concentrated in roughly 20 percent of the data, mostly legacy sources and poorly designed tables.
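One of those standards, for example, was to never drop bad rows silently: every load splits off failing records into an error table with a reason code, so they can be reported and fed back to the source system. The following is a hedged sketch of that pattern in Python, with hypothetical column names and reason codes; in this simple version a row that fails more than one check keeps only the last reason assigned.

    import pandas as pd

    def validate_load(rows: pd.DataFrame, valid_codes: set) -> tuple:
        """Split an incoming batch into clean rows and rejects with reason codes (sketch)."""
        reasons = pd.Series("", index=rows.index)
        reasons[rows["customer_id"].isna()] = "MISSING_CUSTOMER_ID"
        reasons[~rows["product_code"].isin(valid_codes)] = "UNKNOWN_PRODUCT_CODE"

        rejects = rows[reasons != ""].assign(reject_reason=reasons[reasons != ""])
        clean = rows[reasons == ""]
        return clean, rejects

    # Clean rows continue into the warehouse load; rejects go to an error table
    # that feeds the daily data quality report and the source-system fix list.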

Production Phase

A common data quality misunderstanding is that once data is fixed, it stays fixed. In truth, new data and changes to existing systems generate new data quality problems that need to be reported and corrected continuously. For this reason, the term "program" is more appropriate than "project." Data arriving from external feeds will carry new and unexpected content over time, so ongoing data profiling was implemented on a production schedule to standardize content and report emerging data problems.
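For the external feeds, one recurring production check is simply to compare each day's distinct values against everything seen before, so unexpected new content surfaces immediately. A minimal sketch, again with invented names and assuming the feed is already loaded into a pandas frame:

    import pandas as pd

    def report_new_values(feed: pd.DataFrame, column: str, known_values: set) -> set:
        """Return values in today's feed that have never been seen before (sketch)."""
        todays_values = set(feed[column].dropna().unique())
        new_values = todays_values - known_values
        if new_values:
            # In production this would go to the data quality report, not stdout.
            print(f"New {column} values in feed: {sorted(new_values)}")
        return new_values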

Another aspect of data quality is data enrichment. We drove additional value by enriching the warehouse with customer profile information from established vendors who maintain it for marketing purposes. This data was loaded and matched with the existing customer data to further sharpen our understanding of customer behavior.
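The enrichment step itself is essentially a match-and-merge on whatever key the vendor and the warehouse share. The sketch below assumes a shared customer_id and a vendor-supplied segment column, both hypothetical; a left join keeps every existing customer, and the match rate becomes a data quality metric in its own right.

    import pandas as pd

    def enrich_customers(customers: pd.DataFrame,
                         vendor_profiles: pd.DataFrame) -> pd.DataFrame:
        """Attach vendor-supplied profile attributes to existing customers (sketch)."""
        enriched = customers.merge(
            vendor_profiles,
            on="customer_id",      # hypothetical shared key
            how="left",
            suffixes=("", "_vendor"),
        )
        match_rate = enriched["segment"].notna().mean()  # "segment" is an assumed vendor column
        print(f"Vendor match rate: {match_rate:.1%}")
        return enriched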

The saying "quality is built-in" implies a conscious effort to design components with quality in mind, readjust upstream processes when quality defects are found and maintain a proactive readiness for continuous improvement of business processes and data. This company used improved data quality as a springboard for better decision-making. Problem solved!  
