Claudia would like to thank Lowell Fryman for his assistance in writing this month's column.

This is the third of a three-article series on data quality within the data integration/transformation processes. The first article (April, 1999) identified the need for a structured, automated approach and proposed a two-pronged program of process audits and data quality control. The second article (May, 1999) examined the process audits component in greater detail. This article examines data quality control.

The old computing proverb "garbage in, garbage out" hasn't lost its relevance over the years. Data warehouse queries based on faulty data will probably lead to faulty analysis. If our goal is valid analysis, then, in addition to posing well-formed queries, we need to load the data warehouse with the best possible data.

The users need a defined level of confidence in the quality of their data, and they need to understand that quality, whatever its level. Assuming the source data is sound to begin with, the integration/transformation (I&T) processes should provide this understanding. By auditing the I&T processes and applying metrics to the source data attributes, we can measure and track data quality. Measuring data quality means comparing the accuracy and completeness of specific data attributes against the business rules that govern the enterprise's strategic decision support analysis.

It is probably not appropriate to define quality measurements for every attribute in the data warehouse; each attribute has its own strategic value to the organization. In fact, the strategic value of an attribute may vary, depending on the business unit performing the analysis and the type of analysis being performed (e.g., strategic, tactical or operational).

The following is a sample of quality measures that are often useful. The number of times the attribute's value:

  • Was null
  • Was not null
  • Was valid
  • Was invalid
  • Equaled its default value
  • Was replaced
  • Caused the record containing it to be rejected

In addition, measures across multiple attributes in the same record may be correlated.
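
To make these measures concrete, the following is a minimal sketch in Python of how an I&T step might tally them for a single attribute. Nothing here comes from the article itself: the field names, validation rules and default values are illustrative assumptions, and only a subset of the measures is shown (replacement counting, for example, is omitted).

    from collections import Counter

    # Illustrative validation rules and defaults; a real I&T process would
    # draw these from the meta data repository, not hard-code them.
    VALIDATORS = {
        "customer_id": lambda v: isinstance(v, int) and v > 0,
        "state": lambda v: isinstance(v, str) and len(v) == 2,
    }
    DEFAULTS = {"state": "XX"}

    def measure_attribute(records, attr):
        """Tally per-attribute quality measures over one batch of records."""
        counts = Counter()
        for record in records:
            value = record.get(attr)
            if value is None:
                counts["null"] += 1
                continue
            counts["not_null"] += 1
            if value == DEFAULTS.get(attr):
                counts["default"] += 1
            if VALIDATORS[attr](value):
                counts["valid"] += 1
            else:
                counts["invalid"] += 1
                # Here any invalid value rejects the whole record; real
                # rejection rules would be attribute-specific.
                counts["rejected"] += 1
        return counts

    records = [
        {"customer_id": 42, "state": "CO"},
        {"customer_id": -1, "state": "XX"},
        {"customer_id": 7, "state": None},
    ]
    print(measure_attribute(records, "state"))
    # Counter({'not_null': 2, 'valid': 2, 'default': 1, 'null': 1})
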
Many organizations perform some quality measures in their I&T processes today. However, the results of these measurements are usually produced in hard-copy reports that are frequently not available to those who need them most. Since the users rarely see these reports, they are unlikely to be aware of the data quality problems revealed by the I&T processes. Even if the I&T processes produce high-quality data, without confirmation, the users will still lack a defined level of confidence in their analyses.

In contrast, by storing the results as meta data, we can quickly, efficiently and simultaneously deliver the results to multiple users. For instance, meta data queries can be formed using tolerance criteria established by the users. Then, data warehouse administrators can execute the queries to determine quickly and reliably whether the data resulting from the current iteration of the I&T processes should be loaded into the data warehouse. A meta data repository also allows the users to quickly determine the content of the data warehouse at any time and permits them to track trends in data quality.
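
As a sketch of how such tolerance-based meta data queries might work, the load decision can be reduced to comparing the stored measures against user-established limits. The measure names and thresholds below are assumptions for illustration, not anything the methodology prescribes; in practice the counts would be read from the meta data repository rather than passed in directly.

    # Hypothetical tolerance criteria: the maximum acceptable fraction of
    # rows failing each measure. Real criteria are established by the users.
    TOLERANCES = {"null": 0.02, "invalid": 0.01, "default": 0.05}

    def ok_to_load(quality_metadata, total_rows):
        """Decide whether one I&T iteration should be loaded.

        quality_metadata maps attribute -> {measure: count}, i.e. the
        counters recorded as meta data during the I&T run.
        """
        violations = []
        for attr, counts in quality_metadata.items():
            for measure, limit in TOLERANCES.items():
                rate = counts.get(measure, 0) / total_rows
                if rate > limit:
                    violations.append((attr, measure, rate))
        return (not violations), violations

    meta = {"state": {"null": 40, "invalid": 3},
            "customer_id": {"invalid": 25}}
    decision, problems = ok_to_load(meta, total_rows=1000)
    print(decision, problems)
    # False [('state', 'null', 0.04), ('customer_id', 'invalid', 0.025)]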

A viable process audits and data quality controls program requires an organization to:

  • Integrate the audits and controls into the data warehouse development life cycle.
  • Implement a data stewardship program within the business community to provide definition, direction, communications and management of the audits and controls program.
  • Market the program internally based on the value delivered to the customer and the effort necessary to ensure the quality of the data.
  • Develop automated functionality to reduce, to the extent possible, the resources devoted to manual balancing.
  • Develop a formal reconciliation methodology for the data warehouse data acquisition process (a sketch follows this list).
  • Incorporate a business-driven review, investigation and resolution process to correct data that is "not within quality tolerances."
  • Address audit, control and integrity variances as close to the source as possible, and as soon as possible.
  • Develop a GUI-based application to provide effective reporting of the information contained in the meta data repository.
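
To illustrate the reconciliation item above: a formal reconciliation can be as simple as verifying that every extracted record is accounted for, either loaded or rejected, and that control totals balance the same way. This is a hedged sketch; the counts, amounts and tolerance parameter below are hypothetical.

    def reconcile(source_count, loaded_count, rejected_count,
                  source_amount, loaded_amount, rejected_amount,
                  amount_tolerance=0.0):
        """Verify that extracted data is fully accounted for.

        Every source record must be either loaded or rejected, and
        control totals (e.g., dollar amounts) must balance the same way.
        """
        count_gap = source_count - (loaded_count + rejected_count)
        amount_gap = source_amount - (loaded_amount + rejected_amount)
        balanced = count_gap == 0 and abs(amount_gap) <= amount_tolerance
        return balanced, count_gap, amount_gap

    # Example run: 10,000 records extracted, 9,950 loaded, 50 rejected.
    print(reconcile(10000, 9950, 50, 1250000.00, 1243700.00, 6300.00))
    # (True, 0, 0.0)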

The business value of creating and communicating a defined level of certainty for data quality, although somewhat intangible, is often enormous and worth the effort. By adopting a structured process audits and data quality control methodology, organizations can increase their likelihood of success. Finally, in situations where a manual methodology is already in place, conversion to an automated approach based on a meta data repository can improve communications and effectiveness while reducing costs.
