William wishes to thank Stuart Mullins, Lucidity Consulting Group, for his contribution to this month's column.
In recent years, corporate scandals, regulatory changes and the collapse of major financial institutions have brought much-warranted attention to the quality of enterprise information. We have seen the rise and assimilation of tools and methodologies that promise to make data cleaner and more complete. Best practices have been developed and discussed in print and online. Data quality is no longer the domain of just the data warehouse; it is accepted as an enterprise responsibility. If we have the tools, the experience and the best practices, why, then, do we continue to struggle with the problem of data quality?
The answer lies in the difficulty of truly understanding what quality data is and in quantifying the cost of bad data. Because poor data quality presents itself in so many ways, it isn't always clear why problems occur or how to correct them. We plug one hole in our system, only to find more problems elsewhere. If we can better understand the underlying sources of quality issues, we can develop a plan of action that is both proactive and strategic.
Each quality issue presents challenges both in identifying where problems exist and in quantifying their extent. Quantifying the issues is important in order to determine where our efforts should be focused first. A large number of missing email addresses may look alarming, but it has little impact if there is no process or plan for communicating by email. It is imperative to understand the business requirements and to match them against the assessment of the problem at hand. Consider the following seven sources of data quality issues.
- Entry quality: Did the information enter the system correctly at the origin?
- Process quality: Was the integrity of the information maintained during processing through the system?
- Identification quality: Are two similar objects identified correctly to be the same or different?
- Integration quality: Is all the known information about an object integrated to the point of providing an accurate representation of the object?
- Usage quality: Is the information used and interpreted correctly at the point of access?
- Aging quality: Has enough time passed that the validity of the information can no longer be trusted?
- Organizational quality: Can the same information be reconciled between two systems based on the way the organization constructs and views the data?
A plan of action must account for each of these sources of error. The sources differ in how easily their problems can be detected and corrected, and an examination of each reveals that the associated costs, and the difficulty of addressing them, vary widely.
Entry quality is probably the easiest problem to identify but is often the most difficult to correct. Entry issues are usually caused by a person entering data into a system. The problem may be a typo or a willful decision, such as providing a dummy phone number or address. Identifying these outliers or missing data is easily accomplished with profiling tools or simple queries.
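As a sketch of what such a simple profiling query looks like, the following Python snippet flags missing, dummy and malformed phone numbers in a batch of customer records. The field name, the dummy-value list and the format rule are illustrative assumptions, not specifics from this column.

```python
# Minimal entry-quality profiling sketch: count missing, dummy and
# malformed phone numbers in a list of customer records.
import re

# Values a clerk might type just to get past a required field (assumed).
DUMMY_PHONES = {"000-000-0000", "111-111-1111", "123-456-7890"}

PHONE_PATTERN = re.compile(r"^\d{3}-\d{3}-\d{4}$")

def profile_phones(records):
    """Tally entry-quality issues for the 'phone' field."""
    issues = {"missing": 0, "dummy": 0, "malformed": 0}
    for rec in records:
        phone = (rec.get("phone") or "").strip()
        if not phone:
            issues["missing"] += 1
        elif phone in DUMMY_PHONES:
            issues["dummy"] += 1
        elif not PHONE_PATTERN.match(phone):
            issues["malformed"] += 1
    return issues

customers = [
    {"phone": "555-867-5309"},  # plausible entry
    {"phone": ""},              # missing
    {"phone": "123-456-7890"},  # classic dummy value
    {"phone": "5551234"},       # wrong format
]
print(profile_phones(customers))  # {'missing': 1, 'dummy': 1, 'malformed': 1}
```

Counts like these make it possible to express the problem as a percentage of records, which is the first step toward weighing it against the business impact.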
The cost of entry problems depends on how the data is used. If a phone number or email address is used only for informational purposes, then the cost of its absence is probably low. If instead a phone number is used for marketing and driving new sales, then the opportunity cost of gaps across a large percentage of records may be significant.
Addressing data quality at the source can be difficult. If data was sourced from a third party, there is usually little the organization can do. Likewise, applications that provide internal sources of data might be old and too expensive to modify. And there are few incentives for the clerks at the point of entry to obtain, verify and enter every data point.
Process quality issues usually occur systematically as data is moved through an organization. They may result from a system crash, a lost file or any other technical mishap in the integration between systems. These issues are often difficult to identify, especially if the data has undergone a number of transformations on the way to its destination. Process quality can usually be remedied easily once the source of the problem is identified. Proper checks and quality control at each touchpoint along the path can help ensure that problems are rooted out, but these checks are often absent in legacy processes.
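One simple form such a touchpoint check can take is a reconciliation of record counts and key checksums before and after a load step, so a dropped or duplicated record is caught immediately rather than discovered downstream. The sketch below is a minimal illustration under assumed field names, not a description of any particular tool.

```python
# Sketch of a control check between two pipeline touchpoints: compare the
# record count and an order-independent hash of key values on each side.
import hashlib

def batch_fingerprint(rows, key="id"):
    """Return (row count, XOR-combined hash of each row's key value)."""
    digest = 0
    for row in rows:
        h = hashlib.sha256(str(row[key]).encode("utf-8")).hexdigest()
        digest ^= int(h, 16)  # XOR makes the result order-independent
    return len(rows), digest

source = [{"id": 1}, {"id": 2}, {"id": 3}]
target = [{"id": 3}, {"id": 1}, {"id": 2}]   # same rows, different order
assert batch_fingerprint(source) == batch_fingerprint(target)

dropped = [{"id": 1}, {"id": 2}]             # simulate a lost record
assert batch_fingerprint(source) != batch_fingerprint(dropped)
print("touchpoint check passed")
```

Running a check like this at each hop makes the point of failure obvious, which is usually the hard part of a process quality problem.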
Identification quality problems result from a failure to recognize the relationship between two objects. For example, the same product entered under two different SKUs is incorrectly judged to be two distinct items.
Identification quality may have significant associated costs, such as mailing the same household more than once. Data quality processes can largely eliminate this problem by matching records, identifying duplicates and placing a confidence score on the similarity of records. Ambiguously scored records can be reviewed and judged by a data steward. Still, the results are never perfect, and determining the proper business rules for matching can involve trial and error.
My next column will continue this discussion.