Here are straight answers to some important questions about data cleansing. The questions came from students in my "Information Quality Assessment, Cleansing and Transformation" tutorial at a recent Data Warehousing Institute Conference.
Where do you cleanse the data?
If data is still being used from its source database(s), it should be cleansed at the source and then transformed to the data warehouse. Not doing so creates three problems:
- Processes performed using the defective data in the source databases will continue to fail and continue to incur business costs of the process failure and information scrap and rework.
- The uncorrected source data will have the potential to corrupt the data warehouse as changed data is propagated from the source over time.
- Reports from the source data and cleansed data warehouse data that should be equivalent will not be, causing confusion and lack of trust.
If you are not able to clean the data at the source, you should assess its quality (completeness and accuracy) to determine its reliability and also the need for information quality improvement. This assessment must address accuracy and not just validity. Validity is the degree to which data conforms to the business rules it is a valid value, within the right range, related to a valid referenced object, etc. Accuracy means the valid value is a "correct" value. Data often has a high incidence of valid values especially default values which are valid but not correct.
You must also measure and post the quality (completeness, validity and accuracy) of the data warehouse data. Knowledge workers who use it must know its reliability so that they can factor it into their decisions.
Finally, if you cannot correct data at the source, you should direct all processes and queries using that data to access it from the operational data store or data warehouse where it has been corrected. Make the corrected data the new record of reference.
How do you know when to stop cleansing data?
Crosby asserts there is no reason for defects in any product or service, and I believe this is true for information quality as well. However, we cannot go back in time to improve the processes that have left us the legacy of defective data. So, how much do you invest in information scrap and rework to correct defective data? The information customers must answer the prioritization question. Who will use the information? For what purposes? What are the costs and risks of process failure or decision failure due to nonquality data? The costs include:
- Money and materials wasted in process failure;
- Legal liabilities, imitation and reputation;
- Customer alienation and lost customer lifetime value; and
- Lost and missed opportunity.
This drives the requirement for what data needs correction and how much effort should be expended.
Information quality has less to do with zero-defect data than it does with consistently meeting knowledge-worker and end-customer expectations in their information products and services. Knowledge-worker customers of the data warehouse can generally live with some degree of omission and error if they know its extent. This requires quality assessment. Data without this assessment requires a label stating "unaudited data," indicating there has been no quality assessment to assure the level of reliability.
Meet with representative knowledge workers and prioritize the attributes in the data warehouse into A, B and C priorities:
- A-priority data is close to zero defect (e.g., error or omission could have high cost of failure). For example, the misspelling of customer name could cost loss of the customer; inaccurate location of oil well pipelines could cause drilling through an adjacent well, rupturing it and blowing up a $500 million oil platform with probable loss of life.
- B-priority data is important second priority data.
- C-priority data is optional or non-critical data where the cost of omission and error is marginal.
You must address cleansing of all A-priority data. Then address B-priority data as resources allow and as directed by an information steering team.
Where is the line between automated and human cleansing?
Automated data correction can be performed to achieve data validity. You can take known values in reference databases, such as postal codes and cities for an address. However, you can only test electronically whether an address is valid. Some address cleansing software can apply address corrections and even apply postal service change-of-address data. But some people move without forwarding information. It requires human confirmation that an address for a specific person is correct.
Human correction is required to measure, confirm and correct most data to accurate values. For example, to assure that the physical dimensions of a tangible product are accurate requires human measurement. Begin with a small sample of 50 to 100 occurrences of the A and B-priority data and measure it for accuracy to get an idea of the extent of any accuracy problems. Techniques such as mail surveys or telephone contacts may be required to assure and correct personal data that is important to the organization.
How do you deal with data that has multiple identifiers?
Matching of data redundantly stored in disparately defined databases is one of the painful data cleansing problems. When many redundant files containing records about a single entity and those records have embedded meaning or non-synchronously defined keys, then you should first seek to consolidate any duplicate records within a single file or database. Keep a cross-reference table to relate the surviving "occurrence of record" to the records that previously existed but no longer do. This is used to redirect any business transaction using "old" identifiers to the occurrence of record. Also maintain an audit file with before and after images of the data to assure you can reconstruct the original records in the event of a misconsolidation.
Now de-duplicate and consolidate the records within all the other redundant files required to load the data warehouse. Once those files contain a single occurrence of record, match and consolidate them across the redundant files selecting the most reliable values for propagation to the data warehouse. Correct and synchronize data values at each source for consistency to the extent possible. For example, use full names as opposed to only initials. This is required to maintain "equivalence" of data in these redundant files. Maintain your cross-reference table of related occurrences, again so you can maintain consistency across the files.
How do you manage cleansing of external data?
If you buy external data (demographic, profile, census, geospatial, financial, etc.) to enhance your data warehouse, you have every right to ask for a written warranty of reliability of the data. There should be remedies for nonquality data in excess of their warranty and your acceptance. Audit the data that you buy if possible.
For nonpurchased data, set expectations for quality and develop a service level agreement. Measure and report the reliability level. Provide beneficial feedback to the source information providers in exchange for their improving the quality of the data you need from them.
What do you think? Send your comments to Larry.English@ infoimpact.com.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access