Data profiling does not improve data quality directly. It provides a better understanding of the challenges you face in delivering information that meets business expectations. To improve data quality, you must act on the information.
Symptoms are the visible results provided by data profiling work. These represent the errors found in the data sources; you must decide how to deal with these.
When you encounter a data quality deficiency, there are four potential actions you can take: reject, ignore, fix or substitute.
Reject. Rejection consists of dealing with data that is so bad that you cannot use it. For example, if you encounter nonnumeric data in a field reserved for the sales amount, you cannot process the record element (or entire record). When this happens, you need to capture the errors and build a process to enable correction of the data before forwarding it to the data warehouse. Sometimes, when accuracy is more important than completeness, the decision may be made to simply move forward with other data.
Ignore. The next option is to ignore the error. In your data profiling, you may discover some data deficiencies. Your business rules, however, may be such that these deficiencies do not materially impact your ability to make strategic decisions with the data warehouse. In this case, you may simply capture the data, knowing that it is not perfect. You should also provide information in the metadata to ensure that business users understand the level of quality in the data.
Fix. Fixing the data is sometimes an option. It is possible that data elsewhere in the environment can be used to derive the correct value. For example, if you have a ZIP code and the state is invalid, you could use the ZIP code to derive the correct state.
Substitute. Substitution is similar to fixing in that you are inserting a value that enables you to move forward. In the case of substitution, however, you may not know the correct value and may be able to substitute a value of "unknown" to have a value in each field. When a substitution is performed, the downstream impact needs to be considered so that the business community understands that "unknown" is a valid value.
When you fix the data or substitute a value, you have two additional options. First, you may choose to record both the original value and the adjusted value in the data warehouse. This is sometimes needed to provide an audit trail and to provide information in defense of a business decision made with the incorrect data. Second, you may be able to generate a transaction that will update information in the source data and correct it. While that may appear to solve the problem, it really doesn't - you are still dealing with the symptom. You've simply built a process to ensure that the error doesn't impact the data warehouse. To truly solve the problem, you must understand the root causes so that you can prevent the symptom from occurring in the future.
The root cause is uncovered by repeatedly asking why the problem occurs. With a flat tire, when you initially ask why, you may learn that the tire had a bald spot. Asking why again may reveal that the bald spot occurred because the car was out of alignment. Further investigation may reveal that the driver was not aware of how to maintain tires, an activity that includes having proper air pressure and periodically performing visual inspections. Once you identify this, then you can take action (learn how to maintain tires) that will prevent a similar problem from occurring in the future.
Root cause analysis is often done during data profiling. When you uncover significant data problems, you need to perform additional analysis to understand the causes. Instinctively, you may conclude that the source program is deficient. For example, maybe the data validation is not thorough or there are some program coding errors. When you've come to a conclusion, you should document the problem and ensure that a change request is submitted and appropriately prioritized to fix the source system.
Very often, the problem is external to the source system. For example, you may discover that 20 percent of your population was born on the same date. In this case, the source system checks each date individually and lets the data through because the date is valid. It cannot detect that when the date is missing or unclear, a data entry person inserts any date (e.g., 11/11/11) into the field because it's a mandatory field. Similarly, when you make a sale, the salesperson provides information that may be inaccurate (but good enough to pass reasonable validation steps) about a customer. In these cases, your business rules are such that people don't have the training or incentives to capture correct data. Fixing the source system won't solve these problems. If the root cause is in the business process, the corrections must be implemented there.
Data profiling makes data quality issues visible. When it comes to loading the data warehouse, you have two challenges. First, you must find a way to load the required data by taking one of four actions - reject the data (if accuracy is more critical than completeness), accept the data (if it is within your tolerance levels), fix the data (if you have information to do so) or substitute a value (when that would meet your business needs).
The second challenge is the one with the greater payback; understand why the problems occurred by performing analysis so that you can take actions that substantially address the root causes.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access