The continuous improvement process I described in last month's column provides a way to proactively manage data quality. Proactive data quality management begins with a plan. In the case of operational systems, that would apply analysis and design activities to develop operational systems that provide quality data and measure the actual quality. In the next step, do, the systems are executed and used in the operational environment to support transaction processing. The output of the do step is the data that resides in the operational data stores, which is captured for use in the data warehouse. The challenge often facing data warehouse teams is that the quality of the actual data is unknown. The warehouse team is then forced to enter the continuous improvement process at the check step to perform data profiling.
Importance of Data Profiling
Data profiling consists of investigating data being captured for the data warehouse so that its quality (including true meaning) is understood. This is a critical data warehouse development step, and temptations to skip it should be avoided.
- Knowledge of the source data quality is useful for refining the extract, transform and load (ETL) development plan. It helps the project manager anticipate the amount of work required so the plan can be adjusted to reflect reality.
- The source data quality impacts the work that needs to be performed during the ETL process. For example, if a phone number has inconsistent formats, the ETL process needs to handle each of the formats, and rules need to be developed for transforming the phone number to a common format.
- Knowledge of the source data quality avoids surprises during the ETL process. Unexpected conditions often require rework. For example, if null conditions aren't properly anticipated for a required field, when the error is encountered, the ETL code will need to be adjusted. The adjustments may also require rework in upstream or downstream programs as well as a repeat of the testing for any completed programs.
- An understanding of the actual quality defects helps to justify expenditures to improve the data. Anecdotal information is usually not enough. When people can point to data deficiencies and show how these deficiencies impact the company, the funding needed to improve the situation is easier to justify.
- Ultimately, the business users have quality expectations. The data warehouse team can build excellent ETL processes, but it will not be in a position to provide any assurance about the data quality unless the input data is well understood. If this happens, the data warehouse will not achieve its objectives and the credibility of the data warehouse will suffer.
- Improvement requirements in the operational system environment can be identified based on discoveries made during data profiling.
Individual Element Quality
There are several dimensions of quality that need to be evaluated for each data element. These include accuracy, completeness, uniqueness, structure, timeliness and value distribution. The first three of these are described here. A quick assessment can be made by looking at each element and identifying the number of distinct values, the highest or lowest 10 values, the percent null, the percent 0, the percent blank, the number of masks, etc. Most data profiling tools provide functionality to perform this task quickly.
Accuracy addresses the correctness of the data. By examining the results of the quick assessment, I can quickly see if I have numeric fields where they don't belong, negative values for a field that should only have positive values, future dates for past events and other data that is obviously incorrect. While I can spot some accuracy problems quickly, others may require verification against known values. For example, for an address, I can electronically validate postal codes against cities and states, but I may need to manually verify the correctness of the actual street number.
Domain conformance addresses an aspect of accuracy. For example, if we are using the standard state abbreviation as the valid value, I would not expect to find more than 50 states in the U.S. Different ways of storing a state (e.g., CA, CAL, California) are typically not a problem for the source system. Within the data warehouse environment, however, the state's data would be divided into three groups. Through data profiling, I would recognize these conditions and insert ETL code to standardize the representation of the state.
Completeness addresses whether or not the field has a value. This is quickly detected by looking at the percent null. Additionally, an examination of frequently occurring values (e.g., 1/1/1900 as a date) may indicate that the operational system inserted a default value for a null condition. Understanding the completeness of fields that may be used as a selection criterion in the data warehouse is crucial to ensure that when a user looks at data segmented by that criterion, all of the expected data is provided.
Uniqueness addresses the validity of the keys. Using the quick assessment information just mentioned, I can quickly compare the number of records with distinct instances of the primary key to detect any problems.
The business data model documents business rules that impact the data through its relationships. Referential integrity is not always enforced within our source data stores. With the business model, the referential integrity rules are known, and these can be verified as part of the data profiling process.
Other relationships also exist within the data, and these can be examined as well. For example, within the lifecycle of an order, there is a relationship between the order date, the sale date and the shipment date. Some elements may depend on the existence of other elements (e.g., an element for weight requires a second element with the weight unit of measure), and some elements may preclude the existence of other elements (e.g., job classification for an employee cannot exist if the retirement date is populated).
Data profiling is a critical step for ensuring the quality of the data in the data warehouse. Data profiling does not actually fix the data problems. It identifies and quantifies the problems. Actions must then be taken to actually correct the problems. I will explore corrective actions in my next column.
Many aspects of quality can be examined during data profiling, and this column barely scratches the surface of the areas to be examined. Numerous data profiling tools are available, and these tools provide standard features that can analyze various aspects of quality. The tools often include a repository for storing the results of the data profiling activities.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access