"Measure twice, cut once" is a well-known adage, and it applies to a variety of situations. In carpentry, for example, once a board is cut too short, subsequent cuts won't help much. As a carpenter friend says, "I've cut it three times and it's still too short. Now where did I leave my board stretcher?" As it turns out, the adage applies to data warehousing, too.
A data warehouse's viability depends on the quality of its content. For data warehousing, "measuring" the input data (gauging its quality) before it is "cut" (loaded into the data warehouse) is both crucial and frequently difficult.
The primary difficulty in measuring data is that, being intangible, it has more dimensions to measure than simply length, width and height. For example: How many records were read? How many were integrated and transformed? How many had errors and were rejected? What types of errors were encountered? How many errors were corrected and then integrated? The list goes on. What's more, the integration/transformation layer can complicate measurement by distorting these dimensions.
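The measurement dimensions above can be sketched as a simple audit record. This is a minimal illustration, not a prescribed design; all class and field names here are assumptions for the sake of the example.

```python
from dataclasses import dataclass, field

@dataclass
class LoadAudit:
    """Counts captured for one load cycle (illustrative field names)."""
    records_read: int = 0
    records_loaded: int = 0
    records_rejected: int = 0
    records_corrected: int = 0
    errors_by_type: dict = field(default_factory=dict)

    def balances(self) -> bool:
        # Basic reconciliation check: every record read should end up
        # either loaded into the warehouse or rejected.
        return self.records_read == self.records_loaded + self.records_rejected

audit = LoadAudit(records_read=1000, records_loaded=980, records_rejected=20,
                  records_corrected=15,
                  errors_by_type={"missing_key": 12, "bad_date": 8})
print(audit.balances())  # True
```

Even a record this simple makes the reconciliation question answerable per load cycle rather than reconstructed by hand afterward.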
Enormous amounts of time, effort and anguish are spent trying to reconcile warehouse data back to the operational sources. Most companies rely on laborious manual processes for this reconciliation. Even then, the result is adequate at best and does not give the business community a real understanding of what is happening in their decision support environment.
This is our first of three discussions dealing with this difficult problem. This column will describe the need for and the objectives of an audits and quality control program. Next month, we will describe the audits process, and the third column will cover the quality control process.
The solution to the problem is similar to my carpenter friend's story: we need to measure twice before we cut the data for our environment. In other words, we need to implement strategies that screen or audit the content before it gets loaded and then continue to monitor it after it is loaded. This overall process is the audits and quality control program that will be described in this column.
The first part of the program consists of the process audits that must be in place. These measure the status and results of the integration/transformation processing. These will be the topic of the next column.
The second part of the program must address the quality of the data loaded into the warehouse. In other words, how complete and accurate is the data resulting from the integration/transformation layer? To answer this, quality controls must be established to measure the quality and content of the data in your decision support environment. This will be the topic of the third column.
Audits and Quality Control Program
First, let's reiterate the observation that the data warehouse environment is not an auditable database in the strict accounting sense. However, the data warehouse must be statistically accurate for decision support and strategic analysis. This means the data in the warehouse must have a proven and dependable level of accuracy, consistency and integrity for its intended business usage.
Figure 1: Process Activities
Therefore, the minimal objectives of an audits and quality control program are to:
- Validate the accuracy and completeness of the source data loaded into the data warehouse.
- Maintain a proactive approach for identifying audit and quality control problems at the source as early as possible.
- Establish the meta data repository as an active systems component to aid in measuring and reconciling the load processes.
- Provide the infrastructure to record and monitor the load processes as well as the quality of data over time.
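To make the third and fourth objectives concrete, the meta data repository can act as an append-only log of load-cycle measurements. The sketch below uses SQLite purely for illustration; the table and column names are assumptions, not part of any particular repository product.

```python
import sqlite3

def record_load_cycle(conn, source, records_read, records_loaded, records_rejected):
    """Append one load cycle's audit counts to the repository log."""
    conn.execute("""CREATE TABLE IF NOT EXISTS load_audit (
        cycle_ts        TEXT DEFAULT CURRENT_TIMESTAMP,
        source          TEXT,
        records_read    INTEGER,
        records_loaded  INTEGER,
        records_rejected INTEGER)""")
    conn.execute(
        "INSERT INTO load_audit (source, records_read, records_loaded, records_rejected) "
        "VALUES (?, ?, ?, ?)",
        (source, records_read, records_loaded, records_rejected))
    conn.commit()

conn = sqlite3.connect(":memory:")
record_load_cycle(conn, "orders", 1000, 980, 20)
rows = conn.execute("SELECT source, records_read FROM load_audit").fetchall()
```

Because each cycle is a row rather than an overwrite, quality can be monitored over time, which is exactly what manual reconciliation fails to provide.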
A comprehensive audits and quality control program consists of four major components:
- Systems architecture - to identify where audits and controls should be applied so the appropriate processes can be implemented.
- Meta data repository - to define, capture and track the audits and control measurements. The integration/transformation code will read and write to the repository.
- Methodology - to identify the analysis, design, construction, testing and ongoing operational activities involved in the audits and controls program.
- Resources - a successful program requires resources from the business units as well as the technology organization.
Methodology: The Activities
There are four high-level activities used to develop and operate the audits and controls program. The same activities should be used to audit the processes as well as to measure the quality of data attributes. The detailed tasks and scope will vary for the processes versus the data attributes. The four high-level activities are:
- Define the process steps and data attributes that must be audited and measured.
- Assess the existing capability to measure the processes and quality of the attributes.
- Measure the processes and quality, recording the results in the meta data repository and reporting on them each load cycle.
- Correct the processes or the data and then reprocess through the measurement functions.
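The measure-and-correct activities above form a loop: measure the data against defined rules, correct what fails, then reprocess the corrected records through the same measurement functions. A minimal sketch, assuming simple per-record rules (the rule and function names are illustrative):

```python
def measure(batch, rules):
    """Return the records that fail any of the defined quality rules."""
    return [r for r in batch if not all(rule(r) for rule in rules.values())]

def run_cycle(batch, rules, correct):
    failures = measure(batch, rules)            # Measure
    corrected = [correct(r) for r in failures]  # Correct
    # Reprocess corrected records through the same measurement functions.
    still_failing = measure(corrected, rules)
    return failures, still_failing

rules = {"amount_positive": lambda r: r["amount"] > 0}
batch = [{"amount": 10}, {"amount": -5}]
failed, still = run_cycle(batch, rules,
                          lambda r: {**r, "amount": abs(r["amount"])})
# One record fails initially; after correction, none still fail.
```

The key design point is that corrections are never trusted on their own; they must pass back through the very checks that flagged them.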
The audits and quality controls program can be justified based upon the labor reduction and value to the business community.
Maintaining the quality of the input data for a data warehouse is challenging. But, if you understand the potential problems and design solutions for them within your audits and quality control program, then you'll help to maximize the return on your data warehouse investment.
Next month, we will describe the process audits function in detail. This important process should be built into your integration/transformation layer and will greatly improve the reconciliation process.