In the late 1990s, we were hired by a client to lead a data quality initiative on a large project that involved migrating data from legacy systems to an open architecture. In developing the program for our client, we immersed ourselves in a tremendous amount of literature from respected authors on principles of data quality. What we discovered is that a great deal of the data quality work we reviewed focused on principles and theory but offered little in the way of implementation guidance.
As a result, we struggled with exactly how to design and implement a practical data quality program for our client. What our client required was a straightforward methodology for implementing, practicing and promoting data quality. Our task was to develop a methodology that was repeatable with well-defined objectives and deliverables. The result was the data quality cycle 1.0.
Over the last 10 years, we have refined the data quality cycle (DQC), which is now in its second release (DQC 2.0). In this article, we will describe the end-to-end data quality process we have developed, which can be implemented in virtually any environment. We hope you will gain valuable insight into developing an intelligent, standardized data quality strategy that you can implement successfully at your company.
Figure 1: The Data Quality Cycle
Figure 1 represents the basic DQC. In later sections, we will examine individual processes and decision points within the cycle. However, at its most granular level, the DQC 2.0 consists of four main phases: discovery, definition, remediation and prevention. As with any quality effort, quality improvement is a continuous effort. For this reason, phases within DQC 2.0 are iterative and represent a cycle rather than an endpoint.
The DQC begins with the discovery phase, which is illustrated in Figure 2.
Figure 2: Discovery Phase
The first step in the discovery phase is to identify a candidate data quality problem. If there are several pain points, this step may require prioritization. Not all data quality errors are equally important. The business should decide which problems have a direct impact on the strategic or tactical goals of the organization. If, for example, customer satisfaction is an important business driver, then quality customer data is an obvious prerequisite. In this case, customer data quality problems may be ranked high on the list as opposed to data quality problems with secondary business drivers.
The second step in this phase is to estimate the cost of the problem. The third step is to estimate the cost of the solution. Estimating the costs of a data quality problem and its remediation are extremely important steps; however, they are frequently neglected.
The reason it is imperative to estimate both the cost of the problem and the remediation is that there must be ample justification for every data quality initiative. While this is only an estimate, it will provide cost justification for proceeding to a thorough assessment and deploying a remediation strategy. This will help "sell" the data quality initiative to senior and executive management and engender the necessary support.
Once you have estimated the cost of both the problem and remediation, the next step is to establish the objectives of the data quality initiative. There must be clear and concise objectives that can be validated to ensure the strategy is effective.
Following the discovery phase is the definition phase. The process flow for this phase is illustrated in Figure 3.
Figure 3: Definition Phase
The definition phase is where the heavy lifting begins. The first step is to define the measurement criteria. Regardless of whether you are verifying product quality or data quality, measurement is essential. This step should answer the fundamental question: what are we going to measure? This will include measurements of data type conformance, syntax, completeness, precision, validity, accessibility, timeliness, etc.
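The kinds of checks listed above can be expressed as simple rules applied to each record. The following is a minimal sketch in Python; the field names and validation rules are hypothetical assumptions for illustration, not part of the DQC itself.

```python
import re

# Hypothetical measurement criteria for customer records.
# Each rule returns True when the value passes the check.
RULES = {
    "customer_id": lambda v: v is not None and str(v).isdigit(),   # type conformance
    "email": lambda v: v is not None and
             re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", str(v)) is not None,  # syntax
    "name": lambda v: v is not None and str(v).strip() != "",      # completeness
}

def measure(records):
    """Return the defect rate per field across all records."""
    totals = {field: 0 for field in RULES}
    for rec in records:
        for field, rule in RULES.items():
            if not rule(rec.get(field)):
                totals[field] += 1
    n = len(records) or 1
    return {field: defects / n for field, defects in totals.items()}

sample = [
    {"customer_id": "1001", "email": "a@example.com", "name": "Ada"},
    {"customer_id": None, "email": "not-an-email", "name": ""},
]
print(measure(sample))  # defect rate per field
```

In practice, rules like these would be drawn from the measurement criteria the team agrees on, and applied consistently across every system being assessed.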
With the measurement criteria firmly established, it is time to develop your assessment plan. This will include not only what will be measured, but also where and how the measurements will be applied. You may be required to assess the data in each of the systems that contain it in order to get an accurate picture of its quality. Depending on the system and the business processes that impact the data, you can define how the metrics and measurements will be applied.
After the plan and measurement criteria are defined, it is time to assemble the team. This should include highly skilled workers who are well-versed in quality measurement techniques and methods, as well as business knowledge experts familiar with both the processes and the data. The team must be given a charter, mandated by senior management, to assess defects; this alleviates push-back from data producers and consumers who may feel as though their data is being scrutinized.
The next step is to measure the defects and to determine the true cost of the defects that are discovered. It may be that, once the analysis is complete, the actual costs differ significantly from the estimated costs arrived at in the discovery phase. What may have been considered a high-cost (and therefore, probably high-priority) data quality objective may not be as costly as believed - or estimated low-cost data quality issues may turn out to be significant.
At this point, you will compare the actual versus the estimated costs. The result of this comparison, and the ability to tie these costs back to your primary data quality objectives, allows you to manage your data quality remediation efforts and direct your limited resources where they will deliver the greatest benefit.
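The comparison and re-ranking described above can be sketched in a few lines. The issue names and dollar figures below are purely illustrative assumptions.

```python
# Hypothetical comparison of discovery-phase estimates against measured costs,
# used to re-rank remediation candidates once the assessment is complete.
issues = [
    {"name": "duplicate customers", "estimated_cost": 250_000, "actual_cost": 90_000},
    {"name": "invalid addresses", "estimated_cost": 40_000, "actual_cost": 180_000},
]

# Variance shows how far each discovery-phase estimate was off.
for issue in issues:
    issue["variance"] = issue["actual_cost"] - issue["estimated_cost"]

# Re-prioritize by measured cost, not the original estimate.
ranked = sorted(issues, key=lambda i: i["actual_cost"], reverse=True)
for issue in ranked:
    print(f'{issue["name"]}: actual {issue["actual_cost"]:,}, '
          f'variance {issue["variance"]:+,}')
```

Note how the re-ranking can invert the discovery-phase priorities: here the issue estimated at $40,000 turns out to be the costlier one.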
It is important to once again answer the fundamental question: is the cost of the problem worth pursuing? If the costs are justified, then proceed to the remediation phase. However, if the costs are less than expected and not worth correcting, return to the previous phase to identify other data quality problems that are truly worth pursuing.
Figure 4 illustrates the foundational processes within the remediation phase:
Figure 4: Remediation Phase
In the remediation phase, the decision has already been made to correct the data quality problem. The costs of producing bad data have been evaluated and determined to be unsatisfactory. Two branches of activity occur in this phase: pure remediation (cleansing/scrubbing/cleaning) activities, and establishment of the groundwork for the upcoming prevention phase.
In the pure remediation portion of this phase, the first step is to determine the extent of the problem, which will inform the clean-up strategy that you develop. There is a caveat, however; not all problems can be corrected - or should be.
Depending on the nature of the data with quality issues, you may be prevented from changing any of it - even if the data is obviously incorrect. For example, much of a company's financial data cannot be altered once it is recorded in the ledger. Also be wary of correcting data that is downstream from the source - if you correct bad data in your data warehouse today, what is going to stop it from reappearing tomorrow? Make sure you follow the information flow from producer to final consumer, and correct the bad data everywhere it occurs, starting with the source.
On the prevention-preparation side of this stage, the work may be daunting. This is where having a workable plan is a must! First, identify the complete information lifecycle. It is important to identify both the producers of the information as well as the customers (downstream consumers of the data that is produced).
In order to correct a data production problem, you will need to obtain or create business process models and information models of the data that is being captured and processed. These models are independent of the actual systems processes and data models that define the mechanics of the system. Rather, these models are technology independent and serve to point out where processes are not functioning correctly or are inconsistent between departments, systems or divisions. Information models ensure that these same subdivisions have a consistent understanding of the data that is being collected and shared.
By doing this analysis, you can determine who in your company is creating, updating and deleting data through the use of a create, read, update and delete (CRUD) matrix. The planned remediation of each major information component should be mapped against the business processes, departments and divisions that access it in any way. As you work through this process, it should become clear whether the inconsistency in the data is being caused by duplication of function or because responsibility for critical business information is splintered among different parts of the organization.
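A CRUD matrix can be represented very simply, as the sketch below shows. The business processes, entities and access rights are hypothetical; the point is that querying the matrix surfaces duplicated responsibility, such as two processes both creating the same entity.

```python
# A minimal CRUD matrix sketch: which business process touches which
# information component, and how (C=create, R=read, U=update, D=delete).
crud = {
    ("Order Entry", "Customer"): "CRU",
    ("Billing", "Customer"): "RU",
    ("Marketing", "Customer"): "CRU",  # a second creator: candidate root cause
    ("Order Entry", "Product"): "R",
}

def creators(entity):
    """Processes that create a given entity; more than one suggests
    duplicated function or splintered responsibility."""
    return sorted(proc for (proc, ent), ops in crud.items()
                  if ent == entity and "C" in ops)

print(creators("Customer"))  # ['Marketing', 'Order Entry']
```

In a real engagement, the matrix would typically live in a spreadsheet or modeling tool rather than code, but the analysis is the same: look for entities with multiple creators or no clear owner.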
Once the systematic and organizational root causes for the data quality problems have been identified, you can move on to preventing them from occurring, which is described in the next phase.
The fourth phase of the DQC is the prevention phase, which is illustrated in Figure 5.
Figure 5: Prevention Phase
The prevention phase is the thrust of the DQC. Although prevention is essential to ensuring quality data, it is often overlooked. Far too often, data quality efforts stop at remediation, which fails to fix the cause of the bad data and is therefore of limited value. The real value of the DQC comes when errors are prevented from occurring.
Preventative measures require strong data governance. If a data governance program does not exist, then one must be defined and implemented. Data governance requires support from executive management and clearly defined roles and responsibilities for the organization. By defining the roles, responsibilities and other governance and administrative structures that guide data creation, storage, usage and retention, a set of consistent, cohesive data policies can be created. These policies are core to your prevention strategy because they define the broad strategies that should be followed with data to accomplish the stated goals.
With a governance model in place, clearly defined roles and responsibilities, and policies, it is time to define the prevention strategy. Inputs into this task are the business and information models and root-cause assessment. Once a prevention strategy has been worked out that makes use of the policies, you must once again determine the cost of implementing the strategy in order to prevent data quality issues. Can the proposed measures be cost justified? What is the impact to the organization, and what is the expected benefit from reduced or eliminated data quality issues? If the prevention program can be justified, it should be implemented. If not, the analysis should be retained, and the tasks of discovery/definition should recommence. Either way, identifying new candidate data quality problems to analyze and resolve is the next step, and then begin the DQC again.
It is nearly impossible to pick up a newspaper without reading a story about a company struggling to fix a significant data quality failure. In speaking at conferences and working with clients in the area of data management, we have learned that most organizations recognize that data quality is important. What they grapple with is how to proceed with a data quality initiative that will provide value.
The success of any data quality effort is predicated on having a well-defined, repeatable process with consistent, value-added deliverables. With the "what" and "why" of data quality fairly well defined, the DQC 2.0 addresses the "how." We hope that you will be able to leverage the information we have provided here to develop an intelligent data quality strategy in your organization.