Poor data quality often has a greater effect on companies than they realize. In most cases, the costs associated with poor data quality are not tracked separately but are simply subsumed into the overall cost of doing business. Yet real costs are associated with nonconforming data – and they add up. A business can eliminate much of this cost by instituting a data quality improvement program as a core component of its business intelligence strategy.
To illustrate this concept, let's explore three case studies. Each shows a different aspect of the effects of poor data quality. The first example demonstrates a socioeconomic impact; the second shows a purely economic impact; and the third indicates how failing to use data quality techniques can open the door for fraud.
Case 1: The Presidential Election
The controversy surrounding the 2000 Presidential election and the Florida recount shows the profound business effect associated with information of questionable quality. The lack of a clear winner, directly resulting from poor data quality, immediately led to a drop in stock prices during the days after the election. At least four data quality issues are evident from the Florida election:
- Voter confusion because of poor data presentation – Use of the butterfly ballot in Palm Beach County is an example of poor information presentation that led to many errors.
- Accepting data as valid before it is verified – The use of punch cards led to questions about the accuracy of the vote count. Automated tabulation of improperly punched ballot cards, where the chad was not completely ejected, produced questionable tallies.
- Confusion based on conflicting data sources – Conflicting data sources and data quality errors in Voter News Service tallies affected the results of the decision process, leading the VNS to flip-flop twice about its predicted winner for the state.
- Built-in margin of error – Title IX, Chapter 102, of Florida law implicitly acknowledges an expected margin of error by mandating an automatic recount when the margin of victory is within one-half of 1 percent of the votes cast. The automatic recount is a good example of defined governance associated with a data quality problem.
Case 2: A Supply Chain Debacle
This past February, a war of words erupted between shoe and apparel manufacturer Nike Inc. and i2 Technologies, the software developer that provided Nike with a new demand and supply inventory system. Nike cited ordering errors that led to expensive manufacturing disruptions during deployment of the new system.
For example, some shoe orders were placed twice, once each in the old and new systems, while the new system allowed other orders to fall through the cracks. This resulted in overproduction of some models and underproduction of others. Nike was even forced to make some shoes at the last minute and ship them via air to meet buyers' deadlines.
Ultimately, Nike blamed these system problems for an $80 million to $100 million shortfall in third-quarter sales that caused the company to miss earnings estimates by as much as 13 cents per share. The day Nike made this announcement, its stock price dropped roughly 21 percent, from $49.17 to $38.80. For its part, i2's senior management claimed that its software was not responsible for Nike's shortfalls.
Case 3: CD Fraud
In November 1999, a New Jersey man admitted to a scam that fooled computer fraud detection programs at two music-by-mail clubs. The man used 1,630 aliases to buy compact discs at special introductory rates and then sold the CDs at flea markets for a 400 percent markup. He was able to fool the companies by adding fictitious apartment numbers or unneeded direction abbreviations to his addresses and extra punctuation marks to his names and addresses. Taking advantage of the mail-order companies' inability to filter out bad data, the man was able to perpetrate a fraud totaling more than $250,000.
In all three cases, the damage could have been avoided if the organizations involved had taken the time to evaluate the detrimental effects of poor data quality. Typically, such evaluations consist of little more than recounted anecdotes, hazy feelings about a difficult implementation or tales of customer dissatisfaction. Having a concrete way to measure the cost of poor data quality allows a company to determine the extent to which bad information affects its bottom line. This process also highlights opportunities to improve customer relations, optimize the production stream and enhance employee satisfaction by analyzing and improving data quality.
Building a Data Quality Improvement Program
The first step in initiating a data quality improvement program is to assess current data quality. This identifies the areas with the greatest need for improvement and establishes a baseline against which further improvement can be measured. The current state assessment requires data profiling and data flow analysis; the results of these processes are reviewed in the context of how well data conforms to expectations as it flows through the organization.
Data profiling is a detailed analysis of available information. Typically data profiling will unearth metadata about the data being investigated. This is done through simple analytics such as the range of values, inferred data types, maximum and minimum values, and frequency distributions. Profiling also focuses on more detailed analysis, such as cross-column and cross-table relationships. This process of discovery can expose business and data quality rules embedded within the data.
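As a concrete illustration of this kind of profiling, the following sketch uses Python and the pandas library to gather simple column-level statistics of the sort described above; the input file name and the columns it implies are hypothetical placeholders, and cross-column or cross-table analysis would build on the same idea.

```python
# Minimal data profiling sketch using pandas. The input file is a
# hypothetical placeholder; any tabular data set could be substituted.
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Collect simple column-level metadata: inferred type, null count,
    distinct values, min/max and the most frequent value."""
    rows = []
    for col in df.columns:
        series = df[col]
        is_ordered = series.dtype.kind in "ifM"  # numeric or datetime columns
        mode = series.mode()
        rows.append({
            "column": col,
            "inferred_type": str(series.dtype),
            "null_count": int(series.isna().sum()),
            "distinct_values": int(series.nunique()),
            "min": series.min() if is_ordered else None,
            "max": series.max() if is_ordered else None,
            "most_frequent": mode.iloc[0] if not mode.empty else None,
        })
    return pd.DataFrame(rows)

if __name__ == "__main__":
    customers = pd.read_csv("customers.csv")  # hypothetical input
    print(profile(customers))
```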
Data flow analysis is the process of mapping how information flows throughout the organization. Because information is used to fuel transaction processing as well as analytic processing, data instances may be modified as they move from one processing stage to another. A map of organizational data flow allows the organization to isolate the processing stages that contribute to poor data quality.
Given a set of discovered data quality rules and a map of the data flows, the organization can measure how well a data set conforms to user expectations at any point within the system. More importantly, this establishes the framework for the next step: evaluating the costs associated with poor data quality.
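To make the conformance measurement concrete, here is one way it might be expressed in code; the rules, field names and sample records below are illustrative assumptions rather than part of any particular methodology or product.

```python
# Hypothetical sketch: score how well records conform to discovered data
# quality rules at a given stage of the data flow.
from typing import Callable, Dict, List

Rule = Callable[[dict], bool]  # a rule returns True when a record conforms

RULES: Dict[str, Rule] = {
    "zip_code_present": lambda r: bool(r.get("zip")),
    "order_total_non_negative": lambda r: r.get("order_total", 0) >= 0,
}

def conformance(records: List[dict]) -> Dict[str, float]:
    """Return, for each rule, the fraction of records that conform."""
    scores = {}
    for name, rule in RULES.items():
        passed = sum(1 for record in records if rule(record))
        scores[name] = passed / len(records) if records else 1.0
    return scores

# Measuring the same rules at successive processing stages shows where
# conformance degrades as data moves through the information chain.
staged_records = [
    {"zip": "07030", "order_total": 120.0},
    {"zip": "", "order_total": -5.0},
]
print(conformance(staged_records))
# -> {'zip_code_present': 0.5, 'order_total_non_negative': 0.5}
```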
The Data Quality Scorecard
Assessment results can be used to build an economic model that evaluates the costs associated with instituting improvements. This model can be viewed as a scorecard that documents data quality levels associated with a set of data quality dimensions measured at specific locations in the information chain. Here are the steps involved in building a data quality scorecard to summarize the overall cost associated with low data quality and help identify the best opportunities for improvement:
- Map the information chain and data flows to understand how information moves across your organization. Using this map, you can locate the sources of any potential problems.
- Identify data flow. Determine the data your system uses and the processing stages through which it passes. Detail the record or message structure so that you can directly associate any error conditions with the specific data set in which the error occurs.
- Interview employees to assess the internal impact of flawed data. Sum the time all employees devote to data quality issues at each stage in the information chain.
- Interview customers. To determine the impact of decreased customer revenue, talk to current and former customers to understand the reasons for any decrease in business or attrition.
- Isolate flawed data. Annotate the information map with results of your interviews. Note the source of any data flaw at each processing stage, along with a list of the activities to which you can attribute those flaws.
- Identify the impact domain. Attribute the flaws and activities to impact domains by building a matrix that classifies each data quality problem. The first axis identifies the problem and its location in the information flow; the second represents the activities associated with each problem; and the third denotes the impacts. In each cell in this matrix, insert the estimated cost associated with that impact. If no estimate can be made, at least indicate the impact's order of magnitude.
- Aggregate the total. Superimpose the matrix onto a spreadsheet and build an aggregation model in which the costs can be summarized in different ways (see the sketch following this list).
- Identify opportunities for improvement. Use the model to look for the best opportunities for improvement – those where you can get the biggest value with the smallest investment.
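To illustrate the matrix and aggregation steps above, the sketch below shows one possible shape for that spreadsheet-style model in Python with pandas; every problem, activity, impact domain and dollar figure is an invented placeholder, not a measured value.

```python
# Illustrative cost matrix and aggregation model. All entries are made up.
import pandas as pd

cells = pd.DataFrame(
    [
        # problem,            location,      activity,             impact domain,           estimated cost
        ("duplicate orders",  "order entry", "manual rework",      "operating cost",         40_000),
        ("duplicate orders",  "order entry", "expedited shipping", "operating cost",         25_000),
        ("invalid addresses", "fulfillment", "returned shipments", "customer satisfaction",  15_000),
    ],
    columns=["problem", "location", "activity", "impact_domain", "cost"],
)

# Summarize the same cells in different ways, as a spreadsheet model would.
print(cells.groupby("problem")["cost"].sum())         # cost per data quality problem
print(cells.groupby("impact_domain")["cost"].sum())   # cost per impact domain
print(cells.pivot_table(index="problem", columns="impact_domain",
                        values="cost", aggfunc="sum", fill_value=0))
```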
The data quality scorecard is a framework for calculating the return on investment of a data quality improvement project. The scorecard can be used as a management tool in which any suggested improvement is weighed against the cost of designing and implementing it, along with a time frame for implementation. Ultimately, this scorecard can serve as the basis for an ongoing data quality improvement program that will subsequently enhance all of the company's business intelligence efforts.
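As a simple, hypothetical illustration of the return-on-investment arithmetic the scorecard supports: suppose it attributes an annual cost to one problem, a proposed improvement is expected to remove a given fraction of that cost, and the improvement itself has a known implementation cost. All figures below are invented placeholders.

```python
# Hypothetical ROI arithmetic for a single proposed improvement.
annual_cost_of_problem = 80_000    # scorecard total attributed to the problem
expected_remediation_rate = 0.75   # fraction of that cost the fix should remove
implementation_cost = 30_000       # one-time cost to design and deploy the fix

annual_savings = annual_cost_of_problem * expected_remediation_rate
first_year_roi = (annual_savings - implementation_cost) / implementation_cost
print(f"Annual savings: ${annual_savings:,.0f}, first-year ROI: {first_year_roi:.0%}")
# -> Annual savings: $60,000, first-year ROI: 100%
```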
David Loshin is the president of Knowledge Integrity, Inc., a consulting and development company focusing on customized information management solutions including information quality solutions consulting, information quality training and business rules solutions. Loshin is the author of Enterprise Knowledge Management - The Data Quality Approach (Morgan Kaufmann, 2001) and Business Intelligence - The Savvy Manager's Guide and is a frequent speaker on maximizing the value of information. Loshin may be reached at firstname.lastname@example.org.