It is becoming increasingly clear that much (probably most) data is of poor quality. Some data is simply incorrect; other data is poorly defined and not understood by its customers; still other data is not relevant to the task at hand. The impact is enormous. Poor quality data is at the root of many issues of national and international importance that dominate the news for weeks at a time. Fortunately, of course, most data quality issues are more mundane. In aggregate, however, they may be even more costly.
Of course, most data quality issues do not announce themselves as such - many people and organizations are simply unaware of their importance. This article aims to shake them from their slumber. It presents a high-level synthesis of so-called data quality disasters and of the everyday issues that bedevil organizations. The most important point is that poor data quality is an unfolding disaster.
- Poor data quality costs the typical company at least ten percent (10%) of revenue; twenty percent (20%) is probably a better estimate.
- Most data quality issues are hidden in day-to-day work. If they think about it at all, most people and organizations conclude that poor data quality is just a fact of life.
- From time to time, a small amount of bad data leads to a disaster of epic proportions. There is no way to tell when or where the next disaster will occur.
This article focuses solely on building awareness. It stops short of offering prescriptions - they are obvious. They involve extending the tried-and-true methods of quality management into the realm of data. We do not claim that doing so is easy - data differs from manufactured products in critical ways. However, the extensions have been made, and they are described in recent books by Michael Brackett, Larry English, David Loshin, Richard Wang, this author and others. Organizations that have applied those prescriptions diligently have made enormous improvements.
The next section of this article defines data quality. The two sections that follow describe recent data quality disasters and mundane data quality issues, respectively. The section after that synthesizes estimates of the cost of poor data quality (COPDQ) to support the overall figures cited above.
Data and Data Quality Defined
After J.M. Juran, we define data to be of high quality "if they are fit for their intended uses in operations, decision making and planning."2 (See Figure 1.) While there are, quite literally, hundreds of dimensions of data quality, relatively few matter most in practice. Almost all customers want data that is relevant to the task at hand, easy to understand and correct.
Figure 1: Data Quality
As with the quality of manufactured goods, high-quality data stems from well-defined and well-managed processes that create, store, move, manipulate and use the data. Thus, data quality involves "getting the right and correct data in the right place at the right time to complete the task at hand."
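To make the notion of well-defined processes concrete, the following is a minimal sketch of explicit, rule-based data quality checks. It is illustrative only: the CustomerRecord fields, the correctness and currency rules, and the one-year freshness threshold are all assumptions introduced here, not prescriptions from this article.

```python
# A minimal sketch of rule-based data quality checks.
# The fields, rules and thresholds below are illustrative assumptions.
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class CustomerRecord:
    customer_id: str
    email: str
    last_updated: date


def is_correct(rec: CustomerRecord) -> bool:
    # Correctness proxy: required fields present and plausibly formed.
    return bool(rec.customer_id) and "@" in rec.email


def is_current(rec: CustomerRecord, as_of: date, max_age_days: int = 365) -> bool:
    # Currency: the "facts" in the record are not out of date.
    return (as_of - rec.last_updated) <= timedelta(days=max_age_days)


def quality_report(records: list[CustomerRecord], as_of: date) -> dict[str, float]:
    # Fraction of records passing each check; 1.0 means all records pass.
    n = len(records) or 1
    return {
        "correct": sum(is_correct(r) for r in records) / n,
        "current": sum(is_current(r, as_of) for r in records) / n,
    }


if __name__ == "__main__":
    records = [
        CustomerRecord("C001", "pat@example.com", date(2003, 6, 1)),
        CustomerRecord("", "no-email", date(1999, 5, 7)),  # fails both checks
    ]
    print(quality_report(records, as_of=date(2003, 12, 1)))
```

A real deployment would, of course, codify many more dimensions (completeness, consistency, relevance) and tie the rules back to the processes that create and move the data.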
Data Quality Disasters in the News
For the past several years, data quality disasters (though not, of course, labeled as such by the news media) have occurred with striking frequency. These disasters have dominated the national and international news for weeks at a time. The next several paragraphs highlight five data quality disasters, in chronological order, as they appeared in the media.
In May 1999, during the NATO air campaign in the Kosovo conflict, the United States inadvertently bombed the Chinese Embassy in Belgrade.3 The bombing stemmed directly from a data error: the "facts" about what was located at the intended target were simply out of date. Instead of a legitimate target, the bombs struck the Chinese Embassy, killing three Chinese citizens.
The data quality disaster of 2000 was the U.S. presidential election. Most people know a sketch of the facts. The election hinged on the vote counts for George W. Bush and Al Gore in Florida. For weeks, the national and international press followed the machinations of the candidates and various levels and branches of the federal and Florida governments as they pressed their cases, counted and recounted votes, tried to decide whether a "hanging chad" signified voter intent and maneuvered for advantage before the Supreme Court. In the end, of course, the State of Florida certified that Mr. Bush had carried the state, and he won the election.
Since the election, a number of organizations have reexamined both the results and the underlying processes. Most conclude that George W. Bush was indeed the winner in Florida.4 However, deeper analysis of voting processes (voter registration, ballot design and testing, vote counting and so forth) reveals fundamental issues. For example, a CalTech-MIT report on the quality of the electoral process concluded that the national vote could be accurately counted only to within two percent. With roughly 100 million votes cast in 2000, that uncertainty amounts to some two million votes - orders of magnitude larger than the 537-vote margin that decided Florida. Results may be even worse in some locations.5
One might take comfort if aggressive efforts to rectify voting irregularities had proven successful. Not so. In the recent recall election in California, two independent studies found that more than 383,000 ballots - 4.6% of those cast - did not record a valid vote on the recall question.6