The importance of data quality to the strategic planning of any organization cannot be stressed enough. In a recent report, The Data Warehousing Institute (TDWI) estimates that data quality problems cost U.S. businesses $600 billion each year. Time and time again, however, people claim that they cannot justify the expense of a data quality strategy. Others simply do not acknowledge the benefits.

While a data quality strategy is important in any setting, it takes on new significance when implementing a data warehouse. The effectiveness of a data warehouse rests on the quality of its data. The data warehouse itself is not inherently responsible for cleansing data; if it were, the same error-prone data would flow to the warehouse again and again, needing to be cleansed repeatedly with each iteration. The best place to focus data cleansing efforts is at the source – in production, before the data is loaded into the data warehouse. By cleansing data in production instead of in the data warehouse, organizations save time and money.

Data Quality Problems in the Real World

The July 1, 2002, edition of USA Today ran an article entitled "Spelling Slows War on Terror." The article demonstrates how hazardous data (poor data quality) can hurt an organization. In this case, the organization is the U.S., and we are all partners. The article cites confusion over the appropriate spelling of Arab names and links this confusion to the difficulty U.S. intelligence experiences in tracking suspects. The names of individuals, their aliases and their alternative spellings are captured in FBI, CIA, Immigration and Naturalization Service (INS) and other agencies' databases.

Figure 1 clearly shows the data conflicts between organizations. In this example, each of the four agencies made independent inquiries to identify common aliases for "Gadhafi." Through careful examination, we see that each source derived a different answer set. The result is that organizations cannot effectively leverage shared information because their data sets do not match.


Figure 1: Data Flow in Government Agencies

The Seriousness of Data Quality Problems

A CIA official conducting a search on Libyan leader Moammar Gadhafi found more than 60 alternate spellings of his name. Some of the alternate spellings are listed in Figure 2 and were seen in Figure 1. In this example, we are talking about someone who is believed to have supported terrorist-related activities and is the leader of an entire country, yet we still cannot properly identify him.


Figure 2: Alternate Spellings of Libyan Leader's Surname

Note that this example was obtained through the sampling of CIA data only – imagine how many more alternate spellings of Gadhafi one would find upon integrating FBI, INS and other sources. Fortunately, most of us are not trying to save the world, but data quality might save our business!

Why Data Quality Problems Go Unresolved

Problems with data quality are not unique to government; no organization, public or private, is immune to this problem. Excuses for doing nothing about it are plentiful:

  1. It costs too much to replace the existing systems with data-sharing capability.
  2. We could build interfaces into the existing systems, but no one really understands the existing data architectures of the systems involved.
  3. How could we possibly build a parser with the intelligence to perform pattern recognition for resolving aliases, let alone misspellings and misidentifications?
  4. There is simply no way to project return on investment for an effort such as this.

Quite similarly, the USA Today article cited the following three problems, identified publicly by the FBI and privately by CIA and INS officials:

  1. Conflicting methods are used by agencies to translate and spell the same name.
  2. Antiquated computer software at some agencies won't allow searches for approximate spellings of names.
  3. Common Arabic names such as Muhammed, Sheik, Atef, Atta, al-Haji and al-Ghamdi add to the confusion (i.e., multiple people share the same name).

Fraudulent Data Quality Problems

To further complicate matters, a New York Times article published July 10, 2002, confirmed that at least 35 bank accounts had been acquired by the September 11, 2001, hijackers during the previous 18 months. The hijackers used stolen or fraudulent data such as names, addresses and social security numbers.

Data Collection

Whether you're selling freedom or widgets, whether you service tanks or SUVs, you have been collecting data for a long time. Most of this data has been collected in an operational context, and the operational life span (approximately 90 days) of data is typically far shorter than its analytical life span (endless). This is a lot of data – with a lot of possibilities for quality issues to arise. Chances are high that you have data quality issues that need to be resolved before you load data into your data warehouse.

Solutions for Data Quality Issues

Fortunately, there are multiple options available for addressing data quality problems. We will describe three specific options here:

  1. Build an integrated data repository.
  2. Build translation and validation rules into the data-collecting application.
  3. Defer validation until a later time.

Option 1: Integrated Data Warehouse

The first and most unobtrusive option is to build a data warehouse that integrates the various data sources, as reflected in the center of Figure 3.


Figure 3: Integrated Data Warehouse

An agreed-upon method for translating the spellings of names would be universally applied to all data supplied to the integrated data warehouse, regardless of its source. Extensive pattern-recognition search capability would be provided to search for similar names that may prove to be aliases in certain cases.
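
As a minimal sketch of what such a pattern-recognition search might look like, the following Python fragment scores candidate names against a query using the standard library's difflib module. The 0.7 threshold and the sample spellings are assumptions chosen for illustration, not a prescribed implementation.

    # Illustrative similarity search for potential aliases (standard library only).
    from difflib import SequenceMatcher

    def similar_names(query, candidates, threshold=0.7):
        """Return candidate names whose similarity to the query meets the threshold."""
        matches = []
        for name in candidates:
            score = SequenceMatcher(None, query.lower(), name.lower()).ratio()
            if score >= threshold:
                matches.append((name, round(score, 2)))
        return sorted(matches, key=lambda pair: pair[1], reverse=True)

    spellings = ["Qaddafi", "Kaddafi", "Qadhafi", "Gaddafi", "Smith"]
    print(similar_names("Gadhafi", spellings))
    # The spelling variants all clear the threshold, while "Smith" falls well
    # below it, so the variants surface as candidate aliases for human review.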

The drawback here is that access to quality data is delayed. It takes time for each source to collect its data and then submit it to the repository, where it can be integrated. The cost of this delay varies by industry.

Clearly, freedom fighters need high-quality data on very short notice. Most of us can probably wait a day or so to find out if John Smith has become our millionth customer – or whatever the inquiry may be.

Option 2: Validation Rules

In many cases, we can afford to build our translation and validation rules into the applications that initially collect the data. The obvious benefit of this approach is faster access to high-quality data. The agreed-upon method for translating data is centrally constructed and shared by every data-collection source, and the rules are applied at the point of data collection, eliminating the translation step when data is passed to the integrated data warehouse.
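
As a rough Python sketch of this approach (the rule table, the required fields and the record layout are hypothetical), a single shared module could hold the agreed-upon rules, and every collecting application would call it before a record is stored:

    # Centrally defined translation and validation rules shared by every
    # collection point. Table contents and field names are illustrative only.
    PREFERRED_SPELLINGS = {"qaddafi": "Gadhafi", "kaddafi": "Gadhafi"}  # assumed table
    REQUIRED_FIELDS = ("surname", "country")                            # assumed fields

    def translate(record):
        """Apply the agreed-upon spelling translation at the point of collection."""
        key = record.get("surname", "").lower()
        record["surname"] = PREFERRED_SPELLINGS.get(key, record.get("surname", ""))
        return record

    def validate(record):
        """Reject records missing the fields every source has agreed to supply."""
        missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
        if missing:
            raise ValueError(f"record rejected, missing fields: {missing}")
        return record

    # Each collecting application runs the same two steps before saving a record.
    clean = validate(translate({"surname": "Kaddafi", "country": "Libya"}))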

This approach does not alleviate the need for a data warehouse, and there will still be integration rules to support. However, improving the quality of data at the point it is collected considerably increases the likelihood that this data will be used more effectively over a longer period of time.

Option 3: Deferred Validation

Of course, there are circumstances where this level of validation simply cannot be levied at the point of data collection. For example, an online retail organization will not want to turn away orders upon receipt because the address isn't in the right format. In such circumstances, a set of deferred validation routines may be the best approach. Validation still happens in the systems where the data is initially collected, but does not interfere with the business cycle.
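
As a rough illustration (the queue, the address check and the exception routing are assumptions, not a prescribed design), a Python sketch of deferred validation might accept every order immediately and validate the backlog later, outside the business cycle:

    # Deferred validation: orders are accepted at once and checked later by a
    # scheduled routine, so validation never blocks a sale.
    from collections import deque

    pending_orders = deque()   # orders accepted but not yet validated
    exceptions = []            # orders that fail deferred validation

    def accept_order(order):
        """Take the order immediately; validation happens later."""
        pending_orders.append(order)

    def run_deferred_validation():
        """Run periodically (for example, from a scheduled job) over accepted orders."""
        while pending_orders:
            order = pending_orders.popleft()
            if not order.get("address", "").strip():
                exceptions.append(order)   # route to a correction workflow
            # otherwise the order proceeds to fulfillment untouched

    accept_order({"id": 1, "address": "10 Main St"})
    accept_order({"id": 2, "address": ""})
    run_deferred_validation()   # order 2 lands in the exceptions list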

Periodic Sampling Averts Future Disasters

The obvious theme of this article is to develop thorough data rules and implement them as close to the point of data collection as is feasible to ensure an expected level of data quality. But what happens when a new anomaly crops up? How will we know if it is slowly or quickly becoming a major problem?

There are many models to follow. The EPA, for example, has installed monitors of various shapes and sizes across the continental U.S. and beyond. These monitors take periodic samples of air and water quality and compare the results to previously agreed-upon benchmarks.

This approach proactively alerts the appropriate personnel when an issue arises and can gauge how quickly the problem is growing, indicating how rapidly a response must be mounted.

We, too, must identify the most critical data elements in the sources we manage and develop data quality monitors that periodically sample the data and track quality levels.

These monitors are also good indicators of system stability; they have been known to identify when a given system component is not functioning properly. For example, I have seen retail environments where the technology was not particularly stable and orders were held in a pending status for days. A data quality monitor tracking orders by status would detect this phenomenon early, track its adverse effect and notify the appropriate personnel when a predefined threshold is reached.
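
A data quality monitor of this kind could be as simple as the following Python sketch; the five percent threshold, the two-day age limit and the notify stand-in are hypothetical values chosen for illustration.

    # Periodically sample order statuses and alert when the share of orders
    # stuck in "pending" crosses a predefined threshold.
    from datetime import datetime, timedelta

    PENDING_THRESHOLD = 0.05            # alert if more than 5% of orders are stuck
    MAX_PENDING_AGE = timedelta(days=2)

    def notify(message):
        print(f"ALERT: {message}")      # stand-in for e-mail or paging integration

    def monitor_pending_orders(orders, now=None):
        """Measure how many orders have been pending too long and alert on breach."""
        now = now or datetime.now()
        stuck = [o for o in orders
                 if o["status"] == "pending" and now - o["created"] > MAX_PENDING_AGE]
        rate = len(stuck) / len(orders) if orders else 0.0
        if rate > PENDING_THRESHOLD:
            notify(f"{rate:.1%} of orders pending longer than {MAX_PENDING_AGE.days} days")
        return rate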

Data quality monitors can also be good business indicators. Statistics on the number of orders unfulfilled because of invalid addresses, or on the point in the checkout process at which most customers cancel, can reveal where processes can be improved.

A sound data quality strategy can be developed in a relatively short period of time. However, this is no more than a framework for how the work is to be carried out. Do not be mistaken – a commitment to data quality cannot be taken lightly. It is a mode of operation that must be fully supported by business and technology alike.
