A major bank scrapped a $29 million data warehouse to start from scratch because it failed to understand and avoid the mistakes described in this column. This is a tragedy because, three years ago, the bank was warned that its data warehouse plan was flawed. The real tragedy, however, is the opportunity lost to the nonquality data in the warehouse.

This column contains two of the "Ten Mistakes to Avoid If Your Data Warehouse Is To Deliver Quality Information." (A white paper of the same title is now available at www.infoimpact.com.)

The definition of a wise person is "someone who learns from the mistakes of others." Are organizations being wise? Are they learning from their own data warehousing mistakes, and will they learn from the mistakes of others?

Mistake: Assuming the source data is okay because the operational systems seem to work just fine. An insurance company was unpleasantly surprised by its data warehouse analysis of newly loaded claims data. The data showed that a whopping 80 percent of the company's claims were paid for a diagnosis of "broken leg." One quick call revealed an all-too-common occurrence: the claims processors routinely (80 percent of the time) allowed the system to default to a diagnosis code of "broken leg" because this was faster than entering a code, and they were paid for the number of claims they processed per day. Performance measures that reward speed over accuracy in this way are the root cause of many information quality problems.

The valid but inaccurate data value of "broken leg" was of sufficient quality to pay a claim. However, the same data was not "quality data" for the process of analyzing risk.

Operational data will be of sufficient, or minimally acceptable, quality to support the operational processes within the performance measures of the information producers, but not necessarily of acceptable quality for other processes.

This dramatically illustrates the absolute denial most organizations exhibit as to the extent of their information quality problems. Most organizations assume that because the operational processes (seem to) perform properly, the same data must be okay to satisfy decision-support processes.
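
A simple profiling pass over the operational data would have exposed this pattern long before the warehouse load. The sketch below, written in Python with pandas, flags any code value that dominates a column's frequency distribution as a likely keying default; the file name, column name and 50 percent threshold are illustrative assumptions, not details from the actual case.

  # Sketch: flag suspiciously dominant code values that may be keyed-in defaults.
  # The claims file, column name and 50 percent threshold are assumptions for illustration.
  import pandas as pd

  claims = pd.read_csv("claims.csv")

  freq = claims["diagnosis_code"].value_counts(normalize=True)
  dominant = freq[freq > 0.50]   # any single code on more than half the claims is suspect

  for code, share in dominant.items():
      print(f"Diagnosis code {code!r} appears on {share:.1%} of claims -- "
            f"possible system default; verify against the source documents.")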

Because most of today's applications and databases were designed to support only departmental, functional needs, they may contain "quality" data for those operational processes only. There are many root causes for this, including, but not limited to:

  1. Obsolete, single-purpose database designs that support only a functional view will, by definition, not support general-purpose information requirements. A credit card application that needs to know only the month and year of a birth date will never support the customer-touchpoint requirement of sending a birthday greeting.
  2. The, "this is our data; we use it the way we want to" mentality.
  3. Performance measures that reward speed regardless of the cost nonquality information imposes on downstream information customers and processes.

For these reasons, you must analyze the state of quality of the source databases or files. Your information quality assessment should include the following steps (a brief sketch of the first two appears after the list):

  1. Basic data analysis or profiling, such as completeness analysis, value frequency distribution, outlier analysis of low-occurrence values and reasonability analysis (e.g., consistency of title and gender).
  2. Validity assessment to determine that you have a valid value (as defined) and that it conforms to all business rules (e.g., only females may have a medical procedure of hysterectomy).
  3. Accuracy assessment that compares the data values to the real-world objects or events they represent. The accuracy assessment confirms that each value is the correct value (e.g., that a recorded hysterectomy procedure wasn't actually an appendectomy).
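
To make the first two assessments concrete, here is a minimal sketch in Python with pandas. The extract, column names, code values and business rules are assumptions chosen for illustration; substitute your own data definitions and rules.

  # Sketch of basic profiling and rule-based validity checks.
  # Column names, code values and rules are illustrative assumptions.
  import pandas as pd

  df = pd.read_csv("patient_extract.csv")

  # Profiling: completeness (share of populated values per column).
  completeness = 1 - df.isna().mean()

  # Profiling: value frequency distribution and low-occurrence outliers.
  freq = df["procedure_code"].value_counts()
  rare = freq[freq < 5]

  # Profiling: reasonability -- title should be consistent with gender.
  suspect_titles = df[(df["title"] == "Mr.") & (df["gender"] == "F")]

  # Validity: values must come from the defined domain...
  invalid_gender = df[~df["gender"].isin({"F", "M"})]

  # ...and conform to business rules (assumed code "HYST" for hysterectomy).
  rule_violations = df[(df["procedure_code"] == "HYST") & (df["gender"] != "F")]

  print("Completeness by column:\n", completeness)
  print("Low-occurrence procedure codes:\n", rare)
  print("Title/gender inconsistencies:", len(suspect_titles))
  print("Invalid gender codes:", len(invalid_gender))
  print("Business-rule violations:", len(rule_violations))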

While this last assessment, accuracy, is the most difficult, it should be performed for all A-priority and most B-priority data. One company found no invalid values for Marital-Status-Code, but discovered that 23.3 percent of those "valid" values were incorrect! Without an accuracy assessment, you may have a false sense of security as to the true quality of the data.
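
Because accuracy can be confirmed only against the real-world object or event, it is normally measured on a random sample that is verified by hand (confirming with the customer, pulling the source document and so on). The sketch below shows the calculation once the verified values come back; the file and column names are hypothetical.

  # Sketch: estimate accuracy from a manually verified random sample.
  # File and column names are hypothetical.
  import pandas as pd

  df = pd.read_csv("customer_extract.csv")

  # Draw a random sample and send it out for real-world verification.
  sample = df.sample(n=200, random_state=1)[["customer_id", "marital_status_code"]]
  sample.to_csv("accuracy_sample.csv", index=False)

  # After verification, the file comes back with a "verified_code" column added.
  verified = pd.read_csv("accuracy_sample_verified.csv")
  accuracy = (verified["marital_status_code"] == verified["verified_code"]).mean()
  print(f"Estimated accuracy: {accuracy:.1%} "
        f"({1 - accuracy:.1%} of otherwise valid values are incorrect)")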

A quality assessment will inform you as to the state of quality of the data. By conducting prototype data-correction activities, you can estimate the level of effort for corrective maintenance.

Mistake: Focusing on performance more than information quality in data warehousing. The name implies that the product of a data warehouse is "data." Actually, the product is information derived from data: information that supports important decisions, identifies trends that lead to significant new opportunities and identifies potential problems before they become enterprise-threatening problems. Information quality (consistently meeting all knowledge-worker and end-customer expectations) will always be the most important objective in the data warehouse project.1 However, performance and response time are also important, though not to the degree that they could cause enterprise failure, as nonquality information can.

Of what value is an answer to a query in five seconds instead of five hours if the result is wrong? If a five-second response time is a real requirement, chances are good the data warehouse is solving the wrong problem, one that must be solved in the operational system.

The worst-case result of this mistake is that the enterprise will make bad decisions faster than ever. The best-case scenario is that the very customers the data warehouse seeks to satisfy will not be satisfied. They will not use, or will minimize their use of, the data warehouse, leaving you with credibility problems for the next project you propose.

While response-time performance is a requirement, it must always be secondary to the reliability of the information results. There is an important difference between the response time required when you are serving a customer in an operational transaction and when you are analyzing data to support a decision. Timeliness of information delivery, or response-time performance, must be balanced with the other information quality characteristics of completeness, accuracy, definition completeness and correctness, and presentation clarity and intuitiveness. The wrong answer to a query in superior response time is still a wrong answer. It can cause a lack of credibility in the data warehouse and threaten the success of the enterprise.

The product of the data warehouse is, after all, information organized to support the strategic and tactical processes of the enterprise such as trend analysis, key performance indicators or core business indicators, and decisions that impact the future of the enterprise.

Information quality, in all its various characteristics (accuracy, completeness and timeliness), must be a primary goal of the data warehouse. Making either of these two mistakes, or any of the eight others, can put both the data warehouse project and your enterprise at risk.2

What do you think? Let me know at Larry.English@infoimpact.com.

References:

  1. English, Larry P. Improving Data Warehouse and Business Information Quality. New York: John Wiley & Sons, 1999.
  2. "Ten Mistakes to Avoid If Your Data Warehouse Is To Deliver Quality Information," white paper available from www.infoimpact.com.
