Practical Tips to Analyze Your Data Quality
Information Management Special Reports, November 2002
Just as a house must have a solid foundation in order to be stable, an IT application needs good data quality in order to be useful. IT and the business community now seem to accept the importance of good data quality in their IT systems. However, data quality often seems to have a rather low priority in the actual implementation of many data warehouse and business intelligence (BI) projects. It is easier to explain and "sell" the user side of a BI project, i.e., the report and analysis tools presenting easily understood tables and, quite often, colorful charts. The fact that many organizations do not have a clue as to the state of their data makes the issue of data quality even more abstract and difficult to explain to these organizations. Describing back-office related data quality is certainly more difficult and less sexy than showing the visual front-office user tools. Nevertheless, no BI solution can function properly without good data quality.
It can be assumed that no operational database is perfect. It is, rather, a question of examining the level of data quality problems that exist, not denying the existence of problems all together.
The causes of data errors are numerous. Data may not exist where it should; and if it does, it may be wrong. A user may unintentionally make an error while typing a value. There may be bugs in the software running the applications, causing erroneous transactions. These errors can cause major damage to an organization in terms of additional costs and loss of revenue and good will. What is clear is that BI solutions make poor data quality much more obvious than operational systems. BI systems deal with aggregated and calculated data, scanning many transactions that may not be equally visible in an operational system.
Advertisement
Operational business rules, not adapted for BI applications, are also often to blame for poor data quality. If half of the customers in an operational database have 01011900 as default birth date when their real date of birth is unknown, the calculated average age for all the customers in the BI application shall be very wrong. This same principle applies to other types of calculations that normally need values for this unknown and which will be incorrect.
In these cases, the operational business rules that define how to treat unknown data need to be examined and sometimes changed in order for the BI application to be useful. If this is not possible, then the business community shall have to be informed about these default values. Otherwise, using these values can create problems when analyzing the information in the BI application. One thought to keep in mind is that the more explanation needed to teach an end user how to use a BI application is proportional to the risk that it will not be used to its full potential.
The good thing in all this is that the analyses needed to perform in order to have a basic comprehension of the present state of data quality in a system are quite basic. Many of them can be done by simply using SQL queries. These analyses are also important to better estimate the budget and deadlines for BI projects; as a major roadblock to successfully implement BI projects is poor data quality and the time and cost it takes to do the necessary corrections.
This article describes some basic, yet powerful and necessary, analyses necessary to estimate the level of data quality in a system. The results of these guidelines and their importance can be easily understood by IT as well as the business community – who can then more easily communicate and agree on what needs to be done in order to improve the data quality.
Basic Data Quality Analyses
The data quality analysis can be done relatively quickly by a company of any size. The analyses work on any system from flat files such as spreadsheets to enterprise- wide relational databases. Furthermore, these analyses can be versioned and reused, making it possible to see how effective the taken actions to correct the errors are.
Analyzing Individual Values
The most basic, yet powerful, data quality analysis consists of querying each and every individual column in a database that is to be used by BI applications. Each column is examined in order to find and sort all the individual values, together with their frequency. When the values are sorted numerically or alphabetically, depending on their type, non-existing, extreme and faulty values are usually found in the beginning or the end. These values are often default values entered into the operational system when the actual ones are not known.
Table 1 shows an example of birth dates, that have been sorted, and their frequency. Non-existing, suspicious or outright faulty birth dates can be found in the beginning or in the end of the table. Note that the date-format may have to be modified in order to allow for a correct sorting. In Table 1, the format is YYYYMMDD.

Table 1: Suspicious or Faulty Values Appear at the Top or Bottom of List
The analysis of individual values, also called the single column analysis, will detect data format errors and, to some degree, errors in naming conventions. The verification of address fields in a database is an example of a data format analysis. ZIP codes are something that usually has a specific format for a given country, for example five digits. In that case, they should not contain four digits and certainly not letters. Another example is street addresses: in some countries the street number is put before the street name and in others after. These street address rules can easily be detected by using SQL match-pattern functionality.
The single column analysis is of great importance when the data is to be used directly by the business community, including external parties, or when it is to be included in calculations. This is typically how data is used in BI applications. Because of this, columns with a high degree of non-existing values or default values may not even be of interest to these applications. If they are, it shall have to be considered how to present these data to the users.
The single column analysis can be used on any type of information, e.g., birth dates, ZIP codes and names. In order to present the results of the single column analysis, a table as shown in Table 1 can be used. A pie chart can also be used, making the presentation even more comprehensible to the business community by visualizing the state of the present data quality in a system.
Figure 1 shows an example that highlights the frequency of non- existing, suspicious or clearly erroneous values in a pie chart. Even though there may be many unique values, the existence of suspicious and faulty ones usually make up a part big enough to clearly distinguish itself among the correct values in a pie chart.

Figure 1: Pie Chart Representing Frequency of Non-Existing, Suspicious or Erroneous Data
Analyzing Coupled Values
Once the basic single column analysis has been done, it is time to analyze relations between different columns containing related information. This analysis, also called the double column analysis, verifies the correspondence between codes and descriptions or names, no matter whether it concerns products, customers, suppliers, etc.
The analysis will discover if a description has several codes, e.g., a customer name having different customer codes. This may indicate that the customer is duplicated, something that can result in additional costs when doing mailings as well as loss of good will when several letters are sent to the same person.
The double column analysis will also detect if a specific code has several descriptions. This is a common problem, meaning that a code may mean several things and thereby not be a unique identifier for products, customers, etc.
In doing this kind of analysis, comparisons between the data column containing the codes and the column containing the descriptions or names are done. Table 2 shows an example of the result of the double column analysis. The codes existing several times are shown first, sorted according to their frequency. Normally, a code should only exist once, having one label, or there should be a clear indication if there are different versions over time permitting a code to have several labels.

Table 2: Non- Unique Product Codes and their Frequency
For the presentation to the business community, Table 3 can be used. The column Product Name from Table 2 has been deleted and the column Product Code shows each value only once together with its frequency.

Table 3: Frequency of Unique Codes (Any frequency above 1 can be considered as an error.)
Figure 2 shows that a pie chart can also be used, visualizing the error rates even more clearly.

Figure 2: Proportion of Unique Codes and those Codes Existing More Often
Busting Ghosts
A classical case of data quality errors is business transactions for customers, salespeople or products that do not exist. Even though most data administrators will say that this is not possible because of the referential integrity in the system, experience has shown that this is nevertheless an important and omnipresent problem. What it means is, for example, that the code used for a product in a business transaction does not have a description in the reference product table. What is worse, the code may not exist at all in the product table. These ghost products will however add to any calculations done based on the transactions, e.g., total revenue, even though they are not referenced in their master tables.
Ghost data is easy to detect with an SQL query, taking in the transaction files all the customers, salespeople and products that do not have a label or even a code in their corresponding reference tables. The result, as is shown in Table 4, is a list with these ghosts, together with the number they represent in the transaction tables, e.g., number of units, sales revenue, etc.

Table 4: Customer Codes with Transactions and Sales Numbers
The sum of the transactions for these ghosts can be shown in a pie chart together with the sum of the correctly referenced transactions, in order to see how they skew the total result. Figure 3 below shows an example of such a pie chart.

Figure 3: Seven percent of the revenues cannot be derived from customers referenced in the master customer table.
Verifying Known Business Rules
A business rule tells either what is allowed or what is not permitted, in a business process. That salesperson A can only sell in region one is one example of a business rule, or that product X cannot be sold with more than 35 percent discount.
Most often, there are automatic controls in the system’s software that warns or blocks the user when a data entry does not conform to the defined business rules in a process (a process is a collection of business rules). However, usually these controls do not cover all the possible combinations of mistakes a user can make. One major reason for this is that new business rules tend to develop faster than the system administrators can manage. Data can, therefore, be entered even though it does not conform to a business rule.
Queries can be used to search for transactions that are not conforming to the business rules. An example would be to look for all transactions with salesperson A having sold in regions where he should not. If there are no results to this query, it means that this business rule is correctly implemented and followed.
There are tools that can automatically detect business rules in a system and also possible erroneous transactions. However, with SQL queries it is possible to get an understanding of the state of data quality in the system under examination.
Correcting the Data
Once the data quality analyses have been done, erroneous data has to be corrected. The four main approaches in doing this are:
- Changing the source systems routines for entering data in order to prevent erroneous data from being entered in the first place.
- Correcting erroneous data in the operational system.
- Correcting data before loading the data warehouse which is the base for BI applications.
- Correcting erroneous data in the BI application.
Obviously, option 1 is preferable. However, as this is not always possible for different reasons, such as lack of resources, options 2 and 3 become more interesting, followed by option 4. If option 2 is used, improved user training on how to enter data correctly can diminish the level of erroneous transactions. The method chosen depends on the organization’s resources and priorities.
It should also be noted that 100 percent correctness is often not possible to obtain within reasonable budget and time limits. The acceptable error rate depends on the business and what is expected of the system. (Data quality in the systems piloting airplanes is hopefully higher than for direct marketing databases.)
Administering the Analyses
All the analyses described can be fully automated and run at regular intervals. The SQL queries used can be stored in a repository, simplifying the administration of the analyses.
These analyses should then be run at regular intervals, in order to constantly verify the state of the data quality in the application being analyzed. Furthermore, the results should be compared to the previous analyses in order to see if there are any improvements.
It should also be kept in mind that the queries will be modified over time in order to adapt to new kinds of errors entered into the systems. (Unfortunately, there is no end to the number of errors that can appear in an application.) Therefore, it is necessary to version the analyses done.
The previously mentioned analyses will detect basic yet important data quality problems. These analyses can be done by any organization on both flat files and relational databases. The result can be understood both by IT and the business community, which will facilitate getting the political support needed in order to take corrective measures.
These analyses are also helpful when planning a BI project. Poor data quality is known to be a decisive factor when things go wrong. By assessing the data quality before launching a BI project, its impact on the project can be better estimated.
Please keep in mind that poor data quality is not one person’s responsibility or problem. Everyone in an organization is responsible for quality. A wish to do data quality analyses should not be hindered by people thinking that they will be held accountable for everything that is bad. If this is the attitude in the organization, it has more serious and immediate problems than doing data analyses, as these analyses will in that case be driven by politics instead of sound judgment.
Finally, a good data quality is of little use unless it is also comprehensible by the end users. Even though the quality may be good from the administrators’ point of view, not causing faulty calculations and other problems; it should at the same time be clear. Every added explanation needed to describe exceptional data or a non-intuitive data replacement value when the real value is unknown will complicate the comprehension for the business community. This risks diminishing the value of the BI application. In the end, a BI solution is only as good as its number of users.
Gabriel Fuchs is a senior consultant and business intelligence expert. His column Reality IT takes an ironic look at what real-world IT solutions often look like - for better or for worse. The ideas and thoughts expressed in this column are based on Fuchs' own personal experience and imagination and do not reflect the situation at any particular company. His book, Dealing with Nasty Colleagues: The Art of Winning in Office Politics While Still Getting the Job Done, can be ordered at www.amazon.co.uk. He can be reached at sgfuchs@bluewin.ch.
For more information on related topics, visit the following channels:







