Classifying Data Quality
Data quality as a term can be difficult to pin down, though we all have a pretty good idea of what we mean when we use the phrase. Let's start with a simple premise: there is no absolute standard that constitutes quality data. Rather, the quality of data is simply a measure of its fitness for purpose and can vary widely, depending on the purpose at hand.
I have seen retail customers rigorously cleanse cash register logs of duplicate card swipes and transactions in order to derive good quality data for their data warehouse. Their analytics and reports required that these anomalies be purged or the inventory and sales data would not reconcile. However, when the analysts added fraud detection to their requirements, they found duplicate card swipes to be a very useful indicator; fraudsters might try several cards or try one card several times in an attempt to make a purchase. Good quality data for one purpose was useless for another.
With no absolute standard of data quality, there have been several approaches taken to defining the problem space. Most successful has been the work of Richard Wang and others in identifying data quality dimensions.1 These dimensions define ways in which data quality can vary. The most commonly used dimensions are shown in Figure 1.
Figure 1: Data Quality Dimensions
These dimensions are very useful, and there are more, such as security, availability and relevancy. However, they often raise another interesting and important question: what does accurate mean? If we say that to be accurate is to be correct, then we are simply begging the question. It is not enough to have dimensions of data quality. We also need premises on which to base our judgments of how good or bad data is when measured along those dimensions.
What are the premises of data quality? I find it most useful to work with the four premises listed in Figure 2.
Figure 2: Data Quality Premises
Premises and dimensions are powerful tools for clarifying data quality issues, especially in combination. For example, attend to the authority premise and the consistency dimension to identify cases where a five character ZIP code is not enough for your needs, and a 5+4 code is required. Or, look to the patterns premise and the timeliness dimension to identify customers who either do not have recent contacts recorded correctly in the CRM system or who have actually not been contacted for some time.
Using Patterns as a Data Quality Premise
By now, you can likely see that some dimensions and premises are inherently easier than others to evaluate. Authority and accuracy are certainly straightforward together - ZIP codes, for example, need an accurate reference source. Data quality tools have traditionally been strong in these easy cases. Data integration and extract, transform and load (ETL) tools have often been adequate for applying single object-based rules, such as column constraints, even before data reaches the target database. However, pattern-based data quality has not received the same attention although some tools have had simple self-learning capabilities, which help in discovering patterns to an extent.
Expect to see patterns become more important, especially as the forms of data handled by enterprise decision support systems grow more diverse. New data breaks new ground, and its underlying rules may be only partially understood, even by domain experts. Bioinformatics in the scientific field and free-text data mined for marketing from sources such as user blogs are both good examples where boundaries are being broken right now.
Patterns can also be used with more traditional data, for even well-understood data can have rules that are difficult to capture clearly or comprehensively. An automobile manufacturer monitors warranty repair data from dealers, looking for errors, potential recall issues and possibly dubious transactions by dealers looking to slip some charges under the radar. For example, a six-month-old luxury car showing corrosion should raise an alert for further investigation. That rule could be coded quite easily. However, a worn constant-velocity joint would be less surprising in a high-mileage car than in a low-mileage car and may relate to other symptoms, such as tires or suspension.
In such cases, identifying and coding all possible rules to be evaluated would be hugely expensive. In reality, what is needed to assess such complex cases is already there, in the data. If an analyst personally reviews warranty invoices and sees the same dealer reporting corrosion in three luxury cars in a month, they may well be surprised and suspicious. Why? Simply because such cases do not match their pattern of experience. They use a pattern as a premise for assessing the quality of the data. So, patterns can be useful - but how can they be applied in practice?
Data Mining and Data Quality Patterns in SQL Server 2005
Fortunately, there is a way to identify patterns in your data - the techniques of data mining. The model created by a data mining application is, in essence, a learned pattern of experience based on existing data. In the case of our automobile manufacturer, a model capturing association rules present in all warranty repairs to date could quickly identify a corrosion-prone luxury car as an outlier needing further investigation. Similarly, a credit-scoring application may use a clustering model to identify a 21-year-old father of five as an unlikely holder of a gold credit card; he may not fit any existing clusters of known good customers. In both cases, hard-coded rules may be difficult to pin down or simply cumbersome to maintain; but in both cases, a mining model can efficiently identify data that does not fit known good patterns. Better still, by continuously training and retraining the models, new patterns can be found and existing patterns reinforced, enabling a genuinely adaptive solution to complex problems.
In SQL Server 2005, Microsoft has taken this pattern-based approach one step further - or perhaps it is better to say, one step nearer the source. Data mining tools, including those from Microsoft in earlier versions of SQL Server, have traditionally acted in a rather standalone manner, analyzing static data sets and enabling the results of the analysis to be queried. At best, data mining could be integrated as part of the workflow of decision-making in an organization.
In SQL Server 2005, which features the new SQL Server Integration Services application, Microsoft has enabled data mining as an active element not only in workflow, but also in the data flow within an enterprise. Integration Services cannot only populate and train mining models, but the application can also call out to a mining model to assess data, row by row, as a seamlessly integrated stage of the ETL or data movement process. Further, the data mining query can return a wide range of data and metadata that can be subsequently used to redirect or to further transform rows appropriately.
We can contrast this with a classic ETL scenario in which the data warehouse is populated with data that is subsequently mined for added value. In SQL Server 2005, data mining can take place while the data is "in flight" to the data warehouse. This has several advantages, in addition to the uniquely adaptive approach to data quality already described. For example, in-flight data mining for data quality can enable alerts to be raised much earlier in the analytic process: in fact, with suitably low latency data integration techniques, data mining can be part of your operational processes.
Looking back to our data quality dimensions and premises, what can we enable by mining for quality? Accuracy and consistency are natural dimensions to consider. But data mining can also be used carefully to ensure completeness. In its traditional roles, data mining is often used as a predictive process, inferring likely new values from known data. When used in the data integration process, this capability can enable us to infer tentative values for missing data that may be useful in the data warehouse. For example, it may be useful to have information on householders in your customer base, but that information may not be available for all your customers. Data mining in the data flow could impute whether a given customer record represented a householder based on a model that captures the patterns of customers whose status is already known. Naturally, you will want to populate a flag column to ensure that you are aware that this data is inferred - in fact, with SQL Server Integration Services and data mining, you can go further and capture the probability that the prediction is correct and other useful metrics.
Practitioners are consistently looking for new tools and techniques to strengthen their data quality support. Dimensions and premises are useful conceptual aids when addressing quality issues. For practical applications, integration of data mining with data integration - as in Microsoft SQL Server 2005 - adds a new, adaptive and remarkably effective capability for complex and subtle data quality problems that previously were not only difficult to solve, but difficult to articulate.
- Y. Wand and R. Y. Wang. "Anchoring Data Quality Dimensions in Ontological Foundations." Communications of the ACM 39 , no. 11, 86-95.
Donald Farmer is group program manager, SQL Server Integration Services, Microsoft Corporation. He may be reached at
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access