Rethinking the Dimensions of Data Quality
A few months ago, I wrote a column asking if the dimensions of data quality, such as accuracy, consistency and timeliness, are real. I pointed out that there are no generally accepted definitions for the dimensions, no generally accepted exhaustive list of them and no generally accepted methodologies for measuring each one.
Since the column was published, I have been "encouraged" to say something a little more positive on this topic – something that will help practitioners deal with the daunting problems of data quality. I agree that being negative is not that helpful, although it is refreshing to have a frank conversation about what really underlies terms that are often thrown about our industry.
The Road to Abstraction
One argument in favor of having dimensions of data quality is that data quality is such an enormous space that we cannot deal with it effectively unless we break it down into subareas. I think it is better to say that data quality represents a large, complex set of issues and that we need to tease out individual types of issues, each with its own specific problems and requiring its own specific methods to deal with it. This is more of a bottom-up approach.
However, it seems to me that the first view prevails. The top-down approach is repeatedly taken, with attempts made to break data quality up into different dimensions. Here I am talking from experience: I have seen this approach first-hand at a number of conferences and industry initiatives.
What this top-down approach focuses on are abstractions, and we need to understand abstractions to appreciate what is going on. There are many different classes of abstraction, but the one involved here is the process of turning a property into an object. This is a source of argument among philosophers, going back to Plato, who believed that such abstractions really exist in some part of the universe, just as much as material objects do.
This form of abstraction is illustrated in the following example. Imagine I am represented by a customer record in a database of Enterprise X. This record holds my date of birth. If the value is my actual date of birth, we can agree it is completely accurate. If the month and year of this value are correct, but not the day, then we can say it is reasonably accurate. If the year is correct, but not the month and day, we can say it is moderately accurate. If day, month and year are incorrect, we can say it is not accurate at all. We are using the term "accurate" to describe the quality of the relationship between the data value and the reality it is trying to represent.
Human beings then make the leap from using "accurate" as an adjective to using "accuracy" as a noun. Our language allows us to do that, but that does not mean reality has to go along with us. We have created a type of abstraction and this type of abstraction (a) is not instantiated, (b) does not bear properties and (c) cannot enter into causal relationships. The concept "dog" is instantiated in my pet Leo, who weighs about eight pounds and knows perfectly well how to manipulate me into feeding him. The concept "accuracy" is not instantiated anywhere. We do not see "accuracies" lying around in the universe, or having attributes like color or weight, and “accuracy” does not enter into causal relationships. We can say that the birth date example above "represents an instance of accuracy," but this is a bewitchment of our language. Just because we can say it does not make it so. It’s better to say the example "has the property of being accurate."
But We Need the Dimensions of Data Quality
So far, this is still of little help to the practitioner. We know that data quality is large and complex, and it is our duty to improve it as much as we can for the enterprises we work for.
Here, I think the bottom-up approach is better. This does not view a dimension like "accuracy" as an abstracted object with a single definition. Rather, each dimension is a complex area that has its own structure, problems and methods.
If I think about accuracy, I can ask a series of questions, such as:
- Is the thing being represented covered by the definition used for the entity/table in the database?
- Is the attribute of the thing being represented covered by the definition used for the attribute/column in the database?
- Does the thing represented by a record in the table in the database actually exist?
- Does the value held in the column of the record in the table in the database objectively represent the expression of the attribute of the thing?
- Does the value held in the column of the record in the table in the database subjectively represent the expression of the attribute of the thing to the extent needed to meet business requirements?
There are very likely even more questions pertaining to accuracy that can be asked. This shows that accuracy is a complex of many concepts, not a single concept. If you object that each question needs to be broken out as a different dimension, then you are going to end up with an awful lot of dimensions, as I have many such questions for each of the traditional dimensions. The traditional dimensions give the illusion that each is only answering one question, because somehow each has a single definition.
The questions cover the structure and problems of accuracy. Methods are another facet that needs to be addressed. Some methods include:
- Testing the entity/table definitions with data producers to see if they correctly classify instances to the entity/table.
- Checking that the attribute/column has not been deliberately repurposed by operational staff to hold something other than what the official definition describes.
- Sampling records and auditing that these represent real-world instances.
- Sampling data values and independently measuring the attributes of the things they represent.
- Sampling data values, independently measuring the attributes of the things they represent and comparing these to the tolerances allowed in each business use case.
I suggest that this bottom-up approach is the way to deal with the scale and complexity of data quality. Each dimension is not a single concept with a single definition, but a complex area with its own logical structure, specific problems and special methods.