A recurrent theme in data quality is the idea of "dimensions," such as accuracy, consistency and timeliness. The fact that these dimensions actually exist is rarely, if ever, questioned. Indeed, the benefits of the dimensions are regularly discussed, and are commonly thought to include:
- Allowing the complex area of data quality to be subdivided into areas, each of which has its own particular way of being measured;
- Being able to correlate dimensions with specific impacts on business areas;
- Having specific remediation approaches, rather than a one-size-fits-all methodology.
There may also be additional benefits. But do these "dimensions" actually exist as intelligible concepts? I believe a strong case can be made that we are not thinking as clearly as we could in this area, and that there is room for improvement.
What is a Dimension?
In physics, a dimension refers to the structure of space and possibly how material objects are located in time. But what about data quality? Data quality is not an object, like a solid body, and it is not the subject of a higher natural science, like physics. So where does the term “dimension” come from when we talk about data quality? I would suggest that in the context of data quality, dimension is being used as an analogy; the term gives the impression that data quality is as concrete as a solid object and that the dimensions of data quality can be measured. After all, the physical dimensions of length, breadth and depth can be measured.
In data quality, the term dimension could be used interchangeably with criterion, a standard of judgment. Dimension gives the false appearance of something scientific, but the natural sciences cannot apply to data management, since data is immaterial.
That the dimensions can be measured is an astonishing claim, and I keep falling off my chair when people tell me they have, for instance, a measured level of 80 percent data conformity. Measurements are comparisons to an agreed standard: the meter, for instance, used to be defined by comparison with a platinum-iridium bar kept near Paris and measured at the freezing point of water (it is now defined in terms of the speed of light). I agree that quantification of aspects of data quality is needed, but calling these "measurements" gives a cachet that seems hard to justify, especially if we cannot even agree on what "conformity" (for example) is.
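To make the point about an agreed standard concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the sample values, the two rules and the function names are assumptions made for illustration, not definitions taken from any standards body. It scores the same five date strings against two plausible readings of "conformity," one checking only the shape of the value and one also checking that the date exists on the calendar.

```python
import re
from datetime import date

# Hypothetical sample of date strings as they might arrive in a feed.
values = ["2021-03-04", "2021-13-01", "2021-02-30", "2021-3-4", "2021-12-31"]

# Assumed rule A: the value merely has the shape YYYY-MM-DD.
ISO_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")


def conforms_to_shape(value):
    """True if the string looks like YYYY-MM-DD, whatever the digits are."""
    return ISO_SHAPE.fullmatch(value) is not None


def conforms_to_calendar(value):
    """Assumed rule B: the shape is right and the date really exists."""
    if not conforms_to_shape(value):
        return False
    year, month, day = map(int, value.split("-"))
    try:
        date(year, month, day)  # raises ValueError for month 13 or Feb 30
    except ValueError:
        return False
    return True


def conformity_rate(values, rule):
    """Fraction of values that pass the given rule."""
    return sum(rule(v) for v in values) / len(values)


shape_rate = conformity_rate(values, conforms_to_shape)
calendar_rate = conformity_rate(values, conforms_to_calendar)
print(f"shape-only conformity:    {shape_rate:.0%}")
print(f"calendar-valid conformity: {calendar_rate:.0%}")
```

Under these assumptions the same data comes out at 80 percent under one rule and 40 percent under the other. Neither figure is wrong; each is only meaningful once the rule behind it has been agreed.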
Are the Dimensions Credible?
Fuzzy thinking about the dimensions of data quality is rather common. In lists of dimensions, it is not uncommon to see “duplication” listed alongside others, such as “completeness” and “consistency.” Yet it is immediately obvious that the more duplication there is, the lower data quality likely is, while the more completeness there is, the higher data quality is. Thus, the inclusion of “duplication” in a list of dimensions of data quality immediately creates a lack of consistency in the list. If such a list is itself inconsistent, how can it credibly be used to assess data quality?
This is not a word game either. It might seem easy to simply change the term to non-duplication, but duplication properly exists nearly everywhere. It is the basis of ETL processes. Uncontrolled duplication is a bad thing, but controlled duplication can be beneficial. So just what is the problem here? Suppose an item of data is perfect in all respects but is duplicated in an uncontrolled fashion. There might be an inefficiency of data storage and a latent risk of synchronization issues, but surely no data quality problem, yet. This lack of intelligibility in the definition cannot be fixed by simply changing one term into another, and no practical actions for data quality improvement can be inferred from it, since it does not actually describe data quality.
A much more serious problem is that there seems to be no common agreement on what the dimensions of data quality actually are. Table 1 shows a comparison of some of the dimensions provided by both the Enterprise Data Management (EDM) Council and the International Association for Information and Data Quality (IAIDQ). Note that, in Table 1, the IAIDQ entry for "Completeness" has more detail in the breakdown, omitted here to save space, and all of the EDM Council definitions have a set of examples not reproduced here.
What is striking about Table 1 is the divergence in the definitions of the dimensions. For the EDM Council, "completeness" means "missing values," but for the IAIDQ it also includes missing records (the sub-dimension of "Occurrence Completeness"). For the EDM Council, missing records are a full dimension in their own right, "Coverage."
Incompatible ontologies are nothing new, but such an observation is distinctly unhelpful. If there is no clear common understanding of the dimensions, then the data quality discipline is a Tower of Babel, precisely because the dimensions are held to be so foundational.
Are the Dimensions Overabstractions?
A further worry is that each dimension is not a single concept, but is either a collection of disparate concepts or a generalization. The IAIDQ definition of “Completeness” is definitely the former, with “Fact Completeness” covering the attributes of an entity actually captured, “Value Completeness” covering actually populated columns and "Occurrence Completeness" covering records captured in a table. There is really no commonality of behavior, consequence or root cause here. The EDM Council definition of "Conformity" is a generalization, appearing to cover all data-centric aspects of definitions; it is at too high a level to infer practical actions.
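The disparity among the three IAIDQ sub-dimensions can be sketched in a few lines of Python. The table, the expected attributes, the expected record ids and the function names below are all hypothetical, and the formulas are one plausible reading of the descriptions above rather than the IAIDQ's own definitions.

```python
# Hypothetical customer table; None marks a value that was never populated.
expected_attributes = {"id", "name", "email", "phone"}  # what the business needs
captured_attributes = {"id", "name", "email"}           # columns that exist
expected_ids = {1, 2, 3, 4, 5}                          # records that should exist

rows = [
    {"id": 1, "name": "Ann", "email": "ann@example.com"},
    {"id": 2, "name": "Bob", "email": None},
    {"id": 3, "name": "Cy", "email": None},
    {"id": 4, "name": "Dee", "email": "dee@example.com"},
]


def fact_completeness(captured, expected):
    """Share of the needed attributes that are captured at all."""
    return len(captured & expected) / len(expected)


def value_completeness(rows, columns):
    """Share of cells in the captured columns that hold a value."""
    cells = [row.get(col) for row in rows for col in columns]
    return sum(v is not None for v in cells) / len(cells)


def occurrence_completeness(rows, expected_ids):
    """Share of the expected records that are present in the table."""
    present = {row["id"] for row in rows}
    return len(present & expected_ids) / len(expected_ids)


fact = fact_completeness(captured_attributes, expected_attributes)
value = value_completeness(rows, sorted(captured_attributes))
occurrence = occurrence_completeness(rows, expected_ids)
print(f"fact: {fact:.2f}  value: {value:.2f}  occurrence: {occurrence:.2f}")
```

Each function needs a different reference point (a data model, the table's own cells, or an external list of expected records), and each shortfall would be remedied differently. Rolling the three numbers into a single "completeness" score would hide exactly the information needed to act.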
A lot more could be said. Due to these issues, I’m very suspicious of anything I am told about dimensions of data quality. I agree that the benefits listed at the top of this article are sorely needed, but I do not think that the dimensions of data quality help us to achieve them; they might actually be distractions.