For as long as I can remember, data quality has been defined as “fitness for use.”
Variants of this definition exist, which substitute “information” for “data” and “expectations” for “fitness,” but these variants all boil down to the same thing: fitness for use. This definition is important, because it affects the way we understand data quality and influences the way we try to deal with it. But suppose the definition is inappropriate. That would mean that we may not be dealing efficiently or effectively with data quality. I think a strong case can be made that the definition is indeed inappropriate and should be replaced with a better one.
The Role of Data
Before we get into the definition of data quality, let us take a brief look at what data is related to. Figure 1 summarizes this situation.
Data represents something: a thing, event or concept. Data is understood by something, for which the best term I can find is the “interpretant.” This term was coined by C.S. Peirce and is described by the Stanford Encyclopedia of Philosophy as follows:
“The interpretant of a sign is said by Peirce to be that to which the sign represents the object.”
I am not going to discuss this further, but the interpretant can be thought of as a mind or machine that can “understand” the sign (that is, the data). The interpretant applies the data to one or more uses, which achieve objectives the interpretant has.
The Uses of Data
From Figure 1, we can see that the interpretant is independent of the data. It understands the data and can put it to use. But if the interpretant misunderstands the data, or puts it to an inappropriate use, that is hardly the fault of the data, and cannot constitute a data quality problem.
For instance, I once made the mistake of thinking that State of Billing Address was really State of Residence. I had no data dictionary for the data I was working with, and I guessed what it meant, incorrectly as it turned out. That was my problem, not the data’s. On another occasion, I used Credit Card Overlimit Fees as part of the calculation of Finance Charges, again incorrectly. The data quality of Credit Card Overlimit Fees was good, and I correctly understood what it meant, but it was not supposed to be used for calculating Finance Charges. Using the data for something it was not supposed to be used for was, again, my fault and not a data quality problem.
Another issue with data that is “fit for use” occurs when data is deliberately faked in order to gain some kind of advantage. This has been known to happen in scientific disciplines. For instance, the journal Science recently reported that Diederik Stapel of Tilburg University in the Netherlands (known to his colleagues as the “Lord of the Data”) admitted falsifying data in many of his published papers. Presumably, Stapel gained some advantage from this behavior. So the data was fake, but perfectly fitted to Stapel’s use of it.
A more fundamental problem is that data can have many uses. If we think that data quality is “fitness for use,” then data quality must be assessed independently for each use we put it to. This, in turn, means that “data quality” cannot be a property of data itself, but must be an aspect of each different relationship between the data and the particular uses to which it can be put. I am not arguing that this is wrong, but most data professionals I have worked with think of data quality as a property of data itself and not something that is independently assessed on a use-by-use basis.
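To make the one-to-many point concrete, here is a minimal Python sketch of what “fitness for use” implies in practice: the same records get a separate score for each use. The records, field names, and the two checks (`fit_for_invoicing`, `fit_for_email_campaign`) are hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[object]]

# Hypothetical customer records; the field names are illustrative only.
records: List[Record] = [
    {"customer_id": 1, "billing_state": "NY", "email": "a@example.com"},
    {"customer_id": 2, "billing_state": "CA", "email": None},
    {"customer_id": 3, "billing_state": "TX", "email": "c@example.com"},
]


def fit_for_invoicing(record: Record) -> bool:
    """Assumed rule: invoicing needs a billing state."""
    return record["billing_state"] is not None


def fit_for_email_campaign(record: Record) -> bool:
    """Assumed rule: an email campaign needs an email address."""
    return record["email"] is not None


# Each use brings its own test, so there is one score per (data, use) pair.
uses: Dict[str, Callable[[Record], bool]] = {
    "invoicing": fit_for_invoicing,
    "email_campaign": fit_for_email_campaign,
}


def fitness_by_use(
    data: List[Record], checks: Dict[str, Callable[[Record], bool]]
) -> Dict[str, float]:
    """Return the share of records that pass each use-specific check."""
    return {
        name: sum(1 for record in data if check(record)) / len(data)
        for name, check in checks.items()
    }


scores = fitness_by_use(records, uses)
```

The same three records are fully fit for one use and only partly fit for the other, so under this definition there is no single quality score that belongs to the data alone.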
So What is Data Quality?
If “fitness for use” is not a good definition, then what is? Going back to Figure 1, I would propose that data quality is an expression of the relationship between the thing, event, or concept and the data that represents it. This is a one-to-one relationship, unlike the one-to-many relationship between data and uses. Therefore, I would propose the definition of data quality as: “the extent to which the data actually represents what it purports to represent.”
This definition can be used to think of data quality as a property of the data itself, which seems more natural. It also alludes to the existence of metadata that links the data to what it is representing, part of what the term “purports” is intended to convey. Such metadata may exist formally, or may merely be informal conventions known within some kind of community. But this kind of metadata must exist somewhere or the data will be unusable.
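As a rough illustration of the proposed definition, the Python sketch below scores a data element by how often its recorded values match what they purport to represent. It assumes a trusted reference for the real-world values is available, which is often the hard part in practice. The `FieldMetadata` class, the `representation_quality` function, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Hashable


@dataclass(frozen=True)
class FieldMetadata:
    """Hypothetical metadata linking a data element to what it purports to represent."""
    name: str
    purports_to_represent: str


def representation_quality(
    recorded: Dict[Hashable, object], actual: Dict[Hashable, object]
) -> float:
    """Share of recorded values that match the real-world values they stand for.

    `actual` is an assumed trusted reference keyed the same way as `recorded`.
    A recorded value with no reference entry counts as a mismatch.
    """
    if not recorded:
        raise ValueError("no recorded values to assess")
    matches = sum(
        1 for key, value in recorded.items()
        if key in actual and actual[key] == value
    )
    return matches / len(recorded)


meta = FieldMetadata(
    name="state_of_billing_address",
    purports_to_represent="US state of the customer's billing address",
)

# Illustrative values only: customer 4 is recorded as FL but is actually billed in GA.
recorded = {1: "NY", 2: "CA", 3: "TX", 4: "FL"}
actual = {1: "NY", 2: "CA", 3: "TX", 4: "GA"}

quality = representation_quality(recorded, actual)
```

Note that the score takes no account of any particular use. It depends only on the data, the metadata that says what the data claims to represent, and the thing represented.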
What about Fitness for Use?
While I think that we have found a more appropriate definition of “data quality,” I do not want to deny the existence of a set of valid concepts that deal with types of problems around the use of data. These problems arise when:
- The interpretant misunderstands the data.
- The interpretant uses data for a purpose that is incompatible with the data.
- Data is faked and used for illegal or unethical purposes.
There are probably additional types of problem related to use of data, too. I do not think that we can classify these kinds of problems as issues of “data quality.” It would be better to find a different set of terms to identify them.
If we accept the definition of data quality that I have proposed, then our diagnosis and remediation efforts will focus on the special problems of the relationship between data and what it represents. The special problems of the relationships between data and what it is used for will require a different set of approaches and should be called something other than “data quality.”