AUG 16, 2012 5:15am ET


Data Quality is Not Fitness for Use


For as long as I can remember, data quality has been defined as “fitness for use.”

Variants of this definition exist, which substitute “information” for “data” and “expectations” for “fitness,” but these variants all boil down to the same thing: fitness for use. This definition is important, because it affects the way we understand data quality and influences the way we try to deal with it. But suppose the definition is inappropriate. That would mean that we may not be dealing efficiently or effectively with data quality. I think a strong case can be made that the definition is indeed inappropriate and should be replaced with a better one.

The Role of Data

Before we get into the definition of data quality, let us take a brief look at what data relates to. Figure 1 summarizes these relationships.

Data represents something: a thing, event or concept. Data is understood by something, for which the best term I can find is the “interpretant.” This term was invented by C.S. Peirce and is described by the Stanford Encyclopedia of Philosophy as follows:

“The interpretant of a sign is said by Peirce to be that to which the sign represents the object.” 

I am not going to discuss this further, but the interpretant can be thought of as a mind or machine that can “understand” the sign (that is, the data). The interpretant applies the data to one or more uses, which achieve objectives the interpretant has.

The Uses of Data

From Figure 1, we can see that the interpretant is independent of the data. It understands the data and can put it to use. But if the interpretant misunderstands the data, or puts it to an inappropriate use, that is hardly the fault of the data, and cannot constitute a data quality problem.

For instance, I once made the mistake of thinking that State of Billing Address was really State of Residence. I had no data dictionary for the data I was working with, and I guessed what it meant, incorrectly as it turned out. That was my problem, not the data’s. On another occasion, I used Credit Card Overlimit Fees as part of the calculation of Finance Charges, again incorrectly. The data quality of Credit Card Overlimit Fees was good, and I correctly understood what it meant, but it was not supposed to be used for calculating Finance Charges. Using the data for something it was not supposed to be used for was, again, my fault and not a data quality problem.

Another issue with data that is “fit for use” occurs when data is deliberately faked in order to gain some kind of advantage. This has been known to happen in scientific disciplines. For instance, the journal Science recently reported that Diederik Stapel of Tilburg University in the Netherlands (known to his colleagues as the “Lord of the Data”) admitted falsifying data in many of his published papers. Presumably, Stapel gained some advantage from this behavior. So the data was fake, but perfectly fitted to Stapel’s use of it.

A more fundamental problem is that data can have many uses. If we think that data quality is “fitness for use,” then data quality must be assessed independently for each use we put it to. This, in turn, means that “data quality” cannot be a property of data itself, but must be an aspect of each different relationship between the data and the particular uses to which it can be put. I am not arguing that this is wrong, but most data professionals I have worked with think of data quality as a property of data itself and not something that is independently assessed on a use-by-use basis. 
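
The one-to-many relationship between data and its uses can be pictured with a minimal Python sketch. The records, the two uses and the function name `fitness_by_use` are illustrative assumptions, not taken from any real system. The point is only that, under “fitness for use,” the same data set receives a separate score for every use.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]

# Hypothetical records: billing state is recorded, residence state is not.
records: List[Record] = [
    {"customer": "A", "billing_state": "NY"},
    {"customer": "B", "billing_state": "NJ"},
]

# Each use brings its own fitness test (both uses are made up for illustration).
uses: Dict[str, Callable[[Record], bool]] = {
    "mail_invoice": lambda r: bool(r.get("billing_state")),
    "residency_tax_report": lambda r: bool(r.get("residence_state")),
}


def fitness_by_use(
    rows: List[Record], tests: Dict[str, Callable[[Record], bool]]
) -> Dict[str, float]:
    """Return, for each use, the share of rows that pass that use's test."""
    return {
        name: sum(1 for r in rows if test(r)) / len(rows)
        for name, test in tests.items()
    }


scores = fitness_by_use(records, uses)
print(scores)  # {'mail_invoice': 1.0, 'residency_tax_report': 0.0}
```

The data has not changed between the two scores; only the use has. That is why a use-based measure cannot be a property of the data itself.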

So What is Data Quality?

If “fitness for use” is not a good definition, then what is? Going back to Figure 1, I would propose that data quality is an expression of the relationship between the thing, event, or concept and the data that represents it. This is a one-to-one relationship, unlike the one-to-many relationship between data and uses. Therefore, I would propose the definition of data quality as: “the extent to which the data actually represents what it purports to represent.”

This definition can be used to think of data quality as a property of the data itself, which seems more natural. It also alludes to the existence of metadata that links the data to what it is representing, part of what the term “purports” is intended to convey. Such metadata may exist formally, or may merely be informal conventions known within some kind of community. But this kind of metadata must exist somewhere or the data will be unusable. 
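
As a rough illustration of the proposed definition, the following Python sketch scores a column by how often its recorded values match the real-world values they purport to represent. The `Column` class, the `purports` field, the function `representation_quality` and the sample values are all hypothetical, and the sketch assumes a trusted record of reality is available for comparison, which in practice is often the hard part.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Column:
    name: str
    purports: str            # metadata: what the data claims to represent
    values: Dict[str, str]   # record key -> recorded value


def representation_quality(column: Column, reality: Dict[str, str]) -> float:
    """Share of recorded values that match the real-world value they purport to represent."""
    if not column.values:
        raise ValueError("column has no values to assess")
    matches = sum(
        1 for key, value in column.values.items() if reality.get(key) == value
    )
    return matches / len(column.values)


billing_state = Column(
    name="billing_state",
    purports="state of the customer's billing address",
    values={"A": "NY", "B": "NJ", "C": "CT", "D": "PA"},
)

# Hypothetical ground truth: customer D's billing address is actually in NY.
actual_billing_state = {"A": "NY", "B": "NJ", "C": "CT", "D": "NY"}

quality = representation_quality(billing_state, actual_billing_state)
print(quality)  # 0.75
```

The score depends only on the data, its metadata and the things represented. No use appears anywhere in the calculation, so the result is the same however the column is later applied.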

What about Fitness for Use?

While I think that we have found a more appropriate definition of “data quality,” I do not want to deny the existence of a set of valid concepts that deal with types of problems around the use of data. These types of problems arise when:

  • The interpretant misunderstands the data.
  • The interpretant uses data for a purpose that is incompatible with the data.
  • Data is faked and used for illegal or unethical purposes.

There are probably additional types of problem related to use of data, too. I do not think that we can classify these kinds of problems as issues of “data quality.” It would be better to find a different set of terms to identify them.


Comments (8)
Dear Mr. Chisholm, Thanks for a really great article that makes a difficult concept understandable to everyone! It explains well the overall premise for data, data quality and the target of correct design. The article, without directly saying so, shows that all data projects require upfront business/requirements analysis to improve the chances that the correct data is collected. I agree that metadata is key. There are so many projects where the deciders like to skip these pertinent steps because, once something is decided, the overall inclination is to 'charge ahead', full throttle, without sufficient exploration/documentation. I'm interested in your view on a question/idea I have: if I wanted to put the diagram elements (from Figure 1) into a data model, to me, Interpretation characteristics would describe the overall idea of how data should/would be used, whereas Interpretant characteristics would probably describe a role, a person, or an algorithm of some type. What would your thoughts be on altering one of the relationships in the diagram to 'Interpretation' from 'Interpretant', with a left-to-right verb association like 'Data is understood as (an) Interpretation', so that the Interpretation subsequently applies Data to 'a' particular use? Whether or not the use is correct would be dependent on the interpretation. Thanks again.
Posted by Robert A | Thursday, August 16 2012 at 1:37PM ET
I concur in offering thanks for an excellent article. Based on my experience, I'd suggest a better title would be "Data Quality is Not Just Fitness for Use". If I may formalize your points (and maybe carry things a bit further), there are three overlapping criteria for data quality: (1) Adequate/Correct representation of the real-world things to be represented. (2) Correct values of attributes compared to real-world. (3) Appropriateness for intended use.

"Fitness of use" is an assessment that cuts across all three criteria, and which gives a framework within which to judge quality.

Let's use your mistake about billing address as an example. If your data model didn't include apartment or other subaddress elements, it would not correctly represent occupancy units within apartment or office complexes (a type 1 problem). If my record had an address number of "502" instead of the correct value of "500", the contents are wrong (a type 2 error). If you intend to mail a brochure to current customers, the data set is clearly relevant, but if you want to place door hangers it is not at all relevant (a type 3 error).

Posted by Mike W | Thursday, August 16 2012 at 2:45PM ET