JAN 22, 2013 8:55am ET

Rethinking the Dimensions of Data Quality

A few months ago, I wrote a column asking if the dimensions of data quality, such as accuracy, consistency and timeliness, are real. I pointed out that there are no generally accepted definitions for the dimensions, no generally accepted exhaustive list of them and no generally accepted methodologies for measuring each one.

Since the column was published, I have been "encouraged" to say something a little more positive on this topic – something that will help practitioners deal with the daunting problems of data quality. I agree that being negative is not that helpful, although it is refreshing to have a frank conversation about what really underlies terms that are often thrown about in our industry.

The Road to Abstraction

One argument in favor of having dimensions of data quality is that data quality is such an enormous space that we cannot deal with it effectively unless we break it down into subareas. I think it is better to say that data quality represents a large, complex set of issues and that we need to tease out individual types of issues, each with its own specific problems and requiring its own specific methods to deal with it. This is more of a bottom-up approach.

However, it seems to me that the first view prevails. The top-down approach is repeatedly taken, with attempts made to break data quality up into different dimensions. Here I am talking from experience: I have seen this approach first-hand at a number of conferences and industry initiatives.

This top-down approach focuses on abstractions, and we need to understand abstractions to appreciate what is going on. There are many different classes of abstraction, but the one involved here is the process of turning a property into an object. This has been a source of argument among philosophers going back to Plato, who believed that such abstractions really exist in some part of the universe, just as much as material objects do.

This form of abstraction is illustrated in the following example. Imagine I am represented by a customer record in a database of Enterprise X. This record holds my date of birth. If the value is my actual date of birth, we can agree it is completely accurate. If the month and year of this value are correct, but not the day, then we can say it is reasonably accurate. If the year is correct, but not the month and day, we can say it is moderately accurate. If day, month and year are incorrect, we can say it is not accurate at all. We are using the term "accurate" to describe the quality of the relationship between the data value and the reality it is trying to represent. 
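The grading in this example can be written down as a short sketch. It is only an illustration: the function name `date_accuracy` is hypothetical, and the text does not say how to grade a value whose year is wrong but whose month or day happens to match, so the sketch assumes any wrong year counts as "not accurate at all."

```python
from datetime import date


def date_accuracy(recorded: date, actual: date) -> str:
    """Grade a recorded date of birth against the actual one.

    The four grades follow the example in the text. Assumption: any value
    with the wrong year is graded "not accurate at all", even if the month
    or day happens to match.
    """
    if recorded == actual:
        return "completely accurate"
    if recorded.year == actual.year and recorded.month == actual.month:
        return "reasonably accurate"  # only the day is wrong
    if recorded.year == actual.year:
        return "moderately accurate"  # only the year is right
    return "not accurate at all"


actual_dob = date(1960, 5, 17)  # made-up date, for illustration only
print(date_accuracy(date(1960, 5, 2), actual_dob))
```

Note that the function returns a description of the relationship between a value and the reality it represents. It describes a property of the value rather than producing an object called "accuracy."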

Human beings then make the leap from using "accurate" as an adjective to using "accuracy" as a noun. Our language allows us to do that, but that does not mean reality has to go along with us. We have created a type of abstraction and this type of abstraction (a) is not instantiated, (b) does not bear properties and (c) cannot enter into causal relationships. The concept "dog" is instantiated in my pet Leo, who weighs about eight pounds and knows perfectly well how to manipulate me into feeding him. The concept "accuracy" is not instantiated anywhere. We do not see "accuracies" lying around in the universe, or having attributes like color or weight, and "accuracy" does not enter into causal relationships. We can say that the birth date example above "represents an instance of accuracy," but this is a bewitchment of our language. Just because we can say it does not make it so. It's better to say the example "has the property of being accurate."

But We Need the Dimensions of Data Quality

So far, this is still of little help to the practitioner. We know that data quality is large and complex, and it is our duty to improve it as much as we can for the enterprises we work for.

Here, I think the bottom-up approach is better. This does not view a dimension like "accuracy" as an abstracted object with a single definition. Rather, each dimension is a complex area that has its own structure, problems and methods.

If I think about accuracy, I can ask a series of questions, such as:

  • Is the thing being represented covered by the definition used for the entity/table in the database?
  • Is the attribute of the thing being represented covered by the definition used for the attribute/column in the database?
  • Does the thing represented by a record in the table in the database actually exist?
  • Does the value held in the column of the record in the table in the database objectively represent the expression of the attribute of the thing?
  • Does the value held in the column of the record in the table in the database subjectively represent the expression of the attribute of the thing to the extent needed to meet business requirements?

There are very likely even more questions pertaining to accuracy that can be asked. This shows that accuracy is a complex of many concepts, not a single concept. If you object that each question needs to be broken out as a different dimension, then you are going to end up with an awful lot of dimensions, as I have many such questions for each of the traditional dimensions. The traditional dimensions give the illusion that each is only answering one question, because somehow each has a single definition.

The questions cover the structure and problems of accuracy. Methods are another facet that needs to be addressed. Some methods include:

  • Testing the entity/table definitions with data producers to see if they correctly classify instances to the entity/table.
  • Checking that the attribute/column has not been deliberately repurposed by operational staff to hold something other than what the official definition describes.
  • Sampling records and auditing that these represent real-world instances.
  • Sampling data values and independently measuring the attributes of the things they represent.
  • Sampling data values, independently measuring the attributes of the things they represent and comparing these to the tolerances allowed in each business use case.
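The last of these methods, sampling values and comparing them with a business tolerance, might look something like the sketch below. Everything named here is an assumption made for illustration: the record layout (a dict with a numeric `value`), the `measure` callback standing in for an independent real-world measurement, and the function name `share_within_tolerance`.

```python
import random
from typing import Callable, Sequence


def share_within_tolerance(
    records: Sequence[dict],
    measure: Callable[[dict], float],
    tolerance: float,
    sample_size: int,
    seed: int = 0,
) -> float:
    """Return the fraction of sampled records whose stored value lies
    within `tolerance` of an independent measurement.

    Hypothetical layout: each record is a dict with a numeric "value".
    `measure` stands in for re-measuring the real-world attribute.
    """
    if not records or sample_size < 1:
        raise ValueError("need at least one record and a positive sample size")
    rng = random.Random(seed)  # fixed seed so the audit sample is repeatable
    sample = rng.sample(list(records), min(sample_size, len(records)))
    within = sum(
        1 for rec in sample if abs(rec["value"] - measure(rec)) <= tolerance
    )
    return within / len(sample)


# Made-up data: stored values are each 0.4 below the "true" measurement.
records = [{"id": i, "value": float(i)} for i in range(10)]
remeasure = lambda rec: rec["id"] + 0.4
print(share_within_tolerance(records, remeasure, tolerance=0.5, sample_size=5))
```

The same stored values pass under one tolerance and fail under a tighter one, which is the point of the final question in the earlier list: whether a value is accurate enough depends on the business use case.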

Comments (7)
Great post Malcolm. Something most organizations should be thinking about - data quality beyond merely accuracy and completeness, and how to assess all these dimensions. So it should come as no surprise to you that Gartner's leading data quality expert Ted Friedman and I developed a comprehensive toolkit for assessing over a dozen different data quality dimensions, including how to quantify them and track their improvement/degradation over time. Here is an overview, with full access for Gartner clients: http://www.gartner.com/resId=2171520 . --Doug Laney, VP Research, Gartner, @doug_laney
Posted by Douglas L | Tuesday, January 22 2013 at 11:25AM ET
Hi Malcolm; Interesting post, but I fear you may have muddied the waters...What DQ practitioners need to begin thinking about is what I call elements, associations and narratives.

Your date example is a good place to start. A date element can either be valid or not. 01-31-13 is valid; 04-31-13 is not. All of the elements used in an enterprise setting - regardless of their use in associations - have to pass that test. "John O'Gorman" is a valid element as well. BTW, the element level makes no distinction between entities, attributes or properties, etc. any more than a periodic table makes a distinction between the Carbon in flour and the Carbon in graphite.

Associations must pass the same test. Elements may be valid, but their association may not. If my birthday is January 13, 1913 then the association between "John O'Gorman" and 01-13-13 is accurate. Any other value in the Date of Birth field - with the notable exception of equivalent (Jan-13-1913) values - is not accurate.

Finally, the information I can put together in an accurate extension of associations must be directly derived from the 'facts': "As of January 10th John O'Gorman is a nonagenarian within a few days of his one hundredth birthday. As a valued customer we should send him a box of Cuban cigars to honour his accomplishment."

Abstracting things like 'accuracy' can only be done in the context of the value stream, with 'value' being assessed at every step.

Posted by John O | Tuesday, January 22 2013 at 1:59PM ET