In his book Data Quality: The Accuracy Dimension, Jack Olson explained that data accuracy refers to whether data values are correct. To be correct, Olson argued, a data value must both be the right value and be represented in an unambiguous form, which is why he declared that the two characteristics of data accuracy are form and content.

Form

“Form is important because it eliminates ambiguities about the content,” Olson explained. Form dictates how a data value is represented, and Olson used his birth date (December 13, 1941) as an example of how you cannot always tell the representation from the value. If a database were expecting birth dates in the United States representation, a value of 12/13/1941 would be correct; 12/14/1941 would be inaccurate because it’s the wrong value; and 13/12/1941 would be inaccurate because it’s the wrong form, since it’s in the European representation, where the day is followed by the month.

In the case of February 5, 1944, the United States representation is 02/05/1944, whereas the European representation is 05/02/1944, which could be misunderstood as May 2, 1944. Because of this ambiguity, a user would not know whether a birth date was invalid or just erroneously represented. “A value is not accurate,” Olson explained, “if the user cannot tell what it is.”
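The ambiguity Olson describes is easy to demonstrate in code. This is a minimal sketch (not from Olson’s book) showing how the same string parses to two different dates depending on which representation the database assumes:

```python
from datetime import datetime

value = "05/02/1944"

# The same string yields two different dates depending on the assumed form.
us_date = datetime.strptime(value, "%m/%d/%Y")        # month first: May 2, 1944
european_date = datetime.strptime(value, "%d/%m/%Y")  # day first: February 5, 1944

print(us_date.date())        # 1944-05-02
print(european_date.date())  # 1944-02-05
```

Unless the expected form is recorded alongside the data, a reader cannot tell which of the two dates was intended, which is exactly Olson’s point that “a value is not accurate if the user cannot tell what it is.”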

Content

As for content, Olson explained that “two data values can be both correct and unambiguous yet still cause problems.” This is a common challenge with free-form text, such as a city name. “The data values ST Louis and Saint Louis may both refer to the same city, but the recordings are inconsistent, and thus at least one of them is inaccurate.” Consistency is a part of accuracy, according to Olson, because “inconsistent values cannot be accurately aggregated and compared. Since much of data usage involves comparisons and aggregations, inconsistencies create an opportunity for the inaccurate usage of data.”
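One common remedy for this kind of inconsistency is to normalize free-form values to a single canonical form before aggregating or comparing them. The sketch below assumes a small hand-built alias table for illustration; real systems would use much larger reference data:

```python
# Hypothetical alias table mapping variant spellings to one canonical form.
CITY_ALIASES = {
    "st louis": "Saint Louis",
    "st. louis": "Saint Louis",
    "saint louis": "Saint Louis",
}

def normalize_city(name: str) -> str:
    """Map known variant spellings to a canonical form; pass others through."""
    key = name.strip().lower()
    return CITY_ALIASES.get(key, name.strip())

print(normalize_city("ST Louis"))      # Saint Louis
print(normalize_city("Saint Louis"))   # Saint Louis
```

After normalization, both recordings aggregate and compare as the same city, removing the opportunity for inaccurate usage that Olson warns about.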

Validity versus Accuracy

“The definition of a value being valid,” Olson explained, “means simply that the value is in the collection of possible accurate values, and is represented in an unambiguous and consistent way. It means that the value has the potential to be accurate. It does not mean that it is accurate. To be accurate, it must also be the correct value.”

“Defining all values that are valid for a data element is useful because it allows invalid values to be easily spotted and rejected from the database. However, we often mistakenly think values are accurate because they are valid. For example, if a data element is used to store the color of a person’s eyes, a value of Truck is invalid. A value of Brown for my eye color would be valid but inaccurate, in that my real eye color is blue.”
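A simple domain check captures the distinction. This sketch uses an assumed set of eye colors (not from Olson’s book) to show that validity can be tested mechanically, while accuracy cannot:

```python
# Assumed domain of possible accurate values for this data element.
VALID_EYE_COLORS = {"Amber", "Blue", "Brown", "Gray", "Green", "Hazel"}

def is_valid(value: str) -> bool:
    """Valid means the value is in the collection of possible accurate values."""
    return value in VALID_EYE_COLORS

print(is_valid("Truck"))  # False: easily spotted and rejected
print(is_valid("Brown"))  # True: valid, but accurate only if it matches reality
```

The check can reject Truck, but it cannot tell whether Brown is the person’s actual eye color; confirming accuracy requires comparing the value against the real world.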

Can 100% Data Accuracy Be Achieved?

“The short answer is no,” Olson explained. “There will always be some amount of data in any database that is inaccurate. There may be no data that is invalid. However, as we have seen, being valid is not the same thing as being accurate.” Olson noted it’s rare that an application would demand 100% accurate data to satisfy its business requirements, which is why he maintained that “the long answer is yes. You can get accurate data to a degree that makes it highly useful for all intended requirements.”

This blog was originally posted at OCDQblog.com. Published with permission.