How to know when data is 'right' for its purpose
The concept of measuring and evaluating the quality of data has been part of IT's portfolio for as long as data has been captured and stored. As the capability has matured, it has evolved and improved, allowing for distinct measurement and evaluation along specific, discrete categories: accuracy, completeness, consistency, integrity, timeliness, uniqueness, and validity.
While each category can present its own challenges in any organization, accuracy seems to be the most common. The business user will invariably ask the simplest question of all: “Is my data right?”
Merriam-Webster defines accuracy as follows:
1 : freedom from mistake or error : correctness ("checked the novel for historical accuracy")
2a : conformity to truth or to a standard or model : exactness ("impossible to determine with accuracy the number of casualties")
2b : degree of conformity of a measure to a standard or a true value
Based on the second half of the definition, accuracy is measurable provided there is a set of expected values available. This is readily applicable to data elements driven by reference data (country codes, state codes, zip codes, product codes, etc.) or by cross-validation with other system data (customers can be cross-referenced to a CRM, for example). This applies only to the individual element, however. I have seen cases where each element is accurate on its own (city, state, and country, for example) but the combination is decidedly inaccurate (Buffalo, NY, Canada).
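The element-versus-combination distinction can be sketched in a few lines. This is a minimal illustration, not a production validator; the reference sets and the state-to-country mapping below are illustrative stand-ins for a real reference-data source.

```python
# Illustrative reference data (stand-ins for a governed reference-data source).
VALID_STATES = {"NY", "TX", "ON"}
VALID_COUNTRIES = {"US", "CA"}
# Which country each state/province actually belongs to.
STATE_COUNTRY = {"NY": "US", "TX": "US", "ON": "CA"}

def check_record(city: str, state: str, country: str) -> list[str]:
    """Return a list of accuracy issues found in one address record."""
    issues = []
    if state not in VALID_STATES:
        issues.append(f"unknown state: {state}")
    if country not in VALID_COUNTRIES:
        issues.append(f"unknown country: {country}")
    # Each element can be individually valid while the combination is not.
    if not issues and STATE_COUNTRY.get(state) != country:
        issues.append(f"state/country mismatch: {state}, {country}")
    return issues

print(check_record("Buffalo", "NY", "US"))  # [] -- every element and the combination check out
print(check_record("Buffalo", "NY", "CA"))  # ['state/country mismatch: NY, CA']
```

Both records pass the element-level checks; only the combination check catches the second one, which is exactly the "Buffalo, NY, Canada" problem.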
However, the first portion of the definition can create an impossible situation for IT. We can validate that a customer address is formatted correctly, that it is "mailable" per the local postal authority, and that the street names, numbers, etc. are current. What IT cannot do is confirm or deny that the customer actually lives there or receives physical correspondence at that address.
Another example of this would be customer name. For any business, the customer name is simply what has been captured in any given system. The accuracy of the value depends on the method of entry. If it is entered manually, there is always a risk of incorrect spelling or duplication. If it is entered systematically, you inherit the risk from the source system.
In this instance, accuracy can only be measured within the system itself, and in most cases there is no way to systematically validate that a person's name is correct. Is the source used to compare the name considered "more accurate"? In these days of ever-increasing privacy regulations, what systematic option would even be available outside of your environment?
All is not lost, however. There are certainly scenarios where IT can answer the "right data" question with a confident yes or no and with only the most minor qualification: metrics and calculations, because there is always a right answer when math is involved. The qualification is that IT has the correct definition and, of course, that the underlying data has been populated consistently (note, I did not say "correctly").
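A quick sketch of that idea: once a metric has an agreed definition, its accuracy can be verified by recomputing it from the underlying data. The field names and figures here are purely illustrative.

```python
# Illustrative order lines; in practice this would come from the source system.
order_lines = [
    {"qty": 2, "unit_price": 10.0},
    {"qty": 1, "unit_price": 5.5},
]

def total_revenue(lines) -> float:
    # The agreed-upon definition: sum of quantity times unit price.
    # If this definition is correct and the data is populated consistently,
    # the result is verifiably "right".
    return sum(line["qty"] * line["unit_price"] for line in lines)

reported_revenue = 25.5  # value shown on a report or dashboard
assert abs(total_revenue(order_lines) - reported_revenue) < 1e-9
```

If the recomputed value and the reported value disagree, either the definition or the underlying data is at fault, and that is a question IT can investigate directly.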
Another way of working through this challenge is to clarify the expectation of the business user. Asking a few more questions to ascertain the true need, and the reason behind the question, can help frame the answer tremendously.
Is the question based on previous instances of "bad" data? Again, "bad" data is relative and is always judged from the perspective of the business user. If so, framing the response to highlight improvements in the consistency and validation of the source data may reassure the user and meet their needs. Maybe the question relates to reference data that had not previously been governed or monitored. If so, walking through the steps taken to evaluate validity against a set of expected values (and the source of those values) will start to build confidence in the final product.
Finally, it may simply need to be a conversation about whether data is “fit for purpose.” If marketing is looking at market penetration at the state or country level, does the street address truly need to be 100% correct? Or is the requirement that the combination of city, state and country be accurate? If that is the case, how should IT treat the combinations that do not conform to the local postal authority? Which element is the most accurate based on source profiling?
As usual, the key to evaluating the accuracy of the data is more about understanding the eventual use of the data than any arbitrary or independent measure.