A few months ago, I was reviewing the latest data profiling results (thank you data profiling vendors!) for three new data sources I needed to integrate into my enterprise customer data integration (CDI) hub. When I reached the ubiquitous state code field, I instinctively cringed when I glanced at the column metadata reports. In two of the files, state code was defined as a two-byte field, just what I would expect. But a review of the frequency distribution report showed the first file had 64 distinct values, while the second file had 67 distinct values. The last file was in worse shape. The column length was defined at a whopping 18 bytes with more than 260 distinct values. Now, Americans have been portrayed in the media as less than stellar when it comes to geography, but 260 state codes is, as my wife likes to say, just ridiculous. A quick check at Wikipedia (after all, I fit the media's representation of geographically challenged) confirmed my suspicions. At most, there should be 54 values, 50 states and four commonwealths. This number assumes you're overloading the meaning of "state" in your data model, which, arguably, is not good database design. A better database design would model states and commonwealths as separate columns, but I digress.
While I am happy to continue developing data integration logic to fix incorrect or inconsistent data to a conformed code set, I find myself asking why is this a data management problem? Where did the data or process rules break down in the original capture and verification of the address data to cause such an overwhelming inconsistency in the operational data? My experience has shown that making changes to data values in the warehouse to get data consistency is a recipe for disaster. In the case of numeric changes, you can lose your ability to reconcile metrics to the source system. For character data, the business has to learn new code values and meanings. And to exacerbate things, the problem is never fixed. So whose problem is data quality anyway?
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access