A few months ago, I was reviewing the latest data profiling results (thank you, data profiling vendors!) for three new data sources I needed to integrate into my enterprise customer data integration (CDI) hub. When I reached the ubiquitous state code field, I instinctively cringed at the column metadata reports. In two of the files, state code was defined as a two-byte field, just what I would expect. But the frequency distribution report showed the first file had 64 distinct values, while the second had 67. The last file was in worse shape: the column length was defined as a whopping 18 bytes with more than 260 distinct values. Now, Americans have been portrayed in the media as less than stellar when it comes to geography, but more than 260 state codes is, as my wife likes to say, just ridiculous. A quick check of Wikipedia (after all, I fit the media's representation of the geographically challenged) confirmed my suspicions. At most, there should be 54 values: 50 states and four commonwealths. That number assumes you're overloading the meaning of "state" in your data model, which, arguably, is not good database design. A better database design would model states and commonwealths as separate columns, but I digress.
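
If you want to run this kind of check yourself without a commercial profiling tool, a few lines of Python will do. The sketch below reports the distinct values, the longest value and the most frequent entries in a state code column; the file name and column name are my own illustrative assumptions, not anything from the actual sources.

    import csv
    from collections import Counter

    # Hypothetical input file and column name -- substitute your own.
    SOURCE_FILE = "customer_addresses.csv"
    STATE_COLUMN = "state_code"

    counts = Counter()
    max_length = 0

    with open(SOURCE_FILE, newline="") as f:
        for row in csv.DictReader(f):
            value = (row.get(STATE_COLUMN) or "").strip()
            counts[value] += 1
            max_length = max(max_length, len(value))

    print(f"Distinct values: {len(counts)}")
    print(f"Longest value: {max_length} characters")
    print("Most common values:")
    for value, n in counts.most_common(10):
        print(f"  {value!r}: {n}")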

While I am happy to continue developing data integration logic that fixes incorrect or inconsistent data against a conformed code set, I find myself asking: why is this a data management problem? Where did the data or process rules break down in the original capture and verification of the address data to cause such overwhelming inconsistency in the operational data? My experience has shown that changing data values in the warehouse to achieve consistency is a recipe for disaster. With numeric changes, you can lose the ability to reconcile metrics back to the source system. With character data, the business has to learn new code values and meanings. And to make matters worse, the root problem is never fixed. So whose problem is data quality anyway?

My first thought goes to the data modeler who designed the online transaction processing (OLTP) database. Clearly, a domain table of state and commonwealth codes should have been defined to enforce a common code set; an 18-byte state code column is an egregious oversight. Today's savvy users are quick to identify holes in the edit checks performed by applications and database logic. One need only look at the quality and content of the venerable Social Security number or address line 3 fields to see the result of missing edits on screen fields. A text field on a screen with no edit checks is open season for abuse. If the field is long enough, you will start to see XML-ish text used to delimit the multiple concatenated attributes the business added while waiting for the IT department to release the next version of the application.
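
For the record, what I would have expected from the modeler looks something like the sketch below: a two-character column backed by a domain table of valid codes. I've expressed it in Python with SQLAlchemy purely for illustration; the table and column names are mine, not anyone's production schema.

    from sqlalchemy import (Column, ForeignKey, MetaData, String, Table,
                            create_engine)

    metadata = MetaData()

    # Domain table: one row per valid USPS state/commonwealth code.
    state_code = Table(
        "state_code", metadata,
        Column("code", String(2), primary_key=True),
        Column("name", String(50), nullable=False),
    )

    # Transactional table: state_code is declared as two characters and
    # constrained to the domain table by a foreign key.
    customer_address = Table(
        "customer_address", metadata,
        Column("address_id", String(36), primary_key=True),
        Column("line_1", String(100), nullable=False),
        Column("city", String(50), nullable=False),
        Column("state_code", String(2), ForeignKey("state_code.code"),
               nullable=False),
        Column("postal_code", String(10), nullable=False),
    )

    if __name__ == "__main__":
        # SQLite is used here only to demonstrate the DDL; any RDBMS works.
        engine = create_engine("sqlite:///:memory:")
        metadata.create_all(engine)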

My second thought goes to the application development team. There is often a big push by software development teams to remove any enforcement of business rules from the database management system. While this potentially absolves the data modeler of guilt, the fact remains that bad data made its way into the database. The application edit checks failed to recognize the 50 valid state codes or to provide any text standardization. For example, I can find multiple values for Texas, including TX, Tx and Texas. And as we all know by now, as long as data can be created, updated and deleted outside the application logic, business rules built into the application will be bypassed.
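
An application-side edit check doesn't have to be elaborate. Here is a minimal sketch that standardizes the obvious variants and rejects everything else; the code set is deliberately trimmed, and a real implementation would carry the complete USPS list.

    # Trimmed code set for illustration; a production edit check would load
    # the complete USPS list (50 states plus DC, commonwealths and territories).
    USPS_CODES = {"CA", "MI", "MN", "MO", "NY", "TX"}
    FULL_NAMES = {"california": "CA", "michigan": "MI", "minnesota": "MN",
                  "missouri": "MO", "new york": "NY", "texas": "TX"}


    def standardize_state(raw: str) -> str:
        """Return the two-letter USPS code, or raise if the value is invalid."""
        value = raw.strip()
        if value.upper() in USPS_CODES:      # handles TX, Tx, tx
            return value.upper()
        if value.lower() in FULL_NAMES:      # handles Texas, TEXAS, texas
            return FULL_NAMES[value.lower()]
        raise ValueError(f"Unrecognized state code: {raw!r}")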

My third thought goes to the quality assurance (QA) team. Depending on the size of the company, QA teams can be hit or miss. By that I mean larger companies tend to have dedicated QA teams to ensure the end product meets the original requirements. That's not to say data quality is better at large companies; one only needs to look at the price of data quality software to see whom the vendors are targeting. Most likely, the requirement that state code be only two bytes long and conform to the USPS standard was lost among hundreds of functional and nonfunctional requirements for the application. Without a specific requirement to test, a bad state code would pass QA with flying colors. More than likely, someone assumed everyone knew the 50 state codes and that writing validation code was a waste of time. After all, everyone knows the state abbreviations for Michigan, Minnesota and Missouri. (Don't feel bad if you have to check - I did.)
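
Had the requirement made it into the test plan, the validation itself would have been nearly trivial. The pytest-style sketch below exercises the standardize_state routine from the previous example; the module name is a placeholder I made up for illustration.

    import pytest

    # Hypothetical module holding the standardize_state() sketch shown earlier.
    from address_edits import standardize_state


    @pytest.mark.parametrize("raw, expected", [
        ("TX", "TX"),
        ("Tx", "TX"),
        ("texas", "TX"),
        ("MI", "MI"),
    ])
    def test_valid_values_conform_to_usps(raw, expected):
        code = standardize_state(raw)
        assert code == expected
        assert len(code) == 2


    @pytest.mark.parametrize("raw", ["", "XX", "Tejas", "18 bytes of junk"])
    def test_invalid_values_are_rejected(raw):
        with pytest.raises(ValueError):
            standardize_state(raw)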

My final thought goes to the business users. At a recent client engagement, one of the senior data architects said data quality at their company is an afterthought. The business had decided early on that data integrity issues such as bad address data and invalid personal identification attributes should not be treated as constraining business rules. These data items, while critical pieces of information, were not deemed important enough to validate or enforce conformity on at the point of origination. Bad information was captured and passed on to the next application, which assumed the first application had done its job. Voilà: bad data is now persistently stored in multiple data stores.

Which brings me back to the beginning. Here I sit, enforcing the business rules that the OLTP modeler omitted, the application team didn't implement, the QA team overlooked and the business decided weren't important enough. Data quality isn't just a data management problem; it's a company problem.
