We were recently asked to evaluate a large data set for data value anomalies as part of an overall data quality assessment that we hoped would establish the business case for senior management's investment in a data quality program. The particular data set we were examining contained a table with address information, and the model reflected a typical instantiation of an address table, with fields named AddrLn1, AddrLn2, City, State, ZipCode, ZipFour and Country. The client's data model had been purchased within the past 10 years.
This table was used to generate addresses for mail correspondence between my client and their customers. As can be expected, one of the more relevant anomalies focused on the difference between a mailing address that could have been composed from the corresponding data elements and the standard form for composing addresses specified by the U.S. Postal Service (USPS). While I am usually adamant about the fact that "data quality is more than just names and addresses," I do have to point out that there is a relevant data quality problem embedded within this form of table, and it is not address standardization. Instead, it is that the use of two fields to capture a mailing address allows data entry personnel to input information that can subsequently get "lost" between those two fields.
Let's take a step back. There are three particular USPS format standards that are frequently violated. The first is the suggestion that alternate location information such as apartment numbers, floor numbers or building identifiers should be placed above the delivery address line. For example, an address should have the apartment information above the street name, as in this example:
Mr. John Doe
100 Main St
Anytown, MA, 06554
The second formatting standard reflects the difference between the physical location (e.g., a street name and number) and a mail delivery address, such as a PO box. If there are different values for these, the delivery address is the one to be used for sending mail and should appear above the city and state in the actual address. The third standard concerns the "attention line," which refers to a specific person or department within the designated recipient. This should appear above the recipient line in the actual address.
The fact that there are two lines in which a mailing address is to be formatted when projected onto an envelope is probably what drives data model designers to make use of "AddressLine1" and "AddressLine2" to capture addresses. However, quite frequently it is not clear where to put the right bits of information. Typically, we would expect that if we only have recipient line information, it would be deposited into AddressLine1, and any auxiliary information would go into AddressLine2. However, if we were to project the correct formatting back into the data model, a single recipient line would actually go into AddressLine2 and AddressLine1 would be left blank. Because most people are not familiar with the USPS formatting standard, it is basically a crap shoot whether the auxiliary information is dropped into AddressLine1 or AddressLine2. The result of this confusion is that sometimes we end up with different records referring to the same street address. In one record, the street address is in AddressLine1; and in the other, the street address is in AddressLine2. This eliminates the ability to easily determine if the same address is duplicated within the table.
The last issue (the one that was most bothersome) had to do with the confusion surrounding the "attention line." What often happened was that a person's name appeared after the prefix "ATTN:" within one of the AddressLine fields; consequently, individual or departmental party names were showing up inside address fields. The result of this is that what could be important party identity and party relationship information was embedded within location information and, therefore, was effectively missing inside the data.
This is a good example of how misguided data modeling may increase the entropy within a set of information. The fact that this example came from a purchased data model is a bit startling because it means that my client is not alone in this situation. It is interesting to see how our expectations about the representation of information have changed over the 10 years of this model's existence. Back then, the goal was to be able to model an operational business need (managing mailing addresses), while today we are more interested in exposing the knowledge embedded within that information, perhaps to feed a business intelligence or customer relationship management (CRM) initiative. Upcoming columns will explore how to extract precious party and relationship information from legacy address or location tables.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access