Continue in 2 seconds

The Common Problem – Working with Merge/Purge and Householding

Published
  • July 08 2004, 1:00am EDT

Of all the issues related to accumulating and identifying unique data during a data warehouse implementation, perhaps the single most difficult area to control centers around data quality issues. And perhaps the most difficult task that must be accomplished once the data is cleansed and deemed "good" centers around the topic of how to do merge/purge or householding. Householding consists of grouping like data from multiple sources. It is the act of uniquely identifying a set of data records from one source system, such as order entry, and how they are related to another set of data records from the company's order fulfillment system. Householding is perhaps the single most major effort during the ETL process, and it is a task that many companies will underestimate. It takes significant time and effort to accomplish, and it takes even more effort to insure that it is done correctly.

Householding is the process of identifying related information from multiple sources and storing it in the data warehouse such that all related information about a given party is accessible when looking at any information about the identified party. A key point is that householding does not simply refer to the matching up of data about a family. The need for householding began with a requirement for clear and concise marketing data. As such, the goal was to identify all individuals that reside in the same physical residence, hence the name householding. A marketer could now send one piece of mail to the household, which saved considerable dollars in postage and mail preparation costs. Today, householding refers to any grouping of information about a given person, family, household or company. The need today for accurate identification of the same person, family, household or company across all systems is significantly greater than when householding was first identified as a required business rule for data migration. Today's marketing programs target people at many levels - as an individual, as head of household or as a member of a given group of people (i.e., club, residence and/or business).

When first considering the effort to perform householding, it is easy to assume the effort will be fairly insignificant. Nothing could be further from the truth. Before getting into specific issues for householding, let's take a quick look at the checklist of necessary tasks for any form of householding. First, all data in the data warehouse must be scrubbed and cleansed. Secondly, you must come up with standardization rules for matching, including exact data attributes to be used for matching. Third, run the match logic on the incoming records and check the data in the warehouse. Sounds easy, right? For the purposes of this column, we'll assume all data is clean.

Let's look specifically at householding in terms of grouping data from multiple sources about a given individual. Everyone has a social security number (SSN), and every SSN is a unique identifier. Therefore, match people by SSN and you are done. It would be except that historically very few systems capture and store SSN for a person. And those that do rarely validate the SSN against any official sources. When householding a given person's data from many sources, it is clear that the effort is much greater than first anticipated. Unless, of course, you are using data sources that have all been written in the past five years or so which have all been standardized to require the entry of SSN. Unfortunately, it is probably safe to assume a more likely scenario. While some source systems carry the SSN, because none of the records are standardized, we have to dig a little deeper for the additional identifying information.

Other common data elements are: surname, given name(s), other ID numbers, address fields, phone number(s) and employer's name. We include ID number in the list as some companies may have had the foresight to assign a unique identifier to a person's data records and carried this number forward to all systems. It is unlikely to be any more reliable than SSN, but it is worth investigating. Can anyone identify the first concern with the list of candidate matching elements? Remember, we said to assume all the data was clean. But, a clean U.S. address is simply one that follows valid address form - number, street, apartment, city, state and ZIP. Likewise, a valid U.S. phone number contains 11 digits - country code (1), area code, number. Further, it may include an optional extension that is often stored in the same database column as the number itself. The issue is one of name and access standardization.

Standardization must occur at every candidate data element, including name. Look at our names - Greg Mancuso and Al Moreno. Every day we receive letters addressed to Greg Mancuso, Gregory Mancuso, G Mancuso, Al Moreno, Albert Moreno and A Moreno. When one compares Greg Mancuso and Gregory Mancuso, will these two names be considered a match? More often than not, an automated routine will say "no," but a human being looking at the same names will likely say, "Maybe, but I need more information." The key point here is that business rules must be set that will guide the householding routines on what is a match and what is not.

The same holds true for address fields. In many ways, addresses are more difficult to standardize. The hardest part about standardizing addresses is to ensure the elements are uniquely identifiable. That is, the address is not stored in a 100-byte freeform text field in the database. Once the address elements - there is an attribute for house number, for street name, for city, state and ZIP - are stored in discrete fields, there are a variety of ways to ensure the addresses are standardized.

Now, assuming all candidate elements are cleansed and standardized it becomes a much easier task to identify matching elements. Business rules may be put in place to guide the logic and determine how closely all elements must match for the records to be considered a match. For example, in a householding routine with a business rule of 100 percent match or there is no match, the following two records would not match: Greg Mancuso of 123 Main St., City, ST 12345-1234 and Gregory Mancuso of 123 Main St., City, ST 12345-1234. On the other hand, a business rule of last name and address must match or no match would identify the above records as a match and merge the detail information into the same person in the target data warehouse

From this simple example it becomes clear that householding is a difficult task that must be dealt with during the ETL process. It is also a necessary task. The cost savings to a marketing group alone will likely pay for the effort with the first couple of mail campaigns, especially given the cost of postage and preparation of mail pieces that are sent to prospects today. But, it is not just marketing that benefits - all groups within an organization stand to gain. Customer service is better able to get a complete 360-degree view of their customer base thus providing an improved CRM prospective; sales can better understand the true cost of sale and cost of acquisition (how many pieces, how many phone calls, etc.); and management can get a clear understand of the key financial costs related to the various groups when looking at cost of service, cost of sale, cost of product by prospect, cost of product by customer, etc.

While this column deals with some basic rules of householding, future columns will discuss some alternatives for performing householding and delve further into the issues surrounding householding activities.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access