Underlying every grand plan for customer relationship management (CRM) is a centralized customer database ­ one that consolidates information about each customer from sources throughout the company. This complete picture of each customer is the foundation for understanding what actions will get the greatest value from the relationship.

No one with meaningful CRM experience would minimize the effort needed to build such a database. In fact, most practitioners quickly acknowledge that it is by far the greatest technical challenge of a new CRM project. (Actually, this is not quite true. Vendors of integrated front-office systems often assume their standard operational databases will be the central customer repository. This breezy confidence typically crumbles when they are told they must integrate data from back-office operations and from the touchpoints ­ there are always a few ­ that run outside of the main front-office system.) Still, once the difficulty of building the database is solemnly noted, attention usually swings to more exciting tasks such as choosing vendors and fighting political battles. The nuts and bolts of customer data consolidation are set aside as problems to resolve during implementation. The unspoken assumption is the available tools all give roughly equivalent performance so there is no point in conducting a detailed assessment of them.

In an information technology (IT) industry where people have strong opinions regarding all conceivable ­ and a few incomprehensible ­ technical issues, this relaxed indifference is a remarkable anomaly. Though it's tempting to see it as evidence of hitherto unsuspected reserves of human rationality, it more likely reflects ignorance of the issue. Inept marketing by vendors who have failed to effectively distinguish their products could play a role as well.

In fact, there are significant differences among customer data consolidation tools and techniques. These relate to the very specific task of matching customer names and addresses ­ an esoteric process that is unfamiliar to most corporate IT groups, even if they have experience with the types of consolidation required for non-customer data. The difference is that most non-customer consolidation revolves around exact matches, conversion rules or translation tables. These processes are by no means trivial; researching, building and maintaining them is a major task in a company with many complex systems. However, they do ultimately produce a set of rules that unambiguously determine whether or not two records match (at least in most cases).

Name and address matches are inherently less certain. There is often no way to know when looking at two customer records whether or not they refer to the same person. The names are spelled similarly, but is it a real difference or a data entry error? City names differ within the same postal code, but is the code wrong or is one city name a colloquial variation or vanity address? The same name appears at two different addresses, but is it two people, one person who moved or one person with two addresses? Even a unique identifier such as a Social Security number can be misreported, miskeyed or just plain missing. As privacy concerns and regulations accumulate, identifiers such as telephone and Social Security numbers will be less available, so using them as a matching shortcut will be even less effective. As life becomes generally more complicated, people have more non-matching attributes. How many phone numbers do you have? How many e-mail addresses? Do you receive mail at a post office box for privacy or business reasons? Let's not even get started on households or business matching.

In short, there is no way to create a straightforward mechanical process for name and address matching. Systems to provide approximate matches do exist. In fact, there are three levels of such systems, each building on the foundation of its predecessors.

The most basic matching systems were built to identify (merge) and remove (purge) duplicate names on mailing files. Major vendors are Group 1 Software and Firstlogic (Postalsoft); other competitors include Sagent and SAS's DataFlux subsidiary. Low-end, PC-based alternatives are available from Mailer's Software and Peoplesmith.

What these systems essentially do is compare one record with another. It would be horribly ineffective to simply treat each record as one large string and compare the strings to each other because there are so many variations in how addresses may be formatted that legitimate matches would be rejected because the strings didn't line up. Therefore, merge/purge systems first split the input records into standard fields such as first name, last name, street, city and state. There are usually about a dozen such categories including title (Mr., Mrs.), generation (Jr., Sr., III), street type (St., Ave., Blvd.), directional (North, South) and apartment number.

Often the input record is already split into such fields ­ hopefully, the data was captured that way in the first place, as this is by far the most effective approach. If not, the merge/purge system will parse the record into components, looking for key words, standard formats (e.g., a five-digit string is probably a ZIP code), positions within the record (usually the name line comes first, then the street line, then the city/state/ZIP) and positions within each line (usually the first name comes before the last name). Parsing is not perfect, particularly when records in the same file have been entered in different formats (e.g., a mix of last name first and first name first) or when business addresses are involved, but most systems can correctly parse most records.

The second preparation step usually involves standardization. This mostly involves looking up words and their equivalents in huge translation tables. Some standardization changes nicknames and variations to a standard name, (e.g., Elizabeth, Liz, Beth and Betty are all changed to Elizabeth). Titles might also be standardized to change Mister to Mr. In addition, postal standardization is applied to ensure street and city names are spelled consistently and, when possible, to ensure that the postal code matches the rest of the address. This requires more than simple table translations. In fact, postal standardization involves complicated parsing, string matching and validation processes which are generally embedded in systems outside the merge/purge product. Postal standardization is sometimes run before the merge/purge process begins; the output would be a file in which the postal elements were parsed and standardized, although the merge/purge system would still parse and standardize the name line.

Once the data is parsed and standardized, the merge/purge system sorts it to bring together records that are likely to be matches. This avoids having to compare every record to every other record, which would be cost-prohibitive. Most systems generate a sort key based on components of selected elements ­ for example, the first three characters of the ZIP code, first three letters of the street name and house number. Some systems generate multiple keys and run through the file several times in the different sequences. This avoids missing matches because of a flaw in one element of the key, such as a bad ZIP code.

In older merge/purge systems, the matches were often identified by a match key that was essentially an extended version of the sort key. For example, it might have the first three characters of the ZIP code, first three letters of the street name, house number and first, third and fourth letters of the last name. Records with the same key were assumed to be actual matches. While very efficient from a processing viewpoint, this method is not very accurate ­ it rejects records because of minor differences and accepts records that are obviously different even though the keys were identical. Changing the composition of the key usually means trading false matches against missed matches without reaching a satisfactory level of both. This method is no longer used by major merge/purge products, although it still sometimes appears in less sophisticated systems.

Today's standard approach is to move through the sorted file comparing groups of records. One method is to compare all records within a certain distance of each other (e.g., up to 10 records away). Another is to compare all records within a "break group," which is set of records sharing key elements such as ZIP code and street name. The break group method is more flexible because it will look at all similar records even when there are many sharing a particular set of values. Some systems limit the size of the break group itself, in which case even adjacent records may not be compared if they fall on different sides of an arbitrary intra-group split.

The comparisons measure the degree of similarity between each pair of elements within the parsed records and then apply a business rule to determine whether this combination of similarities constitutes a match. Similarity is determined by comparing the strings of text within each element. Of course, exact equality is easy to find, so the challenge is to identify and rank near matches. Systems may calculate the number of characters that are the same, check for sequences that are the same, adjust for transpositions or even allow for letters that are adjacent on a typewriter keyboard. Some systems look for phonetic equivalents. Some preprocess the string by removing vowels or double letters. Some treat numbers differently. Different matching methods are applied to different field types; it makes no sense to apply phonetic matching to a ZIP code. Sometimes the user can define the algorithm that applies to each field; in other cases, the algorithm is predetermined. Generally, the algorithms produce a score that indicates the closeness of the match between the two strings. Sometimes the system automatically combines the element scores into a record-level score, and the only job of the user is to decide what record-level score will count as a match. In other cases, users specify what scores count as element matches and how these are combined to qualify for a match. For example, a rule may be that the system must match on last name, street address and at least five of eight other elements. Even systems that allow precise user control will provide default settings that the majority of users accept in practice.

The trial-and-error labor needed to fine-tune match settings is rarely worthwhile in the world of direct-mail list processing where many different lists are processed together and time is usually short. Companies building a customer database may be more willing to invest in improving long-term results by tailoring the system to the quirks of their own source files.

Once the matches are identified, a merge/purge system is designed to choose one record to keep and discard the others. In mailing list preparation, the selection often depends on which source would charge the least to use the record. Merge/purge systems generally let users specify list priorities and sometimes have distribution functions to randomly allocate matches across sets of source files. These systems also have standard reports to show duplication across input systems, again to help with direct mail analysis and list rental payments. Matching for in-house customer databases has a different set of concerns. Users are more likely to require tools to consolidate data from the different sources and to pick the best information where conflicting data appears. Such functions are not necessarily available in merge/purge systems. They are, however, important capabilities of the more advanced customer matching systems that will be discussed next month.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access