The past several columns have focused on name and address matching technologies for consumer data. However, many projects require matching of businesses: customers, suppliers, distributors or other sorts of business partners. Business matching faces all the challenges of consumer matching and some of its own.
The additional challenges fall into two main groups. The first relates to the complexity of business data. While consumer records usually have a relatively simple name and address, business records often also contain contacts, titles, departments, building names, mail stops and similar elements. This information is in different formats from system to system and, often, from record to record in the same system. The matching software (or, more strictly, the parser) must identify these elements correctly, or at least know which strings can safely be ignored. This requires the recognition of common words, patterns and relationship terms such as "DBA" for "doing business as" and "MS" for "mail stop." The standardization portion of the system should recognize these terms and other common words and abbreviations, such as "co," "corp" and "corporation," and place them in a consistent format. Standardization tables should also recognize and adjust alternate forms of company names such as "IBM" and "Intl Bus Machines" for "International Business Machines." Finally, matching routines should give less weight to words that are common to many businesses. For example, even though two of the three words in "Jones Marketing Corporation" and "Smith Marketing Corporation" are identical, these are almost certainly two distinct companies.
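A minimal Python sketch may make the standardization and weighting ideas concrete. The abbreviation table, alias table, weights and function names below are all invented for illustration and do not come from any particular matching product. A real system would use much larger tables and a more careful parser.

```python
import re

# Hypothetical standardization table: abbreviation -> consistent full form.
ABBREVIATIONS = {
    "co": "company",
    "corp": "corporation",
    "inc": "incorporated",
    "intl": "international",
    "bus": "business",
}

# Hypothetical alias table: alternate company names -> one standard name.
ALIASES = {
    "ibm": "international business machines",
}

# Hypothetical weights: words shared by many businesses count for less.
# Any word not listed here gets the full weight of 1.0.
COMMON_WORD_WEIGHTS = {
    "company": 0.1,
    "corporation": 0.1,
    "incorporated": 0.1,
    "marketing": 0.3,
}


def standardize(name):
    """Lowercase, strip punctuation, expand abbreviations, apply aliases."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    expanded = " ".join(ABBREVIATIONS.get(t, t) for t in tokens)
    expanded = ALIASES.get(expanded, expanded)
    return expanded.split()


def weighted_similarity(name_a, name_b):
    """Weighted token overlap: shared weight divided by total weight."""
    tokens_a = set(standardize(name_a))
    tokens_b = set(standardize(name_b))

    def weight(token):
        return COMMON_WORD_WEIGHTS.get(token, 1.0)

    shared = sum(weight(t) for t in tokens_a & tokens_b)
    total = sum(weight(t) for t in tokens_a | tokens_b)
    return shared / total if total else 0.0
```

Under these assumed tables, "IBM" and "Intl Bus Machines" standardize to the same tokens, while "Jones Marketing Corporation" and "Smith Marketing Corporation" score well below the 0.5 that an unweighted token overlap would give them, because the two shared words carry little weight.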
Daunting as these complexities may be, they are still handled with conventional text-analysis techniques: scanning for key words and patterns, inferring word types and relationships, and looking for similarities among individual data elements. Given specialized data tables, pattern sets, parsing rules and matching methods, these techniques identify records that contain variations of the same basic information reasonably well. Yet, they are helpless in the face of the second challenge facing business matching systems: the need to link records that are fundamentally different.
Many firms conduct business under multiple names or have subsidiaries with names unrelated to their parents. Businesses may also have different locations, and even the same physical location may be served by different street addresses and post office boxes. There is no way for any method based in text comparison to recognize that records with such differences are related.
Linking fundamentally different records requires a preexisting list of business locations and their relationships. This is a form of reference-based matching, which compares input records to a separate, comprehensive list of entities instead of comparing the input records to each other. For consumer records, reference-based matching provides a marginal, though significant, improvement over text-based methods. Its main benefit is linking records for people who have moved or changed their names as a result of marriage or divorce. However, business records have many more reasons for text-based matching to fail; therefore, reference-based matching is almost essential.
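The difference from text-based matching can be sketched in a few lines of Python. The reference list, entity IDs, company names and function names here are all made up for illustration. The point is only that each input record is looked up against the reference list, and two records are linked when they resolve to the same entity, however different their text.

```python
# Hypothetical reference list: every known (name, address) variant of a
# business maps to one entity ID. All names and IDs are invented.
REFERENCE = {
    ("acme widgets", "12 main st"): "E001",
    ("acme widgets", "po box 99"): "E001",
    ("zenith tools", "po box 99"): "E001",  # trade name of the same firm
    ("bolt supply", "40 oak ave"): "E002",
}


def normalize(text):
    """Very light cleanup: lowercase, drop periods, collapse whitespace."""
    return " ".join(text.lower().replace(".", "").split())


def reference_id(record):
    """Look the record up in the reference list; None if it is unknown."""
    key = (normalize(record["name"]), normalize(record["address"]))
    return REFERENCE.get(key)


def link_records(records):
    """Group record names by the reference entity each record resolves to."""
    groups = {}
    for record in records:
        groups.setdefault(reference_id(record), []).append(record["name"])
    return groups


records = [
    {"name": "Acme Widgets", "address": "12 Main St."},
    {"name": "Zenith Tools", "address": "P.O. Box 99"},
    {"name": "Bolt Supply", "address": "40 Oak Ave"},
    {"name": "Unknown Traders", "address": "1 Elm Rd"},
]
groups = link_records(records)
```

In this sketch, "Acme Widgets" and "Zenith Tools" land in the same group even though they share neither a name nor an address. Records that do not appear in the reference list fall under the key None and would still need conventional text-based matching.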
The key to reference-based matching is the reference database. For businesses, the primary source is D&B. D&B assigns each business location a unique identification number, called the D-U-N-S number (for Data Universal Numbering System). The firm also maintains a hierarchy of relationships among sites, storing the D-U-N-S number for several levels of corporate parents and whether the site is an independent location, branch or headquarters. This allows users to identify related businesses even when there is no obvious connection between the names on their database records. D&B can also store several trade names for each site in its database. The system does not store alternate addresses for the same site (for instance, a street address and a post office box). It will instead assign each address its own D-U-N-S number. These numbers would be linked through a common corporate parent.
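As a rough illustration of how such a hierarchy links sites, the Python sketch below models each site with an identifier, a site type, optional trade names and a pointer to its parent. This is a simplified, hypothetical structure, not D&B's actual record layout. The identifiers are invented placeholders rather than real D-U-N-S numbers, and a single parent pointer stands in for the several parent levels described above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Site:
    site_id: str                     # placeholder standing in for a D-U-N-S number
    name: str
    address: str
    site_type: str                   # "headquarters", "branch" or "independent"
    parent_id: Optional[str] = None  # next level up in the hierarchy, if any
    trade_names: List[str] = field(default_factory=list)


# Invented example data: one site at two addresses, each with its own ID,
# both linked to the same corporate parent.
SITES = {
    s.site_id: s
    for s in [
        Site("S-HQ", "Acme Holdings", "1 Tower Plaza", "headquarters"),
        Site("S-ST", "Acme Widgets", "12 Main St", "branch", "S-HQ",
             trade_names=["Zenith Tools"]),
        Site("S-PO", "Acme Widgets", "PO Box 99", "branch", "S-HQ"),
        Site("S-X", "Bolt Supply", "40 Oak Ave", "independent"),
    ]
}


def ancestors(site_id, sites):
    """Follow parent links upward; stop at the top, a gap, or a cycle."""
    chain = []
    current = sites[site_id].parent_id
    while current is not None and current in sites and current not in chain:
        chain.append(current)
        current = sites[current].parent_id
    return chain


def related(id_a, id_b, sites):
    """Two sites are related if their family trees share any member."""
    family_a = {id_a, *ancestors(id_a, sites)}
    family_b = {id_b, *ancestors(id_b, sites)}
    return bool(family_a & family_b)
```

With this data, the street address and the post office box are reported as related through their common parent, while the independent site is not related to either.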
D&B makes its reference database widely available, providing matching software and services itself and licensing the database to third parties. D&B encourages independent use of the D-U-N-S number, allowing firms to append it to their own files and use it to match against other firms' data. In fact, the D-U-N-S number has been adopted as a standard identifier by a number of trade groups, governments and international agencies. This contrasts sharply with current vendors of reference-based products for consumer matching, which severely limit the use of their standard IDs for business and privacy reasons.
Naturally, the D&B database does not solve all business matching problems. Because it works at the site level, it does not distinguish between specific departments at the same site or between specific individuals, nor can it track individuals as they change jobs and titles (although D&B does capture some executive names). In combination with conventional text-based matching techniques, aggressive data-quality efforts and modifications of key word and standardization tables to reflect specific industries, reference data allows users to deploy a reasonably effective solution to the challenge of business matching.