What is the Importance of Accurate Name Matching?
The sheer volume of data inside a company, the multiple points of interaction a customer has with it (for different products and through different channels), and increasing regulation of data privacy and verification have made it significantly harder to extract complete information for each customer and to use that information for sound business decisions.
Companies have deployed several initiatives in data warehousing, customer relationship management (CRM) and customer data integration (CDI) applications to meet this challenge. However, many of these initiatives have suffered from the clichéd, yet very apt, "garbage in, garbage out" syndrome.1 The cause is the nonstandardized format of data residing on multiple platforms, combined with weak matching capabilities for linking these nonstandardized formats together.
Inadequate matching across nonstandardized formats results in incomplete or inaccurate extraction of customer information, which affects several business decisions. The accuracy of decisions related to up-sell, cross-sell, mailing, customer complaint handling, maintaining an integrated view of the customer, etc. suffers severely. Effective entity matching (and within it, name matching) has thus assumed significant importance in recent times.
Where Does Complexity in Name Matching Come From?
Given the importance of name matching in several business decisions, one would presume that organizations have invested heavily in tools and techniques for name matching and sorted it out. Right? In truth, companies have invested in tools (either as part of a CRM implementation or of customer data integration/business intelligence initiatives) to do name matching (and, in the process, name standardization). However, these tools struggle to handle the wide variety of naming conventions represented in modern databases, which reflect the variations in the composition and structure of diverse economies.
If one were to look at the English-speaking world alone, the predominant model for a name is a given name, an optional middle name and a surname. Here the challenge is mainly to handle variant forms such as Anthony Brock versus Tony Brock. However, names from the non-English-speaking world introduce several variations that pose major challenges for a name matching tool. Some of the common issues that arise with names from around the world are:
- In China, as in other East Asian countries such as Korea, the surname comes first, before the given name. Some people maintain this format in Western contexts, while others reverse the name order to fit the Western model. The problem is compounded further if a Western given name is added, e.g., Yi Kyung Hee ~ Kyung Hee Yi versus Kathy Yi Kyung Hee versus Yi Kathy Kyung Hee versus Kathy Kyung Hee Yi.
- In India, multiple models of writing names exist. In some parts, the first name comes first, while in other parts, the second name comes first. There are further variations when people add the bearer's lineage or even place of birth. In addition, complexities arise when people use abbreviations instead of complete names, e.g., Sankaran Karthik versus Karthik Sankar versus S. Karthik.
- Spelling conventions are also culture-specific. In Arabic names, for example, the letters K and Q can be used interchangeably; Qadafi and Kadafi are variants of the same name. This is not the case in Chinese transcriptions, where Kuan and Quan are most likely to be entirely different names.
- Names may contain various kinds of affixes, which may be added to the rest of the name, separated from it by white space or hyphens, or dropped altogether, e.g., Abdalsharif ~ Abd al-Sharif ~ Abd-Al-Sharif ~ Abdal Sharif; al-Qaddafi ~ Qaddafi.
Name matching algorithms tend to take a one-size-fits-all approach, either by underestimating the effects of cultural variation or by assuming that names in any particular data source will be homogeneous. Such an approach gives reasonable results for names that fit one model but performs poorly with names that follow different conventions.
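To make two of the variations listed above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the helper names `order_insensitive_key` and `affix_insensitive_key` are hypothetical and do not come from any tool discussed in this article, and the rules are deliberately naive.

```python
import re


def order_insensitive_key(name: str) -> tuple:
    """Lowercase the name and sort its tokens so that word order is ignored.

    Hypothetical helper for illustration only.
    """
    return tuple(sorted(name.lower().split()))


def affix_insensitive_key(name: str) -> str:
    """Lowercase the name and drop white space and hyphens, so that
    spacing and hyphenation variants of an affix collapse to one string.

    Hypothetical helper for illustration only.
    """
    return re.sub(r"[\s\-]+", "", name.lower())


# Reordered surname and given name produce the same key.
same_order = order_insensitive_key("Yi Kyung Hee") == order_insensitive_key("Kyung Hee Yi")

# The four affix variants from the list above collapse to a single key.
affix_variants = ["Abdalsharif", "Abd al-Sharif", "Abd-Al-Sharif", "Abdal Sharif"]
affix_keys = {affix_insensitive_key(variant) for variant in affix_variants}
```

Even this small sketch shows why a single rule set is not enough: the keys still differ once a Western given name is added (Kathy Yi Kyung Hee) or an affix is dropped altogether (al-Qaddafi versus Qaddafi), so each convention would need its own handling.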
Let's start with the fundamentals. Apart from standardization (nickname recognition, etc.), name matching algorithms are primarily based on two principles:
- Compare the spelling of the query and the candidates as strings. The algorithms calculate the number of single-character additions, deletions, transpositions and replacements required to transform a candidate into the query. Each of these operations carries a cost; a lower total cost for a candidate implies a better match with the query.
- Compare the way the query and the candidate sound. The algorithms identify the sound of the query and retrieve candidates with a similar sound. These phonetic algorithms are the more widely used for name matching. One of the best-known algorithms of this kind is Soundex.2 It uses codes based on the sound of each letter to translate a string into a canonical form of at most four characters, preserving the first letter.
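The first principle above, string comparison by edit operations, can be sketched in Python as follows. This is one possible reading, not a definitive implementation: it assumes a unit cost for every operation, compares names case-insensitively, and uses the optimal string alignment variant of edit distance so that a swap of two adjacent characters counts as one transposition. The function name `edit_cost` is hypothetical.

```python
def edit_cost(candidate: str, query: str) -> int:
    """Cost of turning `candidate` into `query` using single-character
    additions, deletions, replacements and adjacent transpositions.

    Assumption: every operation costs 1 (optimal string alignment distance).
    """
    candidate, query = candidate.lower(), query.lower()
    rows, cols = len(candidate) + 1, len(query) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every character of the candidate prefix
    for j in range(cols):
        dist[0][j] = j  # add every character of the query prefix
    for i in range(1, rows):
        for j in range(1, cols):
            replace = 0 if candidate[i - 1] == query[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,            # deletion
                dist[i][j - 1] + 1,            # addition
                dist[i - 1][j - 1] + replace,  # replacement (or exact match)
            )
            if (
                i > 1
                and j > 1
                and candidate[i - 1] == query[j - 2]
                and candidate[i - 2] == query[j - 1]
            ):
                # transposition of two adjacent characters
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[rows - 1][cols - 1]


# Rank candidates for a query: a lower cost implies a better match.
query = "Qadafi"
candidates = ["Kadafi", "Qaddafi", "Kuan"]
ranked = sorted(candidates, key=lambda name: edit_cost(name, query))
```

With unit costs, Kadafi and Qaddafi are each one operation away from Qadafi, while Kuan is further away and ranks last. A production matcher would more likely weight operations differently per culture, for example by making a K/Q replacement cheap for Arabic names but not for Chinese ones.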
The challenge with both principles is exactly as described in the previous section: in their standard form, they do not take cultural variations into account. For example, Soundex has the following shortcomings:
- It maps dissimilar-sounding long names to the same code when their first few letters are the same, because the code is cut off after four characters. This results in inaccurate matching of candidates with the query.
- It removes vowels while generating the code for a name, so the result depends entirely on the consonants. This renders it ineffective for naming conventions where consonants are not reliable differentiators. A good example is Lee and Leigh: Lee is given a code of L000 while Leigh gets a code of L200 (the silent g is still coded), even though the two are phonetically the same.
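- Further, it has no tolerance for random typos: a mistyped first letter, or a consonant replaced by one from a different sound group, produces a different code.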
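The shortcomings above can be reproduced with a short Python sketch of the commonly described American Soundex rules. It is an illustration rather than a reference implementation: the function name `soundex` and the table name `SOUNDEX_CODES` are chosen here, the code is padded with zeros to four characters, and the name pairs other than Lee and Leigh are examples picked for this sketch.

```python
# Letter groups that share a Soundex digit; vowels, h, w and y get no digit.
SOUNDEX_CODES = {}
for group, digit in (
    ("bfpv", "1"),
    ("cgjkqsxz", "2"),
    ("dt", "3"),
    ("l", "4"),
    ("mn", "5"),
    ("r", "6"),
):
    for letter in group:
        SOUNDEX_CODES[letter] = digit


def soundex(name: str) -> str:
    """Return a four-character Soundex code: first letter plus three digits."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != previous:
            code += digit  # new consonant sound
        if ch not in "hw":
            previous = digit  # vowels reset the run; h and w do not
    return (code + "000")[:4]  # pad with zeros, then cut to four characters


# Vowels are ignored, yet the silent g in Leigh is still coded.
lee_code, leigh_code = soundex("Lee"), soundex("Leigh")

# Long names that share their first letters collapse to one code.
long_name_codes = {soundex("Christopher"), soundex("Christensen")}

# A typo in the first letter changes the code.
typo_changes_code = soundex("Karthik") != soundex("Jarthik")
```

Because the first letter is preserved, the sketch also assigns different codes to Qadafi and Kadafi, so the Arabic K/Q variation described earlier is missed as well.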