Although I have been saying for years that data quality is not all about names and addresses, I don't want people to think that name and address quality isn't a big part of the data quality process. On the contrary, any organization that needs to deal with customers is bound to have problems with the contact information for those customers. There are two growing trends that warrant a closer look at name and address quality: enterprise information integration (EII) and customer data integration (CDI).

Both of these techniques focus on consolidating information from across the enterprise so that each entity resolves to a single view. The EII approach provides the means for accessing disparate data sources in place, presenting an integrated view of data without actually moving it. Alternatively, the goal of CDI (and other master data management approaches) is to collect and aggregate entity detail (be it customer or other type of reference object) into a single repository as a single source of truth. Even though these might be "competing" approaches, they are similar in their need for name and address parsing and standardization in order to provide that single view, whether it is a repository-based view or a virtual view.

In fact, in these kinds of environments, one might consider that name and address quality becomes the most critical component of the process. If one of the goals is to provide a unified view of the individuals whose information is recorded in various databases across the enterprise, then the inability to recognize and resolve aliases and variations into a single entity will ultimately defeat the purpose.

We clearly need to have name and address parsing and cleansing as part of our EII or CDI processing. The challenge is this: traditionally, name and address cleansing has been seen as a batch process followed by a series of interactive review sessions. The data is extracted, the data sets are compared to each other and the result is trisected into those records that are definitely matches, those that definitely don't match any others and questionable matches that require manual review. However, the synchronization and potential real-time demands of applications that rely on an EII or CDI platform will not tolerate mountains of record pairs destined for the analyst's screen. Yet those manual review records are the ones that carry the most value, because merging the obvious duplicates is an automated no-brainer. Matching the ones that are too close to call is the one process that really needs to be automated!
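To make the trisection concrete, here is a minimal sketch in Python. It is an illustration only, not any vendor's method: the `normalize`, `similarity` and `classify` names, the two thresholds and the use of the standard library's `difflib` string comparison are all assumptions chosen for brevity. Real tools would parse and standardize name and address components before scoring.

```python
from difflib import SequenceMatcher


def normalize(record):
    # Hypothetical normalization: lowercase, drop periods and commas,
    # and collapse runs of whitespace.
    cleaned = record.lower().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())


def similarity(a, b):
    # Character-level similarity in [0.0, 1.0] from the standard library.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def classify(a, b, upper=0.92, lower=0.60):
    # The thresholds are illustrative assumptions, not recommended values.
    score = similarity(a, b)
    if score >= upper:
        return "match"       # definite match: merge automatically
    if score < lower:
        return "non-match"   # definitely distinct records
    return "review"          # too close to call: goes to the analyst
```

Everything that falls between the two thresholds lands in the review pile, which is exactly the band that a services-based approach cannot afford to send to a person.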

This poses two challenges to the data quality tools community. The first lies in moving from the typical batch approach to a more services-based one. The real difficulty here lies in aggregating the meta knowledge necessary for performing the matching process; in other words, an application faced with a new customer record needs to be able to scan the set of potential aliases without necessarily having access to the entire extracted data sets. Yet because duplicate elimination and householding applications employ the variant data for the purpose of entity resolution, the absence of this data is likely to hobble the process. The challenge, therefore, is to maintain the variant/alias knowledge without needing to hold onto all of the data.
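One way to picture "alias knowledge without the data" is an index that keeps only a hashed match key per known variant, mapped to an entity identifier. The sketch below is a simplified assumption of how that might look: `AliasIndex`, `match_key` and the tiny nickname and abbreviation tables are hypothetical, and exact-key lookup stands in for the fuzzier matching a real product would need.

```python
import hashlib

# Illustrative lookup tables only; real reference data is far larger,
# and "st" can also mean "saint".
NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}


def match_key(name, address):
    # Standardize tokens, sort them so word order does not matter,
    # then hash so the raw contact data need not be retained.
    raw = (name + " " + address).lower().replace(".", " ").replace(",", " ")
    tokens = []
    for tok in raw.split():
        tok = NICKNAMES.get(tok, tok)
        tok = ABBREVIATIONS.get(tok, tok)
        tokens.append(tok)
    canonical = " ".join(sorted(tokens))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class AliasIndex:
    """Maps hashed match keys to entity identifiers."""

    def __init__(self):
        self._entities = {}

    def register(self, name, address, entity_id):
        self._entities[match_key(name, address)] = entity_id

    def resolve(self, name, address):
        # Returns the known entity id, or None for an unseen variant.
        return self._entities.get(match_key(name, address))
```

A service receiving a new customer record could call `resolve` on the incoming name and address and get back an existing entity identifier, without the extracted source records being on hand.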

The second, and greater, challenge is to incorporate a degree of automated trainability into the data quality application. If the true bottleneck is the manual review, giving the application a means to internalize the data analyst's approach to decision making would allow it to "learn" and consequently become less dependent on the analyst. It is likely that the kinds of corrections applied by the analyst are neither random nor sophisticated.
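If analyst corrections really are regular, even a very simple memory of past decisions could take some pairs out of the review queue. The following sketch is a deliberately naive rule tally, not machine learning in any serious sense: `ReviewMemory`, `difference_signature` and the `min_votes` parameter are hypothetical names, and the assumption is that two record pairs differing by the same tokens deserve the same decision.

```python
from collections import defaultdict


def tokens(record):
    return record.lower().replace(".", " ").replace(",", " ").split()


def difference_signature(a, b):
    # The tokens each record has that the other lacks, stored as an
    # unordered pair so that (a, b) and (b, a) give the same signature.
    ta, tb = set(tokens(a)), set(tokens(b))
    return frozenset({frozenset(ta - tb), frozenset(tb - ta)})


class ReviewMemory:
    """Remembers analyst decisions, keyed by how two records differ."""

    def __init__(self, min_votes=3):
        self.min_votes = min_votes
        # signature -> [match votes, non-match votes]
        self._votes = defaultdict(lambda: [0, 0])

    def record_decision(self, a, b, is_match):
        slot = 0 if is_match else 1
        self._votes[difference_signature(a, b)][slot] += 1

    def suggest(self, a, b):
        # Decide automatically only when the analyst has been unanimous
        # often enough; otherwise return None and defer to manual review.
        yes, no = self._votes.get(difference_signature(a, b), (0, 0))
        if yes >= self.min_votes and no == 0:
            return True
        if no >= self.min_votes and yes == 0:
            return False
        return None
```

After an analyst has repeatedly merged pairs that differ only by "St" versus "Street", `suggest` would start returning `True` for new pairs with that same difference, while anything unfamiliar still goes to the analyst.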

I suspect that some of the vendors are already implementing some of these capabilities, although I have yet to see a true integration of any kind of knowledge discovery or machine learning applied to data quality analyst processes. Still, I am confident that as the minute name and address differences that will clog up either an EII or CDI project become more acute problems, data quality tool vendors will take on these challenges to gain the competitive advantage.
