AUG 1, 2005 1:00am ET

Related Links

10 Sustainability Predictions for 2011
February 23, 2011
A Letter to Future Employees: Embrace Analytics
February 3, 2011
A Hunger for Risk
January 6, 2011

Web Seminars

Creating a Sense of Application Awareness in IT Virtualization Environments
Available On Demand
Transforming Processes to Achieve Greater Agility and Efficiency
Available On Demand

Fuzzy Data Integration, Part 1

Print
Reprints
Email

How do you deal with integrating data from heterogeneous sources that do not share common keys?

One of the greatest challenges and most time-consuming efforts in a data warehouse project is integrating data between disparate data sources into a meaningful collection of information for users to make decisions. Significant resources need to be applied to understanding not only the data and structures, but also gathering and analyzing the meta data of source systems to fully appreciate the context of the information. Further efforts may also need to be utilized or developed if the data exists in nonstructured or proprietary formats, exasperating an already very complex process. Once this basic step of understanding and accessing your data sources is somewhat manageable, you need to determine the best approach to integrating this data. Depending on your business requirements and source system data state, integrating data from these systems can be exigent if common data columns exist between the applications. The task becomes even more difficult if common elements do not exist between source systems and new, inexact methods need to be designed and developed in order to meet business needs to integrate this data.

This column is the first part of a two-part series examining a data integration strategy for matching and integrating data between heterogeneous source systems when no common elements exist to facilitate the process. In this installment of the series, we will look at some typical data integration approaches and provide an example case. In the second part of this series, we will look at the logic needed to implement this data matching method and some of the details that need to be considered before attempting this process. We are making the assumption that the data quality and data integration product or method being used by your firm does not have this type of capability.

Data integration into the data warehouse typically occurs through one of several methods. Your data integration process should support these methods whether used individually or in combination to improve data quality in the data warehouse environment. Data validation should be able to be performed on data originating from a particular source system to identify and mark poor quality records. Validation can include use of lookup tables (e.g., ISO code standards) or data type checking (valid numeric, date and null) to qualify source records for inclusion in further processing to the data warehouse. Records that do not qualify are excluded from the primary data integration workflow and rerouted to an exception queue for alternate processing or review. The first data integration process method consists of cross matching records from several source systems by same or similar key fields (e.g., employee ID). Similar key field matching may consist of some type of masking of the comparison fields to eliminate character differences (e.g., SMITH100321  <-> SMITH, JOHN using SMITH*). The second method involves selection of record values based on the most common values occurring across the various source systems (e.g., last name Smith exists in two of the four source systems). Third, new target column values can be selected based on a prioritized hierarchy of the source systems (for example, use call center customer address data over billing unless call center data for customer X does not exist).

Because most firms do not use the same keys across all of their systems, fuzzy matching can be used on several similar columns on each source system looking for closely matching patterns such as similar last names (e.g., Smith versus Smyth) or phone numbers (e.g., 312-555-1212 versus 312-555-1221). Alternately, a pattern matching process can be set up to reconcile the dissimilar column values to integrate the data. In next month's column, we will look at one approach to this logic.  

Michael Jennings is a recognized industry expert in enterprise information management. He has more than twenty years of information technology experience in the healthcare, telecommunications, insurance, banking, manufacturing and logistics, and human resources industries and has experience within governmental organizations. Jennings has published numerous industry columns/articles for EIM Institute, Information Management and Intelligent Enterprise magazines and others, and has been a judge for numerous awards. He speaks frequently on enterprise information management issues and practices at major industry conferences, and has been an instructor of information technology at the University of Chicago's Graham School. Jennings is a co-author of the book "Universal Meta Data Models" (2004) and a contributing author to the books "Building and Managing the Meta Data Repository" (2000) and "The DAMA Guide to the Data Management Body of Knowledge" (DAMA-DMBOK, 2009).

Filed under:

Advertisement

Comments (0)

Be the first to comment on this post using the section below.

Add Your Comments:
You must be registered to post a comment.
Not Registered?
You must be registered to post a comment. Click here to register.
Already registered? Log in here
Please note you must now log in with your email address and password.
Twitter
Facebook
LinkedIn
Login  |  My Account  |  White Papers  |  Web Seminars  |  Events |  Newsletters |  eBooks
FOLLOW US
Please note you must now log in with your email address and password.