Q:

We have a project that requires us to integrate up to 10 legacy customer-centric databases after a merger to build a customer marketing data mart. The objective is a data mart that gives a complete picture of an individual customer and his or her dealings with the company, so that we can market effectively to targeted customers. A current constraint is that the organization will continue to maintain and operate the existing legacy systems. The major problem for us now is how to create and maintain the customer dimension, since the same customer can reside in multiple sources, each with a potentially different identifier, different key fields and even different data for the same key fields. Moreover, how should we handle the slowly changing customer dimension, since an update in one of the source databases could eventually be overridden by another, even if the latter is not as current as the former? Is there a way to ensure that the most up-to-date customer information will eventually end up in the integrated customer dimension in the data mart? Any useful design tips would be greatly appreciated.

A:

Scott Howard’s Answer: Wow, to adequately discuss and solve your problems would require a full-length technical book. Until I write and publish that book, I feel that most of your problems can be handled with an adequate, yet complex, ETL system. First, it appears that you'll need good matching technology. Check out the white paper titled "Probabilistic Matching Engines for e-Business" located at: http://www.ascentialsoftware.com/litlib/dq.html. Second, it appears that you need a way to consolidate changes before loading your dimension. Consider using a staging table or file before loading your changes. The matching engine will help you reconcile your keys, and the staging mechanism will allow you to consolidate changes to those related keys.
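
The staging step can be illustrated with a short sketch. This is a minimal, in-memory stand-in for a staging table, assuming changes arrive as (reconciled key, column, value, timestamp) events; the names and layout are hypothetical, not from any particular ETL product:

```python
# Minimal sketch: consolidate changes per reconciled key before the load.
from collections import defaultdict
from datetime import datetime

# reconciled customer key -> {column -> (value, timestamp of that value)}
staging: dict[str, dict[str, tuple]] = defaultdict(dict)

def stage_change(key: str, column: str, value, ts: datetime) -> None:
    """Keep only the latest value per column until the dimension load runs."""
    prev = staging[key].get(column)
    if prev is None or ts > prev[1]:
        staging[key][column] = (value, ts)

stage_change("C-1001", "last_name", "Smith", datetime(2002, 1, 5))
stage_change("C-1001", "last_name", "Smyth", datetime(2002, 1, 2))  # older; ignored
print(staging["C-1001"]["last_name"][0])  # -> "Smith"
```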

Michael Jennings’ Answer: There are several integration methods you can use, individually or in combination, to ensure data quality in your data mart. Data validation can be performed on data originating from a particular source system to identify and mark poor-quality records for exclusion from the data mart. Validation can include the use of lookup tables (e.g., state or country) or data type checking (valid numeric, date, null) to qualify legacy records for inclusion in further processing to the data mart. Records that do not qualify are excluded from conversion or rerouted to an exception queue for further processing. Next, records can be cross-matched from several legacy systems by the same or similar key fields (e.g., SSN). Since most firms do not use the same keys across all of their systems, fuzzy matching can be used on several similar columns in each legacy system, looking for closely matching patterns such as similar last names (e.g., Smith vs. Smyth) or phone numbers (e.g., 312 555-1212 vs. 312 555-1221). Next, new data mart record values can be selected based on the most common values occurring across the various legacy systems (e.g., the last name Smith exists in 8 of the 10 legacy source systems). Alternatively, new target column values can be selected based on a prioritized hierarchy of the ten source systems (for example, use call center customer address data over billing data unless call center data for customer X does not exist).
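
Here is a minimal sketch of the fuzzy-matching and most-common-value ideas, assuming each legacy record is a dict with hypothetical "last_name" and "phone" fields; a real project would use a dedicated probabilistic matching engine rather than this simple similarity measure:

```python
from collections import Counter
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; Smith vs. Smyth scores 0.8."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def probably_same_customer(rec_a: dict, rec_b: dict) -> bool:
    names_close = similarity(rec_a["last_name"], rec_b["last_name"]) >= 0.8
    digits_a = [c for c in rec_a["phone"] if c.isdigit()]
    digits_b = [c for c in rec_b["phone"] if c.isdigit()]
    # Allow up to two differing digit positions to catch transpositions
    # such as 312 555-1212 vs. 312 555-1221.
    phones_close = (len(digits_a) == len(digits_b) and
                    sum(x != y for x, y in zip(digits_a, digits_b)) <= 2)
    return names_close and phones_close

print(probably_same_customer({"last_name": "Smith", "phone": "312 555-1212"},
                             {"last_name": "Smyth", "phone": "312 555-1221"}))  # True

# Survivorship by the most common value across the ten sources:
last_names = ["Smith"] * 8 + ["Smyth", "Smithe"]
print(Counter(last_names).most_common(1)[0][0])  # -> "Smith"
```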

Joe Oates’ Answer: This is one of the difficult issues when transitioning to enterprise information. Because a customer may be maintained in more than one system, there is a high probability that there are disparities in the data for the same customer across the different systems. The first step generally is to get all of your customers onto a single tape, disk, etc., and contract with a third party whose business is to provide the most accurate information about people and organizations regarding address, demographics, employment information, etc., to cleanse this information and provide it back to you. You can then have your "enterprise customer data mart" that will be as accurate as possible. You should understand that these third-party companies cannot guarantee that they can provide accurate information for all of your customers, but they have historically done so for 80 percent or more of the customers in projects that I have been associated with. Of course, you will have to transform the information in the customer data mart and send it to the appropriate customer-centric database. You must then establish a closed-loop procedure so that when a customer is entered, updated or deleted in one of the 10 legacy databases, the enterprise customer data mart determines: in the case of a "new customer" transaction, whether the customer already exists; in the case of an "update customer" transaction, how many of the 10 legacy databases will be affected; and in the case of a "delete customer" transaction, whether any of the other legacy databases still have the customer on file. You must also have cross-reference tables in your staging area that allow you to keep track of how many of the 10 legacy databases maintain each of the customers.
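
A minimal sketch of that cross-reference bookkeeping, assuming an in-memory dict stands in for the staging-area cross-reference table; the mart identifier and system names are hypothetical:

```python
# enterprise mart key -> {legacy system name -> that system's local key}
xref: dict[str, dict[str, str]] = {}

def handle_new(mart_id: str, system: str, local_key: str) -> bool:
    """Record a system's key; returns True if the customer already existed."""
    existed = mart_id in xref
    xref.setdefault(mart_id, {})[system] = local_key
    return existed

def handle_delete(mart_id: str, system: str) -> bool:
    """Drop one system's key; returns True if other systems still hold the customer."""
    keys = xref.get(mart_id, {})
    keys.pop(system, None)
    if not keys:
        xref.pop(mart_id, None)
        return False
    return True

handle_new("C-1001", "billing", "B-42")
handle_new("C-1001", "call_center", "CC-7")  # returns True: already known
print(handle_delete("C-1001", "billing"))    # True: call_center still has it
```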

Establishing a single source of the truth for an enterprise will require you to tackle some issues that may take longer than they should to resolve due to complexity, the reluctance to share data and political issues. That being said, a single source for customer information will give your organization a tremendous advantage in leveraging your customer relationships for up-selling, cross-selling and building customer loyalty through better service.

Clay Rehm’s Answer: First off, this sounds like an exciting and fun project!

The customer table in the new customer marketing data mart (CMDM) must have its own unique primary identifier (a surrogate key). In addition to that key, the table should carry the key from each source system as attributes.
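
To make that key design concrete, here is a minimal sketch, assuming the mart owns an integer surrogate key and stores each source system's native key alongside it; the field and system names are hypothetical:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CustomerDimRow:
    cmdm_key: int  # surrogate primary key, owned by the mart and never reused
    # one entry per legacy source; None when that source lacks the customer
    source_keys: dict[str, Optional[str]] = field(default_factory=dict)

row = CustomerDimRow(cmdm_key=1001,
                     source_keys={"billing": "B-42", "crm": None})
```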

As far as slowly changing dimensions, let’s review what they are first:

  • Type 0 – Ignore the change
  • Type 1 – Overwrite the change
  • Type 2 – Create a new dimension record for the change
  • Type 3 – Add a "current" attribute to an existing dimension record
  • Type 4 – History table

You may need an SCD strategy for each source system, depending on how the data is stored and how you extract and integrate it with the other data. Not knowing your data, it is hard to determine which method to use. However, you must be able to match and compare certain data elements such as first name, last name, middle name, phone, etc., that are used in every source system.
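
For illustration, here is a minimal sketch of a Type 2 change (a new dimension record per change), assuming an in-memory list of row dicts with hypothetical effective/expiry columns:

```python
from datetime import date

dim: list[dict] = [
    {"cmdm_key": 1, "source_key": "B-42", "last_name": "Smith",
     "effective": date(2000, 1, 1), "expiry": None, "current": True},
]

def apply_type2_change(source_key: str, new_values: dict, as_of: date) -> None:
    """Expire the current row and append a new one carrying the change."""
    for row in dim:
        if row["source_key"] == source_key and row["current"]:
            row["expiry"], row["current"] = as_of, False
            new_row = {**row, **new_values,
                       "cmdm_key": max(r["cmdm_key"] for r in dim) + 1,
                       "effective": as_of, "expiry": None, "current": True}
            dim.append(new_row)
            return

apply_type2_change("B-42", {"last_name": "Smyth"}, date(2001, 6, 1))
```
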
Finally, I would recommend retaining an experienced data warehouse consultant to guide and assist with this design.

Les Barbusinski’s Answer: Integrating data for a specific entity (i.e., Customer) from disparate source systems is never easy. Some information about the entity can be business unit or application specific, hence requiring little integration (e.g., sales transactions from different business units within the enterprise). Other information can be common across the enterprise, hence requiring significant integration (such as Customer name and address data).

Assigning a unique identifier is always a problem because matching the entity (in this case, a customer) across multiple systems requires one or more matching attributes (such as name, SSN, date of birth, address, phone numbers, credit card number, etc.) that may vary in structure, content and/or quality from system to system. This almost always involves the use of complex and/or "fuzzy" matching logic that is never perfect. It also usually involves storing the source system identifiers for each entity within the affected warehouse table so that future updates from a specific source system can be matched more easily.

Synchronizing updates (i.e., determining whether the source record at hand represents an earlier or later snapshot of the entity than what is already in the data warehouse) usually involves storing the date/time of the last update from each source system in the warehouse table.
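
A minimal sketch of that timestamp check, assuming the warehouse row carries a per-source map of last-update date/times (a hypothetical layout, not the only way to store them):

```python
from datetime import datetime

def is_newer(warehouse_row: dict, system: str, incoming_ts: datetime) -> bool:
    """Apply an incoming snapshot only if it postdates the last one
    seen from the same source system."""
    last_seen = warehouse_row.setdefault("last_update", {}).get(system)
    return last_seen is None or incoming_ts > last_seen

row = {"last_update": {"billing": datetime(2002, 3, 1, 9, 0)}}
print(is_newer(row, "billing", datetime(2002, 2, 1, 9, 0)))  # False: stale
print(is_newer(row, "billing", datetime(2002, 4, 1, 9, 0)))  # True
```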

Assessing relative value of incoming data (i.e., determining whether the value from system A should override the value from system B) involves a lot of up-front analysis in order to establish a relative ranking of data quality by source system for logical groups of data elements. This relative ranking must then be built into your ETL logic.
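
That ranking can be expressed directly in the ETL code. A minimal sketch, assuming a hypothetical per-attribute-group priority list over made-up system names:

```python
# Highest-ranked source first, per logical group of data elements.
RANKING = {"address": ["call_center", "billing", "web"],
           "name":    ["billing", "call_center", "web"]}

def pick_value(group: str, candidates: dict[str, str]) -> str:
    """Return the value from the highest-ranked source that supplied one."""
    for system in RANKING[group]:
        if candidates.get(system) is not None:
            return candidates[system]
    raise ValueError(f"no source supplied a value for {group}")

print(pick_value("address", {"billing": "12 Elm St", "web": "12 Elm Street"}))
# -> "12 Elm St": billing wins because call_center supplied nothing
```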
