Data profiling is the process of studying and analyzing data to validate it against expected data formats and values. Data profiling has traditionally been the purview of the data warehouse, where it was used against disparate source systems to yield data standardization and unification rules as part of extract, transform and load (ETL) processing. Profiling matters for data warehouses, but it is even more critical for customer data integration (CDI) hubs.

Not Your Father's Data Profiling

Data profiling takes on a whole new dimension when used for CDI. CDI involves the use of a centralized, transactional hub, which reconciles, unifies and integrates customer-specific data. Because such hubs are targeted to customer data, data profiling is narrower in scope: it focuses on customers' personally identifiable information, including name, address, phone and email. But the complexity and depth of analysis of these elements increase due to the very nature of this data.
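To make field-level profiling concrete, here is a minimal Python sketch that counts missing, valid and invalid values for the phone and email fields. The format rules, field names and sample records are assumptions for illustration only; real rules would come from the business and the source systems.

```python
import re
from collections import Counter

# Hypothetical format rules (assumptions, not taken from any real system).
PATTERNS = {
    "phone": re.compile(r"^\d{3}-\d{3}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

def profile_field(records, field):
    """Count missing, valid and invalid values for one field."""
    counts = Counter()
    pattern = PATTERNS[field]
    for rec in records:
        value = (rec.get(field) or "").strip()
        if not value:
            counts["missing"] += 1
        elif pattern.match(value):
            counts["valid"] += 1
        else:
            counts["invalid"] += 1
    return dict(counts)

# Made-up sample records.
sample = [
    {"phone": "555-123-4567", "email": "jsmith@example.com"},
    {"phone": "5551234567", "email": ""},
    {"phone": None, "email": "not-an-email"},
]

phone_profile = profile_field(sample, "phone")
email_profile = profile_field(sample, "email")
```

Even a simple tally like this shows which fields are dependable enough to support matching and which need cleansing first.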

As CDI becomes central to a company's overall customer-centric business strategy, CDI data must be reliable, accurate and meaningful because of its impact on other corporate systems and on accurate customer identification. Matching success depends directly on the quality of the customer data itself. The exact data quality functions (and implementation of business rules) that the data quality component performs depend on what needs to be fixed in the customer data. This is determined by data profiling.

I'd go so far as to say that best practice CDI implementations depend on how well customer data is profiled early on in the implementation as well as whether data profiling becomes a standard and ongoing CDI task. Moreover, data profiling can influence the overall cost of a CDI solution. Despite the prevailing energy and excitement in most IT departments about implementing CDI, many still lack awareness of how critical data profiling is.

Consider a CDI implementation at a major telecom company. The IT organization was initially reluctant to endorse data profiling. Ironically, it was the business that advocated profiling the data. Users had seen bad data from the billing system show up in their data warehouse, a phenomenon that had caused many BI end users to shun the enterprise data warehouse in favor of more limited local data marts. CDI business stakeholders understood that the value of the new single version of the truth delivered by the CDI hub would be proportional to the quality of the data.

Data Profiling Results for CDI

We performed the initial data profiling activity against a customer source data set that ran into tens of millions of records. We found some familiar errors, but we made some new discoveries, too.

For one thing, the source system city names were abbreviated (due to legacy field size limitations) and were unreliable. Thus, addresses coming from that source system could not be matched with other source system data. The CDI hub had to derive the correct city names using postal directories.
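One way such a derivation might look is sketched below in Python. It assumes the postal code in the source record is trustworthy and that the postal directory can be represented as a simple lookup from postal code to city; the directory contents, field names and function name are hypothetical.

```python
# Hypothetical postal directory: postal code -> canonical city name.
POSTAL_DIRECTORY = {
    "94105": "San Francisco",
    "10001": "New York",
}

def derive_city(record, directory=POSTAL_DIRECTORY):
    """Return the directory city for the record's postal code.

    Falls back to the (possibly abbreviated) source city when the
    postal code is not in the directory.
    """
    # Keep only the first five characters so ZIP+4 values still resolve.
    postal_code = (record.get("postal_code") or "").strip()[:5]
    return directory.get(postal_code, record.get("city"))

fixed = derive_city({"city": "S FRAN", "postal_code": "94105-1234"})
unchanged = derive_city({"city": "SPRNGFLD", "postal_code": "00000"})
```

Records that fall back to the source city are worth counting separately, since they are the ones most likely to fail address matching.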

We also discovered that the processing architecture involved a two-stage matching process in which a customer's Social Security number (SSN) would be matched first, and any so-called "fallouts" would be matched on name and address. Our data profiling work found that records with valid SSNs were a small percentage of total records. Matching on the SSN field had little value when weighed against the additional processing overhead that would be needed to match the remainder of the records. We modified the data architecture to use single-step matching involving only name and address, thus reducing overall development costs and complexity.
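The kind of measurement behind that decision can be sketched in a few lines of Python. The validity rule here (nine digits with optional hyphens, excluding a short list of placeholder values) is an assumption for illustration, not the rule used on the project, and the function names are hypothetical.

```python
import re

# Assumed format rule: nine digits, hyphens optional.
SSN_PATTERN = re.compile(r"^\d{3}-?\d{2}-?\d{4}$")

# Placeholder values of the sort often keyed into legacy systems (an assumption).
PLACEHOLDERS = {"000000000", "111111111", "999999999", "123456789"}

def is_plausible_ssn(value):
    """Return True if the value looks like a real SSN under the assumed rule."""
    if not value:
        return False
    value = value.strip()
    if not SSN_PATTERN.match(value):
        return False
    return value.replace("-", "") not in PLACEHOLDERS

def ssn_coverage(records):
    """Fraction of records that carry a plausible SSN."""
    if not records:
        return 0.0
    valid = sum(1 for rec in records if is_plausible_ssn(rec.get("ssn")))
    return valid / len(records)

# Made-up sample: one plausible SSN out of four records.
sample = [
    {"ssn": "234-56-7890"},
    {"ssn": ""},
    {"ssn": None},
    {"ssn": "999999999"},
]
coverage = ssn_coverage(sample)
```

When the coverage figure is low, most records fall through to the second stage anyway, which is the argument for dropping the SSN stage altogether.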

In addition, we found that some customer accounts had identical SSNs but entirely different names and addresses. This indicated potential fraudulent activity. We found several accounts with multiple names, like "John and Mary Smith," which represented a violation of the business rule that there can be only a single financially responsible party assigned to a given account. We also found several extraneous data tokens embedded in the source system address fields, such as c/o (care of) and attn: accounts payable. This finding drove some much-needed data quality process refinements.
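Checks for these three findings might be sketched as follows in Python. The token list, the joint-name pattern and the field names are illustrative assumptions, and for brevity the shared-SSN check compares names only, whereas a fuller check would also compare addresses.

```python
import re
from collections import defaultdict

# Illustrative patterns only; a real token list would be far longer.
ADDRESS_NOISE = re.compile(r"\b(?:c/o|attn)\b", re.IGNORECASE)
JOINT_NAME = re.compile(r"\s(?:and|&)\s", re.IGNORECASE)

def conflicting_ssns(records):
    """Map each SSN to its names when more than one distinct name shares it."""
    names_by_ssn = defaultdict(set)
    for rec in records:
        if rec.get("ssn"):
            names_by_ssn[rec["ssn"]].add(rec["name"].strip().lower())
    return {ssn: names for ssn, names in names_by_ssn.items() if len(names) > 1}

def flag_record(rec):
    """Return the list of rule violations found in one record."""
    flags = []
    if JOINT_NAME.search(rec.get("name", "")):
        flags.append("multiple_names")
    if ADDRESS_NOISE.search(rec.get("address", "")):
        flags.append("address_noise")
    return flags

# Made-up sample records.
sample = [
    {"ssn": "234-56-7890", "name": "John Smith", "address": "1 Main St"},
    {"ssn": "234-56-7890", "name": "Alice Jones", "address": "c/o Bob, 9 Elm St"},
    {"ssn": "345-67-8901", "name": "John and Mary Smith",
     "address": "ATTN: Accounts Payable, 5 Oak Ave"},
]
conflicts = conflicting_ssns(sample)
flags = [flag_record(rec) for rec in sample]
```

Flags like these are candidates for review rather than verdicts: a shared SSN may be a keying error as easily as fraud.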

As with data warehouses, CDI hubs are only as good as their data. A CDI development plan should include formal and repeatable data profiling. It will take you a long way toward improving that data before it is processed and matched in the hub, saving time, money and manual data reconciliation efforts. Problem solved!
