How to Improve Customer Data Quality
Information Management Newsletters, January 19, 2010
In any customer-centric business, be it hospitality, banking, retail or insurance, there are numerous touchpoints where the consumer interacts with the business. Many interactions take place between the consumer and the business through various direct and indirect channels: direct marketing campaigns (email, mailers, telemarketing, etc.); points of sale; information kiosks; online shopping portals; and feedback forms for services rendered.
During all these transactions or points of contact, consumer data is collected in varying ways. The trouble lies in the lack of a consistent framework for collecting consumer attributes. Most organizations collect data on the same consumer through multiple channels, with no consistency in the attributes collected. Hence, when these organizations build data warehouses and data marts to study consumer behavior, the consumer tables in the warehouse or mart end up with a large number of duplicates. This can be disastrous for any business.
It can result in multiple mailers to the same consumers or to consumers who have opted out of direct marketing campaigns, resulting in legal complications and loss of consumer loyalty. Any ROI analysis would yield skewed figures if consumer data is not consistent. A consistent or single view of consumer data across the enterprise is necessary to prevent such scenarios.
Consumer Deduplication Strategy
Data deduplication is the process of identifying duplicate consumer data in consumer-centric databases and taking corrective action to cleanse the data of those duplicates, while ensuring that no coherent, accurate and relevant data is lost in the process.
Follow these steps to formulate and implement a successful consumer deduplication strategy.
1. Understand Data Quality
Data quality issues include inconsistency in attributes, invalid data and duplicate records. It is recommended that data quality be enhanced and issues be resolved before a deduplication process is run. This ensures that the deduplication process runs on better quality data.
2. Investigate Data and Data Quality Issues
Data investigation is important not only to determine the data quality issues but also to understand, through data profiling, the key attributes needed to define a consumer uniquely. The records in the data environment under investigation must be a good representative sample of the data quality issues and deduplication scenarios in the production database. Data investigation can be done with tools or manually, using hand-written SQL queries. Automated tools expose data patterns better and may be the preferred approach.
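As an illustration of what a basic profiling pass might look like, the following Python sketch reports how often each attribute is filled in and which combinations of attributes repeat across records. The field names (`name`, `email`, `zip`) and the sample records are assumptions made for this example, not attributes prescribed by any particular tool.

```python
from collections import Counter


def profile(records, fields):
    """Return, per field, the fill rate and the number of distinct non-empty values."""
    report = {}
    total = len(records)
    for field in fields:
        values = [str(r.get(field) or "").strip() for r in records]
        filled = [v for v in values if v]
        report[field] = {
            "fill_rate": len(filled) / total if total else 0.0,
            "distinct": len(set(filled)),
        }
    return report


def duplicate_candidates(records, key_fields):
    """Return the normalized key combinations shared by more than one record."""
    keys = Counter(
        tuple(str(r.get(f) or "").strip().lower() for f in key_fields)
        for r in records
    )
    return {key: count for key, count in keys.items() if count > 1}


# Hypothetical sample records, for illustration only.
sample = [
    {"name": "Ann Lee", "email": "ann@example.com", "zip": "10001"},
    {"name": "ann lee ", "email": "", "zip": "10001"},
    {"name": "Bob Ray", "email": "bob@example.com", "zip": None},
]

report = profile(sample, ["name", "email", "zip"])
candidates = duplicate_candidates(sample, ["name", "zip"])
```

A low fill rate suggests an attribute is a poor candidate for defining a consumer uniquely, while repeated key combinations hint at the deduplication scenarios the later steps need to handle.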
3. Determine Match Rules and Criteria
Results of the data profiling exercise should be published, and the consumer attributes proposed for matching records must be understood and confirmed by business users of the system. This is important to ensure that the match criteria make business sense. Typically, matching can be of three types: commercial matching, household matching and individual matching.
Commercial matching involves matching businesses or consumers belonging to business houses. Household matching involves matching consumers to households. Often, country-specific third-party data is used to do household or family matching. There are, however, some scenarios that need to be handled when dealing with third-party data:
- Third-party data providers normally charge for each instance of consumer verification. This may turn out to be a costly, time-consuming exercise and is usually done once a month or at larger time intervals (like bimonthly or quarterly).
- Third-party consumer data may not exactly match the consumer data that an organization builds up over a period of time. When no data is found in the third-party database corresponding to consumer data in the organization's database, a decision needs to be made on how these consumers will be matched.
Individual matching involves matching consumers belonging to the same household and is usually done after household matching. In some cases it may be useful to match on other attributes of the consumer, such as number, name suffix, gender, etc. Matching is usually performed by data cleansing tools.
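The two-stage idea of household matching followed by individual matching can be sketched in Python as below. This is a simplified illustration that assumes exact matching on normalized keys; data cleansing tools typically apply fuzzier comparison logic. The field names and the choice of key attributes are hypothetical and would need to be confirmed by business users.

```python
from collections import defaultdict


def normalize(value):
    """Lowercase and collapse whitespace so trivially different values compare equal."""
    return " ".join(str(value or "").lower().split())


def household_key(rec):
    # Assumed household definition: same surname at the same address.
    return (
        normalize(rec.get("last_name")),
        normalize(rec.get("address")),
        normalize(rec.get("postal_code")),
    )


def individual_key(rec):
    # Individuals are matched within a household, here by first name.
    return household_key(rec) + (normalize(rec.get("first_name")),)


def group_by(records, key_func):
    """Group record ids by the key that key_func computes for each record."""
    groups = defaultdict(list)
    for rec in records:
        groups[key_func(rec)].append(rec["id"])
    return dict(groups)


# Hypothetical sample records, for illustration only.
records = [
    {"id": 1, "first_name": "Ann", "last_name": "Lee", "address": "12 Oak St", "postal_code": "10001"},
    {"id": 2, "first_name": "ann", "last_name": "LEE", "address": "12  oak st", "postal_code": "10001"},
    {"id": 3, "first_name": "Tom", "last_name": "Lee", "address": "12 Oak St", "postal_code": "10001"},
    {"id": 4, "first_name": "Bob", "last_name": "Ray", "address": "9 Elm Rd", "postal_code": "20002"},
]

households = group_by(records, household_key)
individuals = group_by(records, individual_key)
```

In this sample, records 1, 2 and 3 fall into one household, and within it records 1 and 2 form a matched group for the same individual.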
4. Identify Survivorship Criteria
Now that records belonging to the same matched group are identified, select a survivor record in each of the matched groups. Survivorship criteria are a product of the initial data investigation/data profiling exercise. It is highly recommended that business users agree with the survivorship criteria, because identifying survivors based on attributes that have limited business significance may be detrimental to the efficiency and quality of the deduplication process. The best way to identify the survivor is to retain the record that best matches the survivorship criteria. As consumer data is often highly sensitive, it is important to retain the best consumer data possible.
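A minimal Python sketch of survivor selection follows. It assumes, purely for illustration, that the agreed survivorship criteria are "most complete record first, most recently updated record as the tie-breaker"; actual criteria come out of data profiling and business sign-off. The field names and sample group are hypothetical.

```python
from datetime import date

# Attributes counted toward completeness (an assumption for this example).
SCORED_FIELDS = ("email", "phone", "address")


def completeness(rec, fields):
    """Count how many of the given fields hold a non-empty value."""
    return sum(1 for f in fields if str(rec.get(f) or "").strip())


def pick_survivor(group, fields):
    """Return the record with the highest completeness, then the latest update date."""
    return max(group, key=lambda r: (completeness(r, fields), r["updated"]))


# One hypothetical matched group of duplicate records.
group = [
    {"id": 1, "email": "ann@example.com", "phone": "", "address": "12 Oak St", "updated": date(2009, 3, 1)},
    {"id": 2, "email": "ann@example.com", "phone": "555-0100", "address": "12 Oak St", "updated": date(2008, 7, 15)},
    {"id": 3, "email": "", "phone": "", "address": "12 Oak Street, Apt 4", "updated": date(2009, 11, 2)},
]

survivor = pick_survivor(group, SCORED_FIELDS)
```

Here record 2 survives because it is the only one with all three scored attributes filled in, even though it is not the most recently updated.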
5. Determine Merge Rules and Criteria
Once the significant problem of finding the survivor is resolved, it is important to realize that some attributes of the records marked as duplicates may be more recent, more complete or of better quality. In such an instance, it is necessary to merge these better attributes of the duplicate records into the survivor record. Again, this action needs to be performed on the basis of merge rules/criteria. Merge rules are also defined on the basis of data investigation and data profiling. For instance, a merge rule could be to update the survivor record's address field with the address field of the record with the longest address field value. Where there are date fields, it is necessary to retain the latest date (i.e., indicating last change of address). It is highly recommended that these merge rules be certified by business users.
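The two example rules mentioned above (longest address, latest date) can be sketched in Python as follows. The field names are hypothetical, and the third rule, which fills empty survivor attributes from duplicates, is an assumption added here to illustrate the "more complete" case; it is not a rule stated in the text.

```python
from datetime import date


def merge_into_survivor(survivor, duplicates):
    """Return a copy of the survivor enriched with better attributes from duplicates."""
    merged = dict(survivor)  # leave the original survivor record untouched
    for dup in duplicates:
        # Rule 1: keep the longest address value.
        if len(dup.get("address") or "") > len(merged.get("address") or ""):
            merged["address"] = dup["address"]
        # Rule 2: keep the latest address-change date.
        dup_date = dup.get("address_changed")
        if dup_date and (not merged.get("address_changed") or dup_date > merged["address_changed"]):
            merged["address_changed"] = dup_date
        # Rule 3 (assumption): fill attributes that are empty on the survivor.
        for field, value in dup.items():
            if field != "id" and value and not merged.get(field):
                merged[field] = value
    return merged


# Hypothetical survivor and duplicates from one matched group.
survivor = {"id": 2, "email": "ann@example.com", "phone": "555-0100",
            "address": "12 Oak St", "address_changed": date(2008, 7, 15)}
duplicates = [
    {"id": 1, "email": "ann@example.com", "phone": "",
     "address": "12 Oak St", "address_changed": date(2009, 3, 1)},
    {"id": 3, "email": "", "phone": "",
     "address": "12 Oak Street, Apt 4", "address_changed": date(2009, 11, 2)},
]

merged = merge_into_survivor(survivor, duplicates)
```

Note that applying the rules independently can combine an address from one record with a date from another, which is one reason the rules should be reviewed by business users before they run against production data.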
6. Maintain Survivor Duplicates Trail History
It is important to note that while a set of records may have been marked as duplicates and must be purged from the consumer-related tables in the warehouse, it may be worthwhile to retain these deleted records in trail tables, which store the relationship of each duplicate record to its survivor record as well as the deleted record's attributes.
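One possible shape for such a trail table is sketched below using Python's built-in `sqlite3` module and an in-memory database. The table names (`consumer`, `consumer_dedup_trail`), the columns and the choice to store the purged attributes as JSON are all assumptions for illustration; a warehouse implementation would follow its own schema conventions.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE consumer (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute(
    """CREATE TABLE consumer_dedup_trail (
           duplicate_id INTEGER PRIMARY KEY,
           survivor_id  INTEGER NOT NULL,
           attributes   TEXT NOT NULL,          -- purged record's attributes as JSON
           purged_at    TEXT DEFAULT CURRENT_TIMESTAMP
       )"""
)
conn.executemany(
    "INSERT INTO consumer VALUES (?, ?, ?)",
    [(1, "Ann Lee", "ann@example.com"), (2, "ann lee", None)],
)


def purge_duplicate(conn, duplicate_id, survivor_id):
    """Copy the duplicate into the trail table, then delete it, in one transaction."""
    row = conn.execute(
        "SELECT id, name, email FROM consumer WHERE id = ?", (duplicate_id,)
    ).fetchone()
    if row is None:
        raise ValueError(f"no consumer with id {duplicate_id}")
    attributes = json.dumps(dict(zip(("id", "name", "email"), row)))
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "INSERT INTO consumer_dedup_trail (duplicate_id, survivor_id, attributes) "
            "VALUES (?, ?, ?)",
            (duplicate_id, survivor_id, attributes),
        )
        conn.execute("DELETE FROM consumer WHERE id = ?", (duplicate_id,))


purge_duplicate(conn, duplicate_id=2, survivor_id=1)
```

Writing the trail row and deleting the duplicate in the same transaction means a record is never removed without its link to the survivor being recorded.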







