Free Site RegistrationFree Site Registration

Data Cleansing Strategies

InfoManagement Direct, October 2004

Kuldeep Dongre

The presence of data alone does not ensure that all the management functions and decisions can be smoothly undertaken. There is a compulsive requirement for the data to be meaningful or, in other words, data quality is of utmost importance if management is to take any advantage of the data at their disposal.

Data quality pertains to issues such as:

  • Accuracy
  • Integrity
  • Cleanliness
  • Correctness
  • Completeness
  • Consistency

The quality of data is often evaluated to determine usability and to establish the processes necessary for improving data quality. Data quality may be measured objectively or subjectively. Data quality is a state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use.

Advertisement

This article explores the various factors that make data cleansing of the legacy system inevitable and provides strategies that can be adopted. Moreover, it deals with the factors that determine the choice of a particular strategy for data cleansing.

Problem Definition

Practically, any application contains dirty data - data that is meaningless, is not representative of the business it is used in, has some obvious error or becomes meaningless in the new application environment once the legacy system is converted and migrated. I will analyze two scenarios wherein the data is required to be converted and migrated to a new application (target system) from the existing application, which is henceforth referred to as legacy system, and when the legacy system is to be retained.

Data Quality Issues - Root Cause Analysis

Data quality issues in the legacy system arise because of the following factors:

  • Application ErrorS: These data errors creep in because of the inability of the legacy system to validate certain user inputs.
  • Human ErrorS: This is a major source of dirty data in the legacy system. A major chunk of these errors can be attributed to the legacy system's inability to validate data, but some errors are logical in nature. For instance, consider a date field that refers to the purchase date for a particular piece of equipment. Now, the user may input a date, which can be valid but can be wrong from a business perspective (i.e., the user may enter a date in which the business never existed).
  • Deliberate Manipulations: A user may be forced by the legacy system to enter data that is prima-facie incorrect but is inevitable because the legacy system would reject the data otherwise. Another source for these deliberate manipulations is the legacy system user who may fudge the data to fulfil certain ends, which may be unethical.
  • Target system model definition: This can be a factor only when a legacy system data conversion and migration is part of the cleansing project. The target system model may dictate the data to be in a certain format, which cannot be found in the legacy system. Though this is not essentially dirty data, the need for conversion and migration makes it compulsive for the business to cleanse it. Alternatively, depending on the difficulty level of conversion required, it can be incorporated in the conversion process.

The presence of this dirty data poses a serious threat to the management and can affect the decisions that are taken. Data cleansing initiative is a direct consequence of the inability of the management to translate data at hand into effective, winning decisions.

Data Cleansing Methodology

In a data conversion project the main objective is to convert and migrate clean data into target system. This calls for a need to cleanse legacy data. Cleansing can be an elaborate process depending on the method chosen and has to be planned carefully to achieve the objective of elimination of dirty data.

Some methods to accomplish the task of data cleansing of legacy system include:

  • Automated data cleansing
  • Manual data cleansing
  • The combined cleansing process

Automated Data Cleansing

The generalized method of carrying out an automated data cleansing is detailed below. Refer to Figure 1 for a diagrammatic representation of the process.

  1. Error Identification Process (Data Audit): The first and foremost step is to identify and categorise the various errors in the legacy system. This is also called the data audit process. This is generally done after a study of the functionality of the legacy system with the help of business analysts. A data audit would reveal the volume of these error types. A data audit process will provide:
    • The error types that need cleansing. These are called as critical errors types.
    • The error types that can safely be ignored as they are not business critical. These can be classified as non-critical error types.
    • Data volume of each of the critical error types.

    Various error types are identified for simplicity of tackling it programmatically in the subsequent data cleansing process.

  2. Error Reporting: It is necessary to verify the dirty data that is identified and extracted by the error identification process. Business analysts are involved in the verification of this data. This verification is done through user-friendly reports of dirty data based on error type.


    Figure 1: Automated Data Cleansing

  3. Automated Data Cleansing: This can typically be a batch process to correct the dirty data based on the error types. All logical error types in data structure can be corrected through this programmed cleansing process. However, one should be mindful of the fact that there will be some potential error types that cannot be corrected by the automated process. Manual intervention is the only way to deal with such error types. This method is dealt in detail later in this particle.
  4. Post Cleanse Data Audit: Data cleansed automatically by the batch process is required to be verified by re-run of error identification process. This is to ensure the successful completion and flawless functioning of the automatic data cleansing process.
  5. Legacy System Update: This is a crucial step of the process. The data cleanse process is normally run on a different set of target data sources which are populated at the end of the automated data cleanse process. This separate data-area is called cleansing staging area. This is to avoid any irreversible, accidental, incorrect changes to the production legacy system. It is not a very good idea to update the production legacy system directly from the cleansing process, as in doing so we would not be dealing with the error prone data set but with the entire data set. Instead, the cleansed data in a different data source is used to update the production legacy system after ensuring that the data is clean (in step 4). The end result of this would be a clean legacy system.

Page 1 of 3.

Advertisement

Advertisement