Data Cleansing

Published
  • March 01 1999, 1:00am EST

What is clean data, and why do we need our data cleansed? After all, it wasn't long ago that just having valid data was sufficient. What is the difference, anyway? Let's see if we can answer these questions. First, let's look at the subject from a purely systemic point of view. Data cleansing is an important part of data acquisition. It consists of four steps, sketched in code after the list, that really haven't changed since we first started storing data in computers:

  1. Establish a baseline set of values, a validation table.
  2. Audit acquired data against this baseline set of values.
  3. Research any acquired data element that fails the audit.
  4. Based on the research, either reject the data element or use it to update the baseline set of values.
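
To make these steps concrete, here is a minimal sketch in Python; the validation table, field name and values are hypothetical rather than taken from any particular system.

    # A minimal sketch of the classic cleansing loop; all values are hypothetical.
    VALID_STATE_CODES = {"NY", "NJ", "CT"}      # step 1: the baseline validation table

    def audit(records):
        accepted, suspects = [], []
        for record in records:                  # step 2: audit each acquired element
            if record["state"] in VALID_STATE_CODES:
                accepted.append(record)
            else:
                suspects.append(record)         # step 3: set aside for research
        return accepted, suspects

    accepted, suspects = audit([{"state": "NY"}, {"state": "ZZ"}])
    # Step 4 happens after the research: each suspect is either rejected outright
    # or its value is added to VALID_STATE_CODES for future audits.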

The important distinction between data cleansing and data validation is that validation (edit checking) merely examines acquired data to ensure that the necessary formatting rules have been followed; cleansing goes further and compares acquired values against a baseline of accepted values.
While the data cleansing process has not changed significantly through the years, what has changed is the end result we expect from it and the complexity our heightened expectations now add to each of the steps above.
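
To make the distinction concrete, consider the small, hypothetical example below: the edit check accepts any value that follows the formatting rule, even one that no baseline would recognize, while cleansing also compares the value against accepted values.

    import re

    def validate_zip(value):
        # Edit checking: confirms only that the formatting rule is followed.
        return bool(re.fullmatch(r"\d{5}(-\d{4})?", value))

    def cleanse_zip(value, accepted_zips):
        # Cleansing: also checks the value against a baseline of accepted values.
        return validate_zip(value) and value in accepted_zips

    print(validate_zip("00000"))                     # True, even if no such ZIP code exists
    print(cleanse_zip("00000", {"10001", "07302"}))  # False: valid format, unknown value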

It is no longer acceptable for a computer to act like an electronic typewriter. Today's word processors must include a spell checker, grammar checker and thesaurus, which are used to improve the quality of your documents. Likewise, it is no longer sufficient for computers to spit out columnar reports and simple bar charts. Spreadsheets were wonderful ten years ago, but the current demand is for greater levels of deductive reasoning. Computers must now supply answers to problems of reason, not simple computations. They must use neural networks to do data mining in order to produce actionable reports.

To accomplish this, computers need more and more data, much of it subjective rather than objective. To store this information, relational data storage structures have become larger and more complex. For example, the old customer master file has become a customer repository containing much larger quantities of interrelated facts about every customer. These facts can't simply be lines of text. To enable the extraction of actionable information (data mining), each of these facts must be recognizable and distinguishable.
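
A small, hypothetical contrast illustrates the point: the first record below is a master-file style line of text, while the second stores each fact as a named, distinguishable element that can be related to other facts.

    # Hypothetical contrast between a flat master-file record and a repository entry.
    master_file_line = "SMITH JOHN CARDIOLOGY NY 1998"   # facts buried in one string

    repository_entry = {                                 # each fact is named and relatable
        "last_name": "Smith",
        "first_name": "John",
        "specialty": "Cardiology",
        "state": "NY",
        "customer_since": 1998,
        "preferred_treatments": [],                      # a subjective, evolving fact
    }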

As the facts we store have become more complex, so have the requirements of recognizing, distinguishing and relating these facts at the point of acquisition. Just as we need neural networks to properly identify the facts for data mining, we need neural recognition processes to properly identify and integrate acquired data. These processes must use repositories of information, not simple validation tables or master files that use single keys. They must use sophisticated matching algorithms that take into account phonetics, abbreviations and subject-specific terminology. And finally they must use statistical probabilities backed up by manual review to match acquired data to repository data instead of simple character mapping.
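
As a rough sketch of what such matching can look like in code: the abbreviation table, names and thresholds below are hypothetical, and the standard-library similarity ratio stands in for a real phonetic or probabilistic model.

    import difflib

    # Hypothetical abbreviation table and thresholds; difflib's similarity ratio
    # stands in for a real phonetic or probabilistic matching model.
    ABBREVIATIONS = {"dr.": "doctor", "wm.": "william"}

    def normalize(text):
        return " ".join(ABBREVIATIONS.get(w, w) for w in text.lower().split())

    def match_probability(acquired, stored):
        return difflib.SequenceMatcher(None, normalize(acquired), normalize(stored)).ratio()

    score = match_probability("Dr. Jon Smyth", "Doctor John Smith")
    if score > 0.90:
        decision = "accept the match"
    elif score > 0.70:
        decision = "route to manual review"   # statistical match backed up by review
    else:
        decision = "treat as unmatched"
    print(round(score, 2), decision)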

Let's use the pharmaceutical industry as an example to demonstrate the importance of data cleansing in today's information management environment. The customer base of a typical pharmaceutical company has been very dynamic in the last decade. But the core component has always been made up of physicians, the people who prescribe the drugs. Naturally, the company wants to collect as much information about these physicians as possible. What are their specialties, their preferred treatments, their prescribing history and tendencies? The company will typically use many sources to collect this information, not the least of which are its own sales representatives. But today, physicians can be identified in many different ways: by name (which can vary with marital status), by American Medical Association (AMA) number, by several different Drug Enforcement Administration (DEA) numbers depending on how many practice locations are used, by several different State License Numbers (SLN) depending on how many states the physician practices in, etc. In addition, the source data may have been received in files from an association like the AMA or a State Licensing Board, keyed in by the sales representatives, visually scanned or entered via audio recognition devices.
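
A hypothetical record layout (all names and identifier values below are invented) helps show why a single physician can carry so many identifiers.

    from dataclasses import dataclass, field
    from typing import Dict, List, Optional

    @dataclass
    class Physician:
        names: List[str]                                      # current and prior names
        ama_number: Optional[str] = None                      # AMA identifier
        dea_numbers: List[str] = field(default_factory=list)  # one per practice location
        state_licenses: Dict[str, str] = field(default_factory=dict)  # state -> SLN

    physician = Physician(
        names=["Joan Smith", "Joan Smith-Jones"],             # name varies with marital status
        ama_number="0000001",
        dea_numbers=["AB0000001", "AB0000002"],
        state_licenses={"NY": "000001", "NJ": "000002"},
    )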

Regardless of the source, the information will be useless unless the data cleansing process used by this company includes the following (a code sketch of these pieces working together follows the list):

  1. A customer repository that contains all the multiple physician identifiers and the relationships between them.
  2. The ability to electronically distinguish which physician identifier(s) exist in the source information.
  3. The ability to electronically recognize the values of those identifiers in order to match them to values in the repository using sophisticated matching algorithms and statistical probabilities that take into account such things as phonetics.
  4. The ability to augment the purely electronic process with computer-aided manual processes to validate new values not previously contained in the repository.
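
Under these requirements, the skeleton of the matching step might look something like the sketch below; the repository contents, identifier values and threshold are hypothetical, and a simple similarity ratio again stands in for a more sophisticated matcher.

    import difflib

    # Hypothetical repository index: (identifier type, value) -> physician key.
    REPOSITORY_INDEX = {
        ("ama", "0000001"): "PHY-001",
        ("dea", "AB0000001"): "PHY-001",
        ("sln", "NY-000001"): "PHY-001",
    }
    REPOSITORY_NAMES = {"PHY-001": "Joan Smith"}

    def cleanse(source):
        # Requirement 2: distinguish which identifier(s) exist in the source record.
        for id_type in ("ama", "dea", "sln"):
            value = source.get(id_type)
            # Requirement 3: recognize the value and match it to the repository.
            if value and (id_type, value) in REPOSITORY_INDEX:
                return "matched", REPOSITORY_INDEX[(id_type, value)]
        # No identifier resolved: fall back to a probabilistic name match.
        name = source.get("name", "")
        for key, stored in REPOSITORY_NAMES.items():
            if difflib.SequenceMatcher(None, name.lower(), stored.lower()).ratio() > 0.85:
                return "needs review", key    # requirement 4: computer-aided manual review
        return "new or rejected", None

    print(cleanse({"dea": "AB0000001"}))      # ('matched', 'PHY-001')
    print(cleanse({"name": "Jaon Smith"}))    # likely routed to manual review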

From this example, it is easy to see that the greatest value derived from data cleansing is that it allows us to properly integrate acquired data. This subsequently allows us to employ tools, such as data mining technology, to deliver answers to our business issues or questions.
We have progressed from a time when the primary function of data cleansing was to prevent software from crashing, through an era when it was used simply as a way to place objective facts in proper columns, to the current age of subjective data recognition and integration.

Now, let's enlarge the scope of the data cleansing process to reflect today's demands and answer our original question: What does data cleansing entail? (A code sketch of this expanded process follows the list.)

  1. Establish a baseline set of values, a repository of previously accepted values and relationships.
  2. Audit acquired data against this repository using sophisticated matching algorithms and statistical probabilities. Research acquired elements that are found to match but vary slightly from repository values. Based on the research, either revise (clean) the acquired elements or add them to the repository as new values.
  3. Research any acquired data element that fails the audit using a combination of computer-aided and totally manual research techniques.
  4. Based on the research, either reject the data element or use it to enhance the repository.
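
Here is one possible sketch of the expanded loop's revise-or-add decision; the repository contents and thresholds are invented, and the similarity ratio is only a stand-in for the matching algorithms and statistical probabilities described above.

    import difflib

    REPOSITORY = {"Cardiology", "Oncology"}       # previously accepted values

    def similarity(a, b):
        return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def cleanse_value(value, accept=0.9, review=0.7):
        best = max(REPOSITORY, key=lambda stored: similarity(value, stored))
        score = similarity(value, best)
        if score >= accept:
            return best                           # revise (clean) to the repository spelling
        if score >= review:
            return research(value, candidate=best)    # computer-aided manual research
        return research(value, candidate=None)        # totally manual research

    def research(value, candidate):
        # Placeholder for the human step: a reviewer maps the value to the candidate,
        # accepts it as a new repository value, or rejects it outright.
        return candidate

    print(cleanse_value("Cardiolgy"))             # cleaned to 'Cardiology'
    decision = cleanse_value("Dermatology")       # low similarity: goes to manual research
    if decision is None:
        REPOSITORY.add("Dermatology")             # the reviewer enhances the repository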
