Continue in 2 seconds

According to Mr.Inmon, data cleansing is most important phase for data warehousing.

By
  • Joe Oates, Sid Adelman, Chuck Kelley, Scott Howard
Published
  • July 02 2003, 1:00am EDT
More in

 

 

Q:

According to Mr.Inmon, data cleansing is most important phase for data warehousing. He says that it.s almost 60 percent of the whole DWH project. Do you have any special methodology to handle the data cleansing process?

A:

Sid Adelman's Answer: Take a look at Business Intelligence Roadmap by Larissa T. Moss and Shaku Atre.

Chuck Kelley's Answer: I don't have any special methodology, but I have three types of data cleansing processes: 1) poor man's cleansing, 2) getting it 80 percent right, and 3) perfection. Poor man's cleansing is basically doing the best you can. Maybe taking postal code and state and standardizing the city/county data. Getting it 80 percent is working on the data to make it really clean, but not spending a lot to do it. Perfection is spending wildly to get every piece of data perfect.

Which do you choose? Whichever one makes the smartest investment of time and money. Yes, perfection should be the goal, but the cost should be taken into account. What if you are a mailing house and you send a piece of junk mail to the same person twice. What is the cost of that versus the cost of cleaning to perfection? It is a simple case of "do the math."

Scott Howard's Answer: Data cleansing is truly critical to the success of any solution. I don't know if I agree with Mr. Inmon.s generic proclamation, but I understand his intention to emphasize data cleansing's importance. Is there a single methodology? No, because there is no single form of dirty or compromised data or single cause. The best methodology is to recognize the cause and attack the problem at the source. Given that and the fact that all source issues can not be prevented, you need to identify the most critical cleansing issues for your own individual business and seek a solution for those specific issues.

Many vendors provide tools specific to some data cleansing needs. The most full featured is probably Ascential Software's Integrity. Ascential's definition, available from their Web site, give you the most commercially complete definition of a data cleansing methodology. However, read it cautiously as it maps directly to Integrity's capabilities. Weigh whether all the points and features promoted apply to your individual needs or problems. You don't want to compile a checklist full of non-applicable attributes.

Joe Oates' Answer: Certainly data quality, or more specifically the lack thereof, is a serious issue when planning and building a data warehouse project. Unfortunately, management in too many organizations does not understand what a big problem this can be.

Rather than try to outline a methodology, I will refer you to two articles and an example of a data cleansing project plan that I think will help you. The first article is by Larry English. Larry is one of the leading authorities on data quality. This article is found at http://www.dmreview.com/master.cfm? NavID=55&EdID=1669. The second article is by Larissa Moss and is found at http://www.dmreview.com/master.cfm? NavID=55&EdID=828.

An example of a data cleansing project plan is found at DataCleansingApproachD42FINA LRev1.doc. I hope that this helps.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access