Clean Up Your Data
InfoManagement Direct, December 2004
Nobody should have dirty data. After all, dirty data leads to dirty thinking. It intrudes on your ability to quickly and easily assemble information. It poses problems of use, problems of integration and problems of credibility - your credibility. If your data is dirty, you will be judged not only on its level of cleanliness, but on your own value.
Imagine for a moment that you have a dog that sleeps on the living room couch, and that you have invited your grandma to dinner and offered her a seat on that same couch. Now, if that dog is dirty, and if your grandma is like mine, you have a topic of conversation that will haunt you at that dinner and at many future dinners to come.
Dirty data is even worse than dirty dogs on the living room couch. It gets into everything, and we find ourselves cleaning up all the hidden little messes for ages. Both outcomes are particularly embarrassing when the guests arrive, and both outcomes can be avoided through a moderate amount of preventive cleansing.
So I say, let's embrace the fact that we need to clean up our data dogs, and do it as soon as possible. Don't let them get up on the couch to share quality time with us until we've really hosed those data dogs down and made them living room worthy for grandma!
Cleaning Your Data Dogs
Just as washing a dog takes industrial brushes, gloves and tools to get into all those unmentionable places, data requires specialized tools to get it clean. And just as with the dog, we often think of cleansing not only as the process of cleaning up to a conformed state that we can appreciate, i.e., getting rid of the stink and grime, but also as adding to the end product so that we get more information and pleasure from it. How many dogs get a bath without having their collar and dog tags placed back around their necks? And many will also get a good brushing, too. Cleaning data is similar in practice to cleaning the dog, but the tools for cleaning data are a little more complex.
The term "cleansing" is a little misleading. Like the dog, your data has lots of extra bits hanging off of it. In many cases, what we want to do is not just clean off the dirty bits but add things that will help the data conform and adapt.
Before doing anything more with our data dog, we need to simply identify the dog. With both the dog and the data, we need to add a unique identifier. This identifier lets us know what we are looking at, including where it came from and when. With the dog, we call this the license tag. With the data, we would call this the "key."
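For the technically inclined, here is a minimal sketch of what hanging a "license tag" on a record might look like. Everything in it is an assumption made for illustration: the `TaggedRecord` class, the `tag` helper and the source names are hypothetical, and a real system would more likely let a database or cleansing tool issue its keys.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from itertools import count


@dataclass(frozen=True)
class TaggedRecord:
    key: int             # surrogate key: the data's "license tag"
    source: str          # which system the record came from
    loaded_at: datetime  # when we received it
    payload: dict        # the original record, left untouched


# Simple in-memory counter standing in for a real key generator (assumption).
_next_key = count(1)


def tag(payload: dict, source: str) -> TaggedRecord:
    """Attach a unique key plus where-and-when provenance to a raw record."""
    return TaggedRecord(next(_next_key), source, datetime.now(timezone.utc), payload)


alex = tag({"name": "Alex", "species": "dog"}, source="kennel_intake")
rex = tag({"name": "Rex", "species": "dog"}, source="web_form")
```

The point of the sketch is only that the key and the provenance travel with the record from the moment it arrives, before any scrubbing begins.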
Now, if we want to take that dog to the next level where it will co-mingle and compete with other dogs, we need to provide even more conformed information to an externally referenced resource, such as to a particular kennel or to a governing agency such as the American Kennel Club (AKC). Just bringing my dog, Alex, to a dog show in his normal, filthy state would get us turned away. I would be told that not only do his odor, upkeep and manners disqualify him, but if I cannot provide the extra information they need, such as AKC registration, breed and provenance, local licensing and certificate of health, then I cannot bring him in to mix with the other dogs.
Cleansing data in this case has more to do with conforming what is there and adding more to it, so we can recognize our data for its uniqueness.
Data cleansing tools apply business rules across data feeds that lead to a standardized and conformed view of business data regardless of the systems they come from.
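To make that a little more concrete, here is a rough sketch of business rules being applied across two feeds. It is not how any particular cleansing product works. The rule functions, field names and sample records are all made up for illustration, and the name rule in particular is far too crude for real names.

```python
import re


# Hypothetical business rules: each takes a record and returns a cleaned copy.
def tidy_name(rec):
    out = dict(rec)
    # Collapse stray whitespace, then apply simple title casing.
    out["name"] = " ".join(rec["name"].split()).title()
    return out


def digits_only_phone(rec):
    out = dict(rec)
    # Strip every non-digit so all phone formats collapse to one form.
    out["phone"] = re.sub(r"\D", "", rec.get("phone", ""))
    return out


RULES = [tidy_name, digits_only_phone]


def conform(rec):
    """Run one record through every rule, in order."""
    for rule in RULES:
        rec = rule(rec)
    return rec


# Two source systems describing the same person in different ways.
sales_feed = [{"name": "  jane   DOE ", "phone": "(555) 010-2000"}]
support_feed = [{"name": "Jane Doe", "phone": "555.010.2000"}]

conformed = [conform(r) for feed in (sales_feed, support_feed) for r in feed]
```

After the rules run, both records carry the same name and phone values, which is the "standardized and conformed view" in miniature: the same rules, applied to every feed, whatever system it came from.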
Data Cleansing in the Real World: It's not Just for the Dogs
Stepping away from intangible dogs, let's focus on a more practical issue of data management where cleansing will be an important part of your work.
Say your organization has people associated with it. Perhaps you sell things to people. Perhaps you hire people to sell things. Perhaps you hire people to support people to sell things to people. Perhaps the people you sell things to talk to each other via you. And let's just say that you want to know who all of these people, and all the combinations of these people, are. This is a classic example of where you are definitely going to want to come clean upfront, and where a cleansing tool comes in handy.
People are dirty... from a data point of view. Assuming that the "you" in the above example deals with tens, hundreds and possibly thousands of thousands of people on a regular basis, this scenario is likely to leave you dazed and confused for a while. This is especially true as it would be very hard to imagine that there is just one system of record capturing all these people and their associations with your organization in a way that would let you find a set of unique persons. More likely, there are at least three or four separate systems collecting data about these people. And it is just as likely that data entry into these systems does not follow consistent business rules.
Another complication is how the data is entered and who enters it. In some cases, your customers themselves may enter personal information such as "name" into a Web form with no checks; in others, a poorly paid or poorly trained employee is just trying to meet a quota. Only in accounting systems such as payroll or benefits does one really find a motivation to get those people right at entry - and "right" itself is a very shaky concept. The point here is that multiple systems of record and multiple points of data entry inevitably lead to inconsistency across the same data. Ergo, dirty data!
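As a small illustration of the problem, consider one person who turns up in two of three systems under slightly different spellings. The sketch below groups records by a crude match key to flag likely duplicates. The system names, fields and the `match_key` function are assumptions invented for this example, and real cleansing tools use far more forgiving matching than this.

```python
from collections import defaultdict


def match_key(rec):
    """Crude match key: the letters of the name, lowercased, plus a tidied email."""
    name = "".join(ch for ch in rec["name"].lower() if ch.isalpha())
    return (name, rec["email"].strip().lower())


# The same customer as seen by three hypothetical systems of record.
records = [
    {"system": "web_form", "name": "jane doe", "email": "JDoe@example.com"},
    {"system": "payroll", "name": "Jane Doe", "email": "jdoe@example.com "},
    {"system": "support", "name": "John Smith", "email": "jsmith@example.com"},
]

# Group the systems that appear to be describing the same person.
groups = defaultdict(list)
for rec in records:
    groups[match_key(rec)].append(rec["system"])

# Any key seen in more than one system is a candidate duplicate.
duplicates = {key: systems for key, systems in groups.items() if len(systems) > 1}
```

Here the Web form and payroll entries land in the same group even though no two fields match character for character, which is exactly the kind of inconsistency that multiple points of entry produce.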