For those who make their living working with structured data in enterprise applications, the stone-cold reality of today's landscape is this: corporate data is inherently imperfect.
This may not seem like a revelation to those in the trenches who understand the nature of enterprise databases and how data is actually collected into these applications. Yet organizations of all sizes are negatively impacted by imperfection, duplication and inaccuracy in the data they use to make business-critical decisions every day, without understanding the harmful effects of this data. In an ideal world, all the structured data that companies use to operate and make critical decisions would be perfect. But in the real world, it just doesn't work that way.
Companies that do understand the inherent imperfection of structured corporate data take many different approaches and devote significant resources to cleaning and standardizing it. While some succeed in significantly improving the quality of their data, it is practically impossible to reach a point where all structured data is perfect and stays perfect as it is used and updated.
The simple fact is that imperfect data needs to be made usable despite its inherent imperfections. This in turn creates opportunities to use data to benefit the business without spending valuable time and resources on solutions that simply don't work. Ultimately, companies that are able to achieve this will realize more value from working with imperfect data, as the associated risks are mitigated and costs are reduced.
The Root of the Problem
Before addressing the problem of data imperfection, it's important to understand its root in order to design a solution that works.
In the past, before corporate data was stored and managed by databases, organizations such as enterprises, hospitals and government agencies had departments dedicated to managing paper-based records and files. Let's call it the old-fashioned way.
When dealing with these files, humans were on hand to file and manage all of the data within individual records. They were also able to recognize the natural variations and nuances that occurred within the data. For instance, if one record listed a patient as Stefanos Damianakis and another read Stafano Damianekis, the person filing the data would notice the inconsistency and determine whether the two referred to the same person.
Fast forward to 2008 and the landscape looks completely different. In the modern corporate environment, structured data is managed entirely via database applications, which by design recognize only exact matches. This can be enormously efficient, because humans can't match the speed of a computer no matter how quickly they're rifling through information. But with this speed comes a significant limitation: a record that differs from the search term by even a single character is not found.
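To make the limitation concrete, here is a minimal sketch in Python that compares the two spellings from the filing example. It is an illustration only, not a description of how any particular database works: the function names are hypothetical, and the similarity score comes from the standard library's difflib.SequenceMatcher, chosen here simply as one readily available way to measure how alike two strings are.

```python
from difflib import SequenceMatcher


def exact_match(a: str, b: str) -> bool:
    """Mimic an exact-match lookup: any difference means no match."""
    return a == b


def similarity(a: str, b: str) -> float:
    """Return a score between 0.0 and 1.0 describing how alike two strings are.

    Hypothetical helper for illustration; case is ignored before comparing.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


name_on_file = "Stefanos Damianakis"
name_as_typed = "Stafano Damianekis"

# The exact comparison rejects the pair outright.
print(exact_match(name_on_file, name_as_typed))

# The similarity score shows the names are close, which is the signal
# a human clerk would have picked up on.
print(round(similarity(name_on_file, name_as_typed), 2))
```

The exact comparison answers only yes or no, while the score leaves room for a judgment call, much like the clerk deciding whether two paper records belong to the same person.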
As another example, consider a hospital that maintains a database of patients, many of whom have visited multiple times. On one of these visits, the hospital employee charged with looking up patient data is unable to find the person's name in the database, because he or she accidentally misspelled the name in the search process. As a result, the employee creates another entry in the database, resulting in a duplicate record that lacks vital information about the patient's history. What happens when the patient then returns for a critical procedure, but the duplicate record is used to provide the doctor with his or her information? What if the person is allergic to a particular anesthesia? Suddenly imperfect data goes from a simple business issue to a literal life-or-death situation.
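The lookup step in this scenario can be sketched in a few lines of Python. Everything here is hypothetical and for illustration only: the record layout, the sample patients, the find_candidates helper and the 0.8 cutoff are all assumptions, and the approximate search relies on the standard library's difflib.get_close_matches rather than on any technique named in this article.

```python
from difflib import get_close_matches

# Hypothetical patient records keyed by name; the contents are made up.
records = {
    "Stefanos Damianakis": {"visits": 4, "allergies": ["anesthesia"]},
    "Maria Lopez": {"visits": 1, "allergies": []},
}


def find_candidates(name, records, cutoff=0.8):
    """Return names on file that closely resemble the name as typed.

    The cutoff is an assumed threshold; a real system would need to tune it.
    """
    lowered = {key.lower(): key for key in records}
    hits = get_close_matches(name.lower(), list(lowered), n=3, cutoff=cutoff)
    return [lowered[hit] for hit in hits]


typed = "Stafano Damianekis"

# An exact lookup misses the patient, which is what prompts the duplicate entry.
print(typed in records)

# An approximate lookup surfaces the existing record for the employee to confirm.
print(find_candidates(typed, records))
```

Showing likely matches before a new entry is created keeps the decision with the employee, while reducing the chance that a simple typing error splits a patient's history across two records.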
When it comes to recognizing and dealing with inconsistencies and errors in structured data sources, the previous example is where traditional rules-based systems break down. How is the software going to function correctly despite the differences in enterprise data? The speed delivered by a computer program may seem desirable at the outset of implementation, but without a way to handle those differences it will ultimately create more problems for the organization in the long run.









