For those who make their living working with and leveraging structured data in enterprise applications, the stone-cold reality of today’s landscape is this: corporate data is inherently imperfect.


This may not seem like a revelation to those in the trenches who understand the nature of enterprise databases and how data actually makes its way into these applications. Yet organizations of all sizes make business-critical decisions every day using data riddled with imperfection, duplication and inaccuracy - often without understanding its harmful effects. In an ideal world, all the structured data that companies use to operate and make critical decisions would be perfect. In the real world, it just doesn't work that way.


For those companies that understand the inherent imperfection of structured corporate data, many different approaches are taken and significant resources devoted to cleaning and standardizing data. While some companies are successful at significantly improving the quality of their data, it is practically impossible to reach a point where all structured data is perfect and stays perfect as it is used and updated.


The simple fact is that imperfect data needs to be made usable despite its inherent imperfections. Doing so creates opportunities to use data to benefit the business without wasting valuable time and resources throwing "solutions" at the problem that simply don't work. Ultimately, companies that are able to achieve this will realize more value from working with imperfect data, as the associated risks are mitigated and costs are reduced.


The Root of the Problem


Before addressing the problem of data imperfection, it’s important to understand the root of the problem in order to design a solution that works.


In the past, before corporate data was stored and managed by databases, organizations such as enterprises, hospitals and government agencies had departments dedicated to managing paper-based records and files - let’s call it “the old-fashioned way.”


When dealing with these files, humans were on hand to file and manage all of the data within individual records. They also were able to recognize the natural variations and nuances that occurred within the data. For instance, if one record had a patient listed as Stefanos Damianakis and another had written Stafano Damianekis, a person filing the data would interpret this inconsistency and determine whether the two were referring to the same person.
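Software can approximate this human judgment by scoring similarity instead of demanding equality. As a minimal sketch in Python (using the standard library's difflib; the scores and thresholds shown are illustrative, not drawn from any particular matching product):

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Score two names from 0.0 (nothing in common) to 1.0 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# The two spellings above are far from an exact match, yet score highly,
# while an unrelated name scores low:
name_similarity("Stefanos Damianakis", "Stafano Damianekis")  # well above 0.8
name_similarity("Stefanos Damianakis", "John Smith")          # well below 0.5
```

A filing clerk makes exactly this kind of graded judgment; a plain equality test cannot.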


Fast forward to 2008 and the landscape looks completely different. In the modern corporate environment, structured data is managed entirely via database applications, which by design only recognize exact matches. This can be enormously efficient because humans can’t match the speed of a computer no matter how quickly they’re rifling through information. But, with this speed comes a significant limitation - only exact matches are possible.
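The exact-match limitation is easy to see with an ordinary keyed lookup. A hypothetical patient index in Python behaves just like a database key lookup:

```python
# Hypothetical patient index keyed on the exact spelling of the name.
patients = {"Stefanos Damianakis": {"patient_id": 1042}}

print("Stefanos Damianakis" in patients)  # True: exact spelling is found
print("Stafano Damianekis" in patients)   # False: the variant spelling misses entirely
```

The variant spelling is obviously the same person to a human reader, but to the exact-match lookup it simply does not exist.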


As another example, consider a hospital that maintains a database of patients, many of whom have visited multiple times. On one of these visits, the hospital employee charged with looking up patient data is unable to find the person's name in the database, having accidentally misspelled the name during the search. As a result, the employee creates another entry in the database - a duplicate record that lacks vital information about the patient's history. What happens when the patient then returns for a critical procedure, but the duplicate record is used to provide the doctor with his or her information? What if the person is allergic to a particular anesthesia? Suddenly imperfect data goes from a simple business issue to a literal life-or-death situation.


When it comes to recognizing and dealing with inconsistencies and errors in structured data sources, the previous example is where traditional rules-based systems break down. How is the software going to function correctly despite these variations in enterprise data? The speed delivered by a computer program may seem desirable at the outset of implementation, but without a way to handle variation it will ultimately create more problems for the organization in the long run.


The most common method companies use to address the problem of imperfect data is to implement technology that requires the creation of manual rules (with or without a probabilistic component) in order to electronically fix the problem and ensure that a computer can deal with specific inconsistencies. The problem with these solutions, however, is that the rules are static, require an inordinate amount of guesswork to develop and only address a known problem at a specific point in time. When new data arrives in the database that fails to meet the criteria encoded in these rules, history repeats itself and the problem of inherently imperfect data manifests itself all over again. Additionally, rules-based solutions are difficult, if not impossible, to adjust and update over time, because of the educated guesswork required to tune the parameters of every single rule.
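A sketch of what such a rules-based matcher looks like, and why it is brittle - the rules below are invented for illustration, not drawn from any real product:

```python
# A hand-written rules-based matcher: each rule encodes one known,
# specific inconsistency that someone anticipated in advance.
def rules_match(a: str, b: str) -> bool:
    a, b = a.strip().lower(), b.strip().lower()
    if a == b:
        return True                                   # rule 1: exact match
    if a.replace("-", " ") == b.replace("-", " "):
        return True                                   # rule 2: hyphen variants
    if a.split() == list(reversed(b.split())):
        return True                                   # rule 3: reversed name order
    return False

rules_match("Mary-Ann Jones", "Mary Ann Jones")           # True: rule 2 anticipated this
rules_match("Jones Mary", "Mary Jones")                   # True: rule 3 anticipated this
rules_match("Stefanos Damianakis", "Stafano Damianekis")  # False: no rule covers misspellings
```

Every new variation that arrives demands another hand-tuned rule; the misspelled pair above slips through until someone notices it and writes yet another rule.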


We’re back at square one.


Mathematics Holds the Key


Because rules-based probabilistic solutions have been in use for so long, there is a commonly held belief that there isn't a better way to approach the problem of data matching. This is patently untrue. The key lies in applying mathematical modeling, which can successfully emulate the decision-making ability of a human being - mimicking the staff members who once identified matches in imperfect data and took the necessary steps to rectify the problem.


By applying mathematical modeling to the matching of structured data, companies can eliminate the guesswork and manual labor inherent in the probabilistic rules-based solutions typically implemented by organizations. Sophisticated algorithms and machine-learning techniques based on mathematical modeling can be engineered so that, when new data and sources are added into the equation, the computer correctly identifies and pinpoints the variations that lead to record duplication and other matching inconsistencies. In effect, the computer behaves more like a human being in overcoming data imperfections.
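One way to picture that learning step is as fitting a decision boundary to examples labeled by human experts. The sketch below is a deliberately simplified stand-in - a single string-similarity feature and a learned cutoff, with invented names - rather than a production matching engine:

```python
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    """0-1 similarity between two strings, case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def learn_threshold(labeled_pairs):
    """Choose the similarity cutoff that best reproduces the human
    match / non-match labels -- a stand-in for fitting a real model."""
    scored = sorted((sim(a, b), is_match) for a, b, is_match in labeled_pairs)
    best_t, best_correct = 0.5, -1
    for t, _ in scored:  # candidate cutoffs: the observed scores
        correct = sum((s >= t) == m for s, m in scored)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t

# Human experts label a few example pairs (invented data):
labeled = [
    ("Stefanos Damianakis", "Stafano Damianekis", True),
    ("Jon Smith", "John Smith", True),
    ("Jane Doe", "John Smith", False),
    ("Acme Corp", "Apex Ltd", False),
]
t = learn_threshold(labeled)

# A previously unseen variation is classified with no new hand-written rule:
sim("Katherine Jones", "Katharine Jones") >= t  # True
```

The expert supplies decisions, not rules; the model extracts the boundary between match and non-match from those decisions, which is precisely the guesswork that rules-based systems force onto the implementer.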


It’s this human element that is absolutely critical to and the chief differentiator of mathematical modeling.


Overcoming these fundamental and perpetual imperfections in structured data requires the ability to extract and replicate knowledge from domain experts - human beings. Sometimes human expertise is difficult, if not impossible, to extract and automate with explicit rule sets. An example of something that humans do very easily, without even thinking about it, is recognizing speech. But can any human being on the planet generate a set of rules that explains the steps to take (i.e., the algorithm) to recognize speech? Of course not - it’s not possible. So how can we make software that performs tasks - via rules - that are impossible for humans to express?


What mathematical modeling sets out to achieve is to emulate the decision that was reached rather than the process the human used to reach it. It's the one method - more so than any probabilistic rules-based solution - that has come closest to actually reaching this goal. By implementing a mathematical modeling solution to a company's data matching problem, inordinate amounts of time and guesswork are saved, and matches are made correctly in a fraction of the time that manually scrubbing and standardizing imperfect data would take.


The Bottom Line


What mathematical modeling ultimately achieves is giving the users of structured data, whether they be in hospitals, government agencies or enterprises, the freedom to find, match and link the information that they need quickly and efficiently. This may not seem like a business-critical endeavor, but when companies are relying on this information to run their business, it suddenly takes on an entirely new meaning.


It all comes down to this - if you’re building a business based on the quality of structured data, how much data imperfection can be tolerated? Can you sleep at night knowing that the information you’re using to run your operation on a day-to-day basis may hold inconsistencies and errors that can lead to revenue loss and missed opportunities? How can you rest knowing that any imperfection in data could manifest itself at any time?


The problem of data imperfection is one that many C-level executives don’t realize they have and many IT departments are simply unprepared to deal with. As enterprise applications grow more complex, it’s a problem that will only be exacerbated in the coming years. By applying mathematical modeling to emulate human decision-making and judgment, organizations can learn to work with imperfect data rather than spending countless man-hours in standardization that may never solve the problem. In this sense, they will not only be prepared to deal with problems that may be caused by imperfect data in their current IT investments, but they will also be prepared to overcome these challenges as they are presented in the future of their organization.
