Why do we get wrong answers when we combine two operational databases that were both known to have perfectly adequate data quality before they were joined? Why do different users of the same database have totally different perspectives on the quality of its data? Why is it that data, once cleaned, doesn't seem to stay clean even though the record was never modified? Recently a reporter asked me, "Why hasn't the industry solved the data quality problem yet? It seems so straightforward." As in most human endeavors, if you don't understand the root of the problem, your solutions will treat symptoms or at best provide only temporary relief. Despite enormous advances in technological remedies, the data quality problem seems more pervasive and tenacious than ever – like bacteria that have grown resistant to antibiotics.

We can blame the Internet, the sheer explosion of data volumes, uncontrolled data entry from end users and customers, disparate content from external or poorly documented sources, or the torrent of new business needs that old data must satisfy. We would be right on all counts, but we'd still be missing the key to this dilemma. To paraphrase a favorite British playwright, the fault, dear Brutus, lies not in the data, but in ourselves.

Understanding the Misunderstanding

Most professionals associated with IT activities understand that meta data is not the data itself, but merely a description of the data – a description that is rarely complete, current and reliable (although that's an issue for a future article). We know meta data to be an abstraction and generalization of the thing that it describes; and we know severe problems arise when decisions based on meta data are invalidated by the realities of the underlying data values. Stepping back one level on the abstraction scale, why do we forget that the same set of issues applies to data as well?

Data has no intrinsic semantic reality; its very existence is predicated on the assumption that it represents some fact, object, attribute, event, etc. This is an assumption born of the necessity for social discourse, but plagued by the ambiguities of human language and the variability of human behavior and perceptions. In short, we repeatedly forget that data is not the thing it describes, but only a representation. As such, a single data value may be "true" to just one point of view at one moment in space and time.

The disciplines of general semantics and information theory tell us that a statement about a person, thing or event may be useful for drawing inferences, making predictions and facilitating human commerce. However, such a statement leads to erroneous conclusions with potentially deadly consequences when we take it for absolute truth or as a proxy for the identity of the thing itself. How many people have been killed by a gun described as "unloaded" because they trusted the description rather than examining the reality itself? Sometimes the fallacy arises because of shifts in space and time. The gun may have previously been unloaded, but the operator was not aware of a recent loading event. Or, the unloaded gun was moved to a different location and replaced with one of similar appearance.

Outlaws Among Us

What does this have to do with data quality, you ask? Everything. The erroneous belief that data is either right or wrong (and that, if it's wrong, it can be cleansed into a state of absolute and permanent correctness) is the ultimate violation of the laws of general semantics. This simple misconception, still widely held by IT, business management and end users, is responsible for the bulk of the billions of dollars in project failures, cost overruns, delayed ROI on enterprise initiatives and loss of customer loyalty and revenue. The land of data quality is overrun with outlaws – people who break the rules of linguistic semantics by misunderstanding the fundamental relationship between words (data values) and reality, and who then blame the words for the resulting misinformation.

One Data Sin Begets Another

From this first-order "original sin of data quality," there arises a series of second-order misdemeanors committed by IT and business users on a daily basis. These violations include the assumptions that data has a dualistic state – it's either right or wrong; that data can and should be standardized and validated; and that data, once finally fixed, will retain its quality. These second-order fallacies occupy the attention of most traditional data cleansing activities. However, all too often these solutions just treat the symptoms, which is why the process fails to serve multiple user communities and fails to have lasting effect. Understanding the inherent misconceptions here will enable more cost-effective deployment of data cleansing technologies.

The assumption that a value is either correct or incorrect, clean or dirty, is an oversimplification of the reality these values represent. Generalization distorts reality as does attributing absolute "correctness" to a data value. "Cleansing" dirty data often just moves the value to a different position on the continuum of relative representations. Add a new set of users, a new business usage or just wait a few months, and you're likely to find the data value is once again "lacking in useful data quality" and needs to be "fixed" yet again. Data quality is relative to the time and the context of its initial capture or subsequent fixing.

For example, a customer's telephone number and address were validated when recorded in the database; but recently the town renumbered houses to better facilitate 911 emergency response, and the local phone company reassigned area codes to handle increased demand. The data is still accurate as a legal representation of the original business transaction but is no longer useful as a means of contacting the customer. These are examples of the temporal relativity of data quality. Spatial dimensions also affect the experience of data quality. National boundaries can determine the format for expressing a date, currency conversion factors, even the interpretation of "local" time – not trivial challenges given the global reach of today's "always-on" users.

There Is No Standard in Data

The choice of a standard is determined by the purpose for which the thing being measured is used (politics and prejudice notwithstanding). For example, what is the "standard" unit of measure for describing the dimensions of a piece of furniture? Feet and gross fractions of an inch are sufficient for the sales catalog or retail store, but the factory may need to calibrate machinery to thousandths of a centimeter. Thus there are multiple correct answers to the question, "What is the size of the tabletop?" What about the multiple phrases that find their way into an address field – are they good candidates for standardizing? Sure, but the post office favors a P.O. box for mail delivery, whereas freight and parcel delivery companies insist on a building-number type of address. Imposing a single standard upon a class of data values will work fine for some purposes but result in unusable data for other purposes. Data quality is relative to the use and user.

Names are even more problematic. Marriage, divorce and spiritual conversions are some of the reasons individuals will change or augment their names. In many cultures, it is quite common for a person to have a multitude of names and to vary the usage of those names based on their feelings or perceptions of the social context. The result is that you can neither predict nor constrain how persons will choose to represent themselves from one interaction to the next. Multiple variations will exist, and none are fundamentally more or less correct. There is no standard in data.
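
To make the address example concrete, here is a minimal sketch in Python of judging the same value against the needs of two different uses. The rule names ("postal_mail", "freight_delivery"), the Address type and the P.O. box test are illustrative assumptions, not rules drawn from any postal or carrier specification.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Address:
    line: str  # free-form delivery line, exactly as captured
    city: str


def is_po_box(addr: Address) -> bool:
    # Crude, assumed test: strip periods and spaces, then look for "POBOX".
    normalized = addr.line.upper().replace(".", "").replace(" ", "")
    return normalized.startswith("POBOX")


# Hypothetical fitness rules, one per use. Neither is "the" standard.
FITNESS_RULES: Dict[str, Callable[[Address], bool]] = {
    "postal_mail": lambda a: bool(a.line.strip()),
    "freight_delivery": lambda a: bool(a.line.strip()) and not is_po_box(a),
}


def fit_for(use: str, addr: Address) -> bool:
    """Report whether an address is usable for the named purpose."""
    return FITNESS_RULES[use](addr)


po_box = Address("P.O. Box 123", "Springfield")
street = Address("45 Mill Road", "Springfield")

for use in FITNESS_RULES:
    print(use, fit_for(use, po_box), fit_for(use, street))
```

The point of the sketch is that the question "is this address good?" has no answer until a use is named: the P.O. box passes one rule and fails the other without the value itself changing.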

The Four Thieves of Data Reality

These second-order data quality fallacies give rise to a set of human behaviors that rob the corporate world of its ability to properly confront and resolve data quality realities. Thus far, too many IT professionals and business executives still underestimate what bad data is costing them and what's required to achieve acceptable data quality levels. These thieves are denial, deception, deflection and deferral – and their actions serve to bury the true problem until it explodes into view, resulting in lost customers, discredited or failed IT deliverables, legal exposure or soaring economic liabilities.

Denial arises when people assume that old data will adequately serve new uses without the need for a quality reengineering process. Just because the data was okay for some previous purpose does not guarantee its suitability for a new system or new set of users, particularly when that data will now be combined with multiple other disparate data sources, each with its own legacy characteristics.

Deception occurs when users assume that their new "enterprise" application software will magically solve the problem itself. Just because the new enterprise resource planning (ERP) or customer relationship management (CRM) system provides a data model and presentation layer designed to support an enterprise-wide view of customers, suppliers, parts, etc., doesn't mean (contrary to implied marketing hype) that it can spin flax into gold. These systems currently lack the necessary data quality technologies to accomplish the heavy pick-and-shovel work needed to reengineer legacy data. Additionally, in some cases they lack the functional sophistication to "serve-up" multiple "versions of the truth" based on varying user contexts – they too still suffer from the overly simplistic view that there can be a single absolute representation of a customer or event.

Deflection is the strategy for shifting responsibility to another party (i.e., it's somebody else's problem). IT may claim it's a user problem. The new user will point back to IT or to the original owners/creators of the data – it's a data entry problem. Or, everybody may expect that "everything" gets done by the system integrator being paid to implement the new system – in which case you'd better look closely at the contract's fine print!

Lastly, deferral is the strategy for shifting responsibility to a later point in time – after the current parties have hopefully declared victory and quickly moved on in their careers. This "fix-it-later-after-implementation" strategy is naive at best; downstream corrections cost at least 10 times more than those made at implementation. In some cases, the data can't be corrected at all, due to inflexible data model assumptions or loss of end-user support, in which case the system is abandoned.

Awareness is the Driver; Technology is the Vehicle

Is there any hope of achieving data quality? Absolutely. Once you understand the root problem, it's "relatively" easy to understand what is needed to achieve a lasting solution.

First, you (and the entire IT, analyst and media community) must be evangelistic in promoting the message that data quality is a relative state measured by the "fitness for use" – a measure that will vary depending on the users and their intended usage. There will be multiple versions of the truth. An enterprise-wide view of essential business entities such as people, places and things doesn't mean a single instance, just a single key linking all the past and present representations and the contexts in which that data was created. Continuing education will also stress that data quality is not a one-time "clean-up" – it's a continuous process of real-time filters and reengineering functions. With this awareness in place, denial, deception, deflection and deferral will be banished – your environment becomes receptive for real data quality automation.
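
One way to picture "a single key linking all the past and present representations" is the following Python sketch. The class names, fields and sample values are hypothetical, chosen only to illustrate the idea; they are not taken from any particular product or data model.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Representation:
    value: str         # the data value as captured
    source: str        # system or interaction that captured it (assumed field)
    purpose: str       # the use it was captured for (assumed field)
    captured_on: date  # when it was captured


@dataclass
class EntityRecord:
    entity_key: str  # the single key linking every representation
    representations: List[Representation] = field(default_factory=list)

    def add(self, rep: Representation) -> None:
        # Nothing is overwritten; older representations are kept.
        self.representations.append(rep)

    def as_of(self, purpose: str, when: date) -> Optional[Representation]:
        """Latest representation for a purpose, captured on or before `when`."""
        candidates = [
            r for r in self.representations
            if r.purpose == purpose and r.captured_on <= when
        ]
        return max(candidates, key=lambda r: r.captured_on, default=None)


customer = EntityRecord("CUST-0001")
customer.add(Representation("Margaret Smith", "order entry", "billing", date(1999, 5, 1)))
customer.add(Representation("Margaret Jones", "call center", "billing", date(2003, 8, 15)))
customer.add(Representation("Peggy Jones", "web signup", "marketing", date(2003, 9, 2)))

current_billing = customer.as_of("billing", date(2005, 1, 1))
earlier_billing = customer.as_of("billing", date(2001, 1, 1))
```

Here both billing names remain retrievable, each "true" for its own time and context, while the marketing name lives alongside them under the same key.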

Second, you're ready to implement technology that will have lasting impact because you now recognize the complexities that need to be solved and the sophistication of the solution that will be required. You will need technologies for extracting, profiling, transporting, transforming, parsing, standardizing, indexing, fuzzy matching and reformatting. You will have to maintain multiple representations and understand their data lineage and context so that, when answering queries or delivering results, your systems will serve up the appropriate version of the truth for the particular user or usage. Lastly, these technologies must work together efficiently to handle the massive volumes that will increasingly tax your hardware capacities.
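
As a rough illustration of two of the steps named above, standardizing and fuzzy matching, the sketch below uses only Python's standard library. The abbreviation table and the 0.85 threshold are arbitrary assumptions for demonstration; production tools rely on far richer rules and matching methods.

```python
import re
from difflib import SequenceMatcher

# Assumed, deliberately tiny abbreviation table for illustration only.
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}


def standardize(text: str) -> str:
    """Lowercase, drop punctuation, and expand a few known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def similarity(a: str, b: str) -> float:
    """Similarity ratio between 0.0 and 1.0, computed on standardized text."""
    return SequenceMatcher(None, standardize(a), standardize(b)).ratio()


def is_probable_match(a: str, b: str, threshold: float = 0.85) -> bool:
    # The threshold is a judgment call; the right value depends on the use.
    return similarity(a, b) >= threshold


print(standardize("45 Mill Rd."))
print(is_probable_match("45 Mill Rd", "45 Mil Road"))
print(is_probable_match("45 Mill Road", "9 Harbor Avenue"))
```

Note that the threshold is itself a "fitness for use" decision: a bulk mailing can tolerate looser matches than a high-dollar transaction can.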

As you might imagine, costs vary greatly depending on the complexity of the data and affected business processes. In general, costs escalate in proportion to the amount of risk and/or reward inherent in underlying usage transactions. In other words, high-visibility or high-dollar transactions will necessitate, and justify, very high data quality – which, in turn, is achieved with a higher cost of automation. Building these solutions in house or as a turnkey deliverable is not economically feasible as a long-term total cost of ownership (TCO) strategy.

Fortunately, costs for software license fees are competitive, and some vendors will minimize your cash flow with pay-as-you-go transactional pricing models. Labor may be your most expensive cost. This is particularly true for data that lacks a history of automation, such as product catalogs or external supplier data. You can minimize costs of achieving interoperability by sourcing components from vendors with enterprise data integration suites. Don't forget that customized business rules will add to the effort. One size does not fit all, and you will not get satisfaction from the shrink-wrapped "data-cleansers" designed to prepare bulk mailings for postal discounts.

Given the specialized nature of this "low-level" data work, some companies will outsource the setup to systems integrators or vendors with proven expertise. Rationalizing your data and configuring the software to automate your enterprise integration requirements will generally take several weeks and account for 10 to 15 percent of your total costs. For some very data-intensive sites, where the values are particularly complex and inconsistent and the number of sources extremely high, the process can take several months and consume a higher proportion of your system deployment costs.

With a program combining awareness of data realities with today's sophisticated technology, there is light at the end of the proverbial tunnel. Companies can indeed achieve the critical levels of data quality required to gain competitive advantage in the marketplace and to make the most of their enterprise applications that are dependent on the quality of underlying data. In the final analysis, we all need to take that first critical step of recognizing what the true root problems are as we begin to seriously address data quality issues. In the words of a famous American cartoonist, the late Walt Kelly, "We have met the enemy and he is us."
