I'm still amazed at how many data warehousing projects gloss over the all-important step of scrutinizing source data before designing data models and ETL mappings. For example, one company designed its data model and extract, transform and load (ETL) scripts around a source database that appeared to contain 11.5 million customers, only to realize later that it actually held just 1.5 million distinct customer records. This forced the firm to rework its target models at great expense.
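A single profiling query, run before any modeling began, would have exposed the discrepancy. Here is a minimal sketch in standard SQL, assuming a source table named customer keyed by customer_id (both names are illustrative, not from the actual system):

```sql
-- Compare the raw row count against the number of distinct customer keys.
SELECT COUNT(*)                    AS total_rows,
       COUNT(DISTINCT customer_id) AS distinct_customers
FROM customer;
```

When total_rows dwarfs distinct_customers, the source is riddled with duplicates, and the target model must account for that before a line of ETL code is written.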
Code, Load and Explode!
Many teams make the mistake of thinking that they understand their source data because they work with the transaction system that generates it. They are in such a rush to begin the project that they trust their assumptions about the source data rather than undertake the traditionally tedious and time-consuming task of interrogating the data. They simply fire off a few SQL queries against a sample of the data to validate its contents, and then they start coding in earnest.
However, these teams often get halfway through their project only to discover that the source data doesn't behave according to their assumptions! The teams find that character fields contain numbers, the gender field has five distinct values, invoices reference nonexistent customers, sales orders have negative values and so on.
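Every one of these defects can be caught up front with simple profiling queries. A sketch of three such checks, using hypothetical table and column names (customer, invoice, sales_order):

```sql
-- Cardinality check: how many distinct gender codes exist, and how common is each?
SELECT gender, COUNT(*) AS frequency
FROM customer
GROUP BY gender
ORDER BY frequency DESC;

-- Referential integrity check: invoices that reference nonexistent customers.
SELECT COUNT(*) AS orphaned_invoices
FROM invoice i
LEFT JOIN customer c ON i.customer_id = c.customer_id
WHERE c.customer_id IS NULL;

-- Range check: sales orders with negative amounts.
SELECT COUNT(*) AS negative_orders
FROM sales_order
WHERE order_amount < 0;
```

Queries like these are cheap to run against the full source rather than a sample, which is exactly where the surprises hide.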
Usually, these teams then consult with a subject matter expert and create new rules to fix this "bad" data. They run the process again and the same thing happens - they discover new data defects and again have to stop and rewrite the ETL code. As schedules balloon, costs escalate and tensions rise, they get stuck in an endless loop caused by undiscovered errors that pervade their data sources. One data warehousing manager has described this pernicious cycle as "code, load and explode."
Data Quality Obstacles
According to data warehousing professionals, the top two challenges in implementing ETL tools are "ensuring adequate data quality" and "understanding source data."1 Most data warehousing professionals learn the hard way that fixing unanticipated defects and anomalies in source data is the primary reason for project delays, overruns and failures.
Yet, the "code, load and explode" phenomenon is not confined to data warehousing; it afflicts any project that tries to integrate data from multiple systems, including customer relationship management systems, supply chain applications and data migration initiatives. Most of the high-profile "failures" in these areas occur because organizations underestimate the data quality problems in their source data and the time and resources it takes to fix them.
The Limits of Current Practices
Discovering Errors Too Late. Despite the growing publicity about the impact of poor quality data on data integration initiatives, most organizations take only minimal steps to understand - or "profile" - the data they want to integrate. In fact, according to a TDWI survey, almost two-thirds (62 percent) of organizations identify data quality problems after the fact, when users complain about errors in the new system (see Figure 1). When users discover data defects in a system, they lose trust in it. If the quality problems are severe and persistent enough, the system can fail from lack of use.
Figure 1: How Do You Determine the Quality of Data?
Most organizations rely on user complaints and SQL queries to identify data defects. Fewer than one-fifth (17 percent) of organizations use data profiling tools. Based on 640 respondents; from "Data Quality and the Bottom Line," TDWI Report Series, p. 21, www.tdwi.org/research.
Unreliable Meta Data. Part of the problem is that organizations rely too heavily on system catalogs (i.e., meta data) to provide an accurate description of the contents of source systems. These catalogs are notoriously unreliable because users often co-opt existing fields to enter new types of information.
For instance, telemarketers may insert cross-sell response codes into a birthday field because the marketing department wants to capture this data and there is no other place for them to enter it. In other cases, programmers who designed the system never documented its contents, or they left the company, taking their knowledge with them. Additionally, where catalogs do exist, administrators rarely keep them up to date.
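Actual values, not the catalog, tell the real story. One way to surface a co-opted field is pattern analysis: collapse every value to a character pattern and count the patterns. A sketch, assuming the birthday field is a character column named birth_date in a customer table (illustrative names); TRANSLATE is supported by several major databases, though argument order varies - this version uses the Oracle/PostgreSQL form:

```sql
-- Map digits to '9' and letters to 'a', then count pattern frequencies.
-- Genuine dates ('1985-03-12') collapse to '9999-99-99'; cross-sell
-- response codes stand out as foreign patterns.
SELECT value_pattern, COUNT(*) AS frequency
FROM (
  SELECT TRANSLATE(LOWER(birth_date),
                   '0123456789abcdefghijklmnopqrstuvwxyz',
                   '9999999999aaaaaaaaaaaaaaaaaaaaaaaaaa') AS value_pattern
  FROM customer
) p
GROUP BY value_pattern
ORDER BY frequency DESC;
```

A dominant date-shaped pattern plus a long tail of code-shaped ones is the signature of a field serving two masters.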
Also, data entry mistakes and poor validation routines allow errors to enter the system. Because many errors don't affect day-to-day operational processes, managers assume the data is accurate and consistent when it isn't. They only discover these "latent" errors when they try to integrate supposedly identical fields in two or more systems and nothing matches up.
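The mismatch shows up the moment you measure key overlap between the systems. A hedged sketch, assuming each system has a customer table keyed by customer_id and that the key is unique in both (all names illustrative):

```sql
-- How many customer keys in the CRM system find a match in billing?
SELECT COUNT(*)             AS crm_customers,
       COUNT(b.customer_id) AS matched_in_billing
FROM crm_customer a
LEFT JOIN billing_customer b
  ON a.customer_id = b.customer_id;
```

A match rate far below 100 percent on fields the catalogs describe as identical is the first concrete evidence of latent errors.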
Manual Profiling. Another reason that analysts fail to discover data quality problems in advance is that they use manual methods to profile the contents of source systems. According to the previously referenced TDWI survey, a majority of organizations (57 percent) issue SQL queries to analyze source data. This usually means that they simply sample data in a few key fields to get a sense of what those columns contain.
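A typical spot check of this kind looks something like the following sketch (hypothetical table name and value list). Note that it can only confirm or refute what the analyst already suspects:

```sql
-- Hypothesis: state codes outside this list are invalid.
-- Anomalies the analyst never thought to test for go unnoticed.
SELECT state, COUNT(*) AS frequency
FROM customer
WHERE state NOT IN ('CA', 'NY', 'TX')
GROUP BY state;
```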
This hypothesis-driven approach takes too long and is error-prone. It depends on the analyst's skill at writing accurate SQL and on knowing the business and its systems well enough to anticipate what to look for. The approach is like drilling for oil without a geologic map - there's little chance you'll find a deposit even if it's right underneath your feet!
The Answer: Automated Profiling Tools
To better understand source data and minimize risk in data integration projects, smart organizations are abandoning manual methods in favor of automated data profiling tools that take much of the guesswork out of finding problem data. These tools are now available from a variety of data quality vendors, including Ascential Software, DataFlux, Evoke Software, Firstlogic, Informatica and Trillium Software (which recently purchased Avellino Technologies).
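At their core, these tools automate the kinds of checks shown earlier - null rates, cardinality, ranges, patterns - across every column of every table, rather than the handful an analyst thinks to sample. A rough sketch of the per-column statistics such a tool computes, expressed in SQL against a hypothetical sales_order table:

```sql
-- One column's profile: completeness, cardinality and range.
-- A profiling tool generates and runs this for every column automatically.
SELECT COUNT(*)                     AS total_rows,
       COUNT(order_amount)          AS non_null_values,
       COUNT(DISTINCT order_amount) AS distinct_values,
       MIN(order_amount)            AS min_value,
       MAX(order_amount)            AS max_value
FROM sales_order;
```

The difference is coverage: exhaustive, repeatable analysis of the full source instead of a few hand-picked samples.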