"Haste makes waste" is as true in data warehousing as it is in life. The faster we try to go, the longer it takes to arrive at our destination. When we rush, we fail to plan sufficiently and make careless mistakes.
I'm still amazed how many data warehousing projects gloss over the all-important step of scrutinizing source data before designing data models and ETL mappings. For example, one company designed its data model and extract, transform and load (ETL) scripts around a database that contained 11.5 million customers, only to realize later that it actually held only 1.5 million distinct customer records. This forced the firm to rewrite its target models at great expense.
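A check as simple as comparing raw row counts against distinct key counts would have caught that mistake before any models were built. The sketch below illustrates the idea with an in-memory SQLite table and hypothetical names (`customers`, `customer_id`); any source database would do:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")

# Simulate a source where the same customer appears on many rows.
rows = [(i % 3, f"Customer {i % 3}") for i in range(10)]
cur.executemany("INSERT INTO customers VALUES (?, ?)", rows)

# Raw row count vs. count of distinct customer keys.
total = cur.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
distinct = cur.execute(
    "SELECT COUNT(DISTINCT customer_id) FROM customers"
).fetchone()[0]
print(f"{total} rows, {distinct} distinct customers")
```

Running this kind of query against every candidate key field is cheap; redesigning a target model after the fact is not.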
Code, Load and Explode!
Many teams make the mistake of thinking that they understand their source data because they work with the transaction system that generates it. They are in such a rush to begin the project that they trust their assumptions about the source data rather than undertake the traditionally tedious and time-consuming task of interrogating the data. They simply fire off a few SQL queries against a sample of the data to validate its contents, and then they start coding in earnest.
However, these teams often get halfway through their project only to discover that the source data doesn't behave according to their assumptions! The teams find that character fields contain numbers, the gender field has five distinct values, invoices reference nonexistent customers, sales orders have negative values and so on.
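Each of those surprises can be expressed as a simple assertion against the source. The sketch below, using hypothetical table and column names, flags the kinds of defects described above - a gender field with too many values, invoices pointing at nonexistent customers and negative sales amounts:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customers (customer_id INTEGER, gender TEXT);
CREATE TABLE invoices (invoice_id INTEGER, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'M'), (2, 'F'), (3, 'U'), (4, '0'), (5, 'M ');
INSERT INTO invoices VALUES (100, 1, 250.0), (101, 9, 75.0), (102, 2, -40.0);
""")

# 1. How many distinct values does the gender field really hold?
gender_values = [r[0] for r in cur.execute(
    "SELECT DISTINCT gender FROM customers")]

# 2. Invoices that reference nonexistent customers (orphan foreign keys).
orphans = cur.execute("""
    SELECT COUNT(*) FROM invoices i
    LEFT JOIN customers c ON i.customer_id = c.customer_id
    WHERE c.customer_id IS NULL
""").fetchone()[0]

# 3. Sales with negative values.
negatives = cur.execute(
    "SELECT COUNT(*) FROM invoices WHERE amount < 0").fetchone()[0]

print(len(gender_values), orphans, negatives)
```

The trap, of course, is that teams in a hurry never think to write these assertions until after the code has exploded.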
Usually, these teams then consult with a subject matter expert and create new rules to fix this "bad" data. They run the process again and the same thing happens - they discover new data defects and again have to stop and rewrite the ETL code. As schedules balloon, costs escalate and tensions rise, they get stuck in an endless loop caused by undiscovered errors that pervade their data sources. One data warehousing manager has described this pernicious cycle as "code, load and explode."
Data Quality Obstacles
According to data warehousing professionals, the top two challenges in implementing ETL tools are "ensuring adequate data quality" and "understanding source data."1 Most data warehousing professionals learn the hard way that fixing unanticipated defects and anomalies in source data is the primary reason for project delays, overruns and failures.
Yet, the "code, load and explode" phenomenon is not confined to data warehousing; it afflicts any project that tries to integrate data from multiple systems. These projects include customer relationship management systems, supply chain applications and data migration initiatives. Most of the high-profile "failures" in these areas stem from organizations underestimating the data quality problems in their source data and the time and resources it takes to fix them.
The Limits of Current Practices
Discovering Errors Too Late. Despite the growing publicity about the impact of poor quality data on data integration initiatives, most organizations take only minimal steps to understand - or "profile" - the data they want to integrate. In fact, according to a TDWI survey, almost two-thirds (62 percent) identify data quality problems after the fact when users complain about errors in the new system (see Figure 1). When customers discover data defects in a system, they lose trust in it. If the quality problems are severe and persistent enough, the system can fail from lack of use.
Figure 1: How Do You Determine the Quality of Data?
Most organizations rely on user complaints and SQL queries to identify data defects. Less than one-fifth (17 percent) of organizations use data profiling tools. Based on 640 respondents from "Data Quality and the Bottom Line," TDWI Report Series, p. 21, www.tdwi.org/research.
Unreliable Meta Data. Part of the problem is that organizations rely too heavily on system catalogs (i.e., meta data) to provide an accurate description of the contents of source systems. These catalogs are notoriously unreliable because users often co-opt existing fields to enter new types of information.
For instance, telemarketers may insert cross-sell response codes into a birthday field because the marketing department wants to capture this data and there is no other place for them to enter it. In other cases, programmers who designed the system never documented its contents, or they left the company, taking their knowledge with them. Additionally, where catalogs do exist, administrators rarely keep them up to date.
Also, data entry errors and poor validation routines permit errors to enter the system. Because many errors don't affect day-to-day operational processes, managers assume the data is accurate and consistent when it isn't. They only discover these "latent" errors when they try to integrate supposedly identical fields in two or more systems and nothing matches up.
Manual Profiling. Another reason that analysts fail to discover data quality problems in advance is because they use manual methods to profile the contents of source systems. According to the previously referenced TDWI survey, a majority of organizations (57 percent) issue SQL queries to analyze source data. This usually means that they simply sample data in a few key fields to get a sense of what the data is like in those columns.
This hypothesis-driven approach takes too long and is error prone. It depends on the analyst's skill in writing accurate SQL and in knowing the business and its systems well enough to anticipate what to look for. This approach is like drilling for oil without a geologic map - there's little chance you'll find a deposit even if it's right underneath your feet!
The Answer: Automated Profiling Tools
To better understand source data and minimize risks in data integration projects, smart organizations are abandoning manual methods in favor of automated data profiling tools that take much of the guesswork out of finding and identifying problem data. These tools are now available from a variety of data quality vendors such as Ascential Software, DataFlux, Evoke Software, Firstlogic, Informatica and Trillium Software (which recently purchased Avellino Technologies).
Unlike user-generated SQL queries, data profiling tools scan every record in every column and table in a source system. Instead of just spitting out a list of data values, data profiling tools generate reports full of statistics and charts that make it easy to understand everything you ever wanted to know about your data. Thus, a data profiling tool is more likely to expose new or unanticipated structures and values in the data than manual profiling methods.
In general, data profiling tools enable users to better understand: 1) the structure and meta data of target systems, 2) the range and distribution of values in each column, and 3) relationships between columns in one or more tables, including primary/foreign key relationships. Also, many data profiling tools let analysts drill down from summary views of the data to actual data values to get a better sense of what records may or may not be skewing the summary results.
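The column-level statistics these tools produce can be sketched in a few lines. The example below is a simplified illustration - not any vendor's actual output - that scans every value in each column and reports row count, null count, distinct count, min/max and the most frequent values:

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL)")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "WEST", 100.0), (2, "EAST", 250.0), (3, None, -10.0),
    (4, "WEST", 80.0), (5, "west", 120.0),
])

def profile_column(table, column):
    """Scan every value in a column and summarize it, as a profiling tool would."""
    values = [r[0] for r in cur.execute(f"SELECT {column} FROM {table}")]
    non_null = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "min": min(non_null),
        "max": max(non_null),
        "top_values": Counter(non_null).most_common(3),
    }

report = {col: profile_column("orders", col)
          for col in ("order_id", "region", "amount")}
print(report["region"])
```

Even this toy profile surfaces the kind of anomaly that sampling misses: the region column quietly holds both "WEST" and "west," and one row is null.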
The benefits of data profiling tools are enormous. One user from a high tech firm said a data profiling tool enabled his team to evaluate 100 percent of the data - 60 million records, 22 tables and 500 columns - in a matter of days, compared to less than half the data in three to four weeks using manual methods. Moreover, he said the data profiling tool generated substantially more information about the data, accelerating the process of analyzing the data and creating appropriate data cleansing rules.
Another company, the Automobile Club of Southern California (ACSC), credits a data profiling tool with saving a data warehousing project that was plagued with data quality problems and mired in the "code, load and explode" phenomenon. According to its data warehousing manager, the data profiling tool enabled ACSC to analyze two mainframe systems in six weeks with 100 percent accuracy, something it had not successfully done in 10 months using manual methods.
Maintaining Accuracy in Existing Systems
In addition to analyzing data prior to integration projects, companies also use profiling tools to audit the cleanliness of existing databases. For example, a marketing manager at one company periodically "profiles" a marketing database to ensure it has no missing values or duplicate records. This regular audit gives her confidence to rely on the data to analyze and execute marketing campaigns.
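Such a recurring audit can boil down to two queries run on a schedule. The sketch below uses a hypothetical marketing table to count missing email values and duplicate records:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE prospects (email TEXT, name TEXT)")
cur.executemany("INSERT INTO prospects VALUES (?, ?)", [
    ("a@example.com", "Ann"), ("b@example.com", "Bob"),
    ("a@example.com", "Ann"), (None, "Carl"),
])

# Records with no email value.
missing = cur.execute(
    "SELECT COUNT(*) FROM prospects WHERE email IS NULL").fetchone()[0]

# Extra copies of records that share an email address.
duplicates = cur.execute("""
    SELECT COALESCE(SUM(n - 1), 0) FROM (
        SELECT COUNT(*) AS n FROM prospects
        WHERE email IS NOT NULL
        GROUP BY email HAVING COUNT(*) > 1
    )
""").fetchone()[0]
print(missing, duplicates)
```

When both counts stay at zero from one audit to the next, the manager has concrete grounds for trusting the campaign data.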
Auditing Third-Party Feeds
Many organizations are also using data profiling tools to examine the accuracy and contents of data feeds submitted by syndicated content providers, customers, suppliers or affiliated agents. For example, a mortgage provider uses a profiling tool to compare its term sheets (rates and plans) against those being used by its partners. It runs the profiling tool against its partners' databases across a wide-area network to ensure they're running the most up-to-date term sheet. In addition, some firms use data profiling tools to examine syndicated data to ensure it provides accurate data that is worth purchasing and can be integrated into their existing data models.
Exposing Inconsistent Business Processes
Some organizations use reports generated by data profiling tools to convince executives to standardize key business processes. The reports depict data inconsistency across business units, where each unit uses different organizational structures and product codes. The reports provide top executives with concrete evidence that nonstandard business processes will make it difficult to deliver a single version of the truth, a key strategic objective.
In many ways, data profiling tools provide insurance for data integration projects: they mitigate the risks posed by poor quality data and give you confidence that you can meet budgets and schedules as planned. More importantly, profiling tools enable you to deliver "trustworthy" data to end users so they can make accurate and effective decisions.
The Data Quality Process
Although data profiling is an important step in managing data quality, it is not the only one. Organizations that want to deliver high-quality data must not only analyze the data, but clean and monitor it on an ongoing basis. In other words, data profiling tools must be implemented within the context of a total data quality program.
The 2002 TDWI report titled, "Data Quality and the Bottom Line," describes a nine-step methodology for ensuring high-quality data.2 The first three steps focus on organizational issues, such as launching a data quality program, developing a project plan and building a data quality team. The next three steps involve assessments - analyzing business practices, data architecture and source data (i.e., data profiling). The final steps entail taking action based on knowledge gained in prior steps: clean the data, change business processes and monitor data quality. Thus, data profiling is an integral part of a data quality methodology.
In a data warehousing project, data profiling and cleansing processes should be integrated with ETL tools. The profiling tools pass meta data to cleansing tools, which fix errors and spit out a clean file to the ETL tool for final processing. Again, this level of integration requires close coordination between data quality and ETL tools and is best achieved if a vendor owns both sets of tools.
Evaluating Data Profiling Tools
Once you understand how to analyze, clean and monitor data, you may wonder which data profiling tool you should purchase. The good news is that today there are more data profiling tools on the market than there were several years ago. However, not all data profiling tools work the same way, so you will need to carefully evaluate their capabilities.
Ideally, you should prototype a tool by running it against a subset of your data that you know well. This will help you understand whether the tool captures key characteristics in the data and how easy it is to use.
To help you in your quest, following are a few data profiling characteristics you should evaluate:
Ease of use. Can a business analyst use the tool without having to rely on someone from IT to run the software and generate the reports? The use of wizards, side-by-side panels that display summary and detail data, and visual workflows may indicate that the tool is appropriate for business users.
Collaboration. Good profiling tools educate the wider community about the nature of the data that makes up a new or existing application. That means the tools should be able to generate easy-to-read reports in various formats (e.g., HTML, Word, Excel) that quickly communicate data quality issues to relevant constituents.
Direct connectivity to sources. Data profiling tools should be able to connect and query source systems directly to ensure that analysts are working with up-to-date values. Data profiling tools that connect directly should also be configurable so that administrators can throttle back the number of query processes or schedule tasks to run during off hours to avoid bogging down operational systems. These direct-connect tools should also be able to access and query multiple databases simultaneously so analysts can assess and correlate data values across systems. Finally, the tools should be able to access both relational and non-relational data sources and run asynchronously from client machines.
Copy data to a repository. A data profiling tool should be able to copy data to a mid-tier repository and perform data analysis there. This is valuable when source system owners severely restrict access to source systems, when up-to-date data isn't required or might actually create consistency problems, and/or when the profiling tool needs to perform complex joins among data values from multiple heterogeneous systems. If the data volumes are large, this approach will consume a significant amount of network bandwidth and disk space on the mid-tier server, but it may be preferable to trying to perform the analysis directly on the source systems.
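Under those assumptions - restricted source access and no need for live data - the repository approach amounts to extracting once and analyzing locally. A minimal sketch, with both "source" and "repository" played by in-memory SQLite databases:

```python
import sqlite3

# Hypothetical source system whose owners restrict query access.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE accounts (id INTEGER, balance REAL)")
source.executemany("INSERT INTO accounts VALUES (?, ?)",
                   [(1, 10.0), (2, -5.0), (3, 0.0)])

# Mid-tier repository: copy the data once, then profile it offline
# without issuing any further queries against the source.
repo = sqlite3.connect(":memory:")
repo.execute("CREATE TABLE accounts (id INTEGER, balance REAL)")
rows = source.execute("SELECT id, balance FROM accounts").fetchall()
repo.executemany("INSERT INTO accounts VALUES (?, ?)", rows)

# All profiling queries now hit the repository copy.
negative = repo.execute(
    "SELECT COUNT(*) FROM accounts WHERE balance < 0").fetchone()[0]
print(negative)
```

The trade-off named above is visible even here: the copy is only as fresh as the last extract.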
Generate rules for cleansing. It is not enough to identify data defects; analysts must be able to create rules to fix the problems. Ideally, analysts should be able to create rules within a data profiling tool, which can then pass the rules to a data cleansing tool for processing. In other words, data profiling tools should be tightly integrated with data cleansing tools to make the most efficient use of analysts' time.
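A "rule" handed from profiling to cleansing can be as simple as a value mapping discovered during analysis. The sketch below is hypothetical - the mapping and field are illustrative, not any tool's format - and shows a rule that normalizes the five gender codes a profiling run might have surfaced:

```python
# Hypothetical mapping derived from a profiling run that found five
# distinct gender codes where two were expected.
cleansing_rule = {"M": "M", "F": "F", "MALE": "M", "FEMALE": "F"}

def apply_rule(value):
    """Normalize a raw gender code; unknown codes become None for review."""
    return cleansing_rule.get(value.strip().upper())

raw = ["M", "female", " f ", "U", "X"]
cleaned = [apply_rule(v) for v in raw]
print(cleaned)
```

The point of tight profiling-to-cleansing integration is that the analyst who discovers the five codes can hand off this mapping directly, rather than re-describing the defect to an ETL programmer.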
Integration with third-party applications. To automate routine data integration or migration processes such as data warehousing loads, data profiling tools should be integrated with data cleansing tools, and these should be embeddable within data integration processes, such as those defined by ETL tools. The profiling and cleansing jobs should be initiated and managed by these third-party applications.
Offer broad functionality. Data profiling tools should offer a comprehensive set of functionality for analyzing data structures, generating statistics about column values and mapping dependencies within or among tables.
Price. Organizations are reluctant to purchase multiple tools to perform data integration projects. One reason many organizations rely on manual profiling is that they can't justify making a capital expenditure when they can use internal resources (i.e., programmers or systems analysts) to profile data manually. Given this context, data profiling tools need to provide excellent value for the investment. Ideally, they come bundled within a suite of data quality or data integration tools so customers need only purchase a single tool for analyzing, cleansing, transforming and loading data in migration projects.
As the "profile" of data profiling tools increases, they are becoming an invaluable part of every organization's data integration toolbox. Data profiling tools quickly and cost-effectively ferret out inconsistencies and errors in source data in advance of critical modeling and mapping exercises. Armed with a data profiling tool, your team gains greater confidence that it can meet project deadlines and deliver trustworthy data.
During the next several years, we will see data profiling integrated more tightly within data cleansing and ETL tools. This will consolidate and streamline the process of detecting and fixing data quality problems in target databases and enable organizations to focus more on addressing business issues than data.
1. See "Evaluating ETL and Data Integration Platforms," TDWI Report Series, 2003, page 20. www.tdwi.org/research.
2. To download this report, visit: http://www.tdwi.org/research/display.asp?id=6064.