The most unpredictable part of building a data warehouse is identifying the content and structure of legacy data. Too many data warehousing projects have foundered upon the shoals of invalid and unusable source data.

The Problem

The problem is that most project teams don't discover fatal flaws in their source data until after they have invested time, money and programming resources building target data models, source-to-target mappings, transformation rules and extract feeds. The teams then scramble to rewrite these programs and start again. But their next round of validation checks again reveals unexpected problems with the source data. Some project teams cycle through this process many times and wind up never delivering the data warehouse!

One data warehousing manager with whom I recently spoke said his company canceled an earlier data warehousing project because of data quality problems. "We found 85 percent error rates and critical fields we couldn't fix," he said. "We got caught in a bad loop of having to rewrite our [extraction, transformation, load] programs. We had to yank the project."

Like most data warehousing developers, this manager and his team used manual methods to inspect source data. They checked legacy copybooks and ran random queries against their core database, which held 750,000 customer policies. Everything looked clean until it came time to transform and load the source data. But by then it was too late. They fell victim to the "code, load and explode" syndrome.

Search for Answers

Why are we still using cumbersome manual methods to identify the true nature of source data when we've automated just about every other phase of the data warehousing life cycle? Why are we relying on legacy copybooks, which are typically outdated and inaccurate, to tell us what's in the source data? Why are we reacting to source data problems instead of proactively addressing them?

Even Ralph Kimball, who along with several colleagues authored The Data Warehouse Lifecycle Toolkit, considered the bible for data warehousing managers, admits there is a problem. He writes: "The extract process always exposes data quality issues that have been buried within the operational source systems" (p. 35) and, later, "The source data variable is the one most likely to cause uncontrollable scope creep" (p. 55).

Unfortunately, Kimball and his colleagues offer no surefire remedy, other than to expect errors and delays and then manage them. Some may think data cleansing tools are the answer, but they don't go far enough. Most cleansing tools scrub data according to user-defined rules, which are based on users' assumptions about the data, not the data itself.

The Solution

What's needed is a tool that parses all elements in a source data file and checks them against all other elements in that file to create a comprehensive atlas of source data content, structure and quality. This atlas would expose all anomalies in the source data, surface unknown business rules and give data warehouse designers confidence that their data models, mappings, and extract and transformation programs will work the first time.

Fortunately, I found such a tool, and it's worth its weight in gold. I usually don't tout products in this column, but I'm going to make an exception. Evoke Software's Migration Architect should be part of every data warehousing manager's toolbox, along with data cleansing, extract/transformation and data modeling tools.

Migration Architect uses algorithms to profile data within a single column, compare data elements across columns in a single table and compare data elements across tables. Evoke calls these three steps column profiling, dependency profiling and redundancy profiling, respectively.

Column profiling infers detailed characteristics of each column including data types and size, range of values, frequency and distribution of values, cardinality, null values and uniqueness characteristics. Dependency profiling identifies primary keys and whether expected data relationships are supported in the data. Redundancy profiling infers foreign keys required to connect tables and shows which columns contain overlapping sets of values.
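
To make this concrete, here is a rough sketch in Python of the kind of statistics a column profiler gathers. The sample data and the type-inference heuristic are my own illustration, not Evoke's algorithms:

    # A minimal sketch of column profiling; the heuristics here are
    # illustrative, not Evoke's proprietary algorithms.
    from collections import Counter

    def profile_column(values):
        """Summarize one column: nulls, cardinality, size, type, distribution."""
        non_null = [v for v in values if v not in (None, "", "NULL")]
        freq = Counter(non_null)
        profile = {
            "rows": len(values),
            "nulls": len(values) - len(non_null),
            "cardinality": len(freq),                # distinct non-null values
            "unique": len(freq) == len(non_null),    # candidate key?
            "max_length": max((len(str(v)) for v in non_null), default=0),
            "all_numeric": all(str(v).replace(".", "", 1).isdigit()
                               for v in non_null),   # crude type inference
            "top_values": freq.most_common(5),       # value distribution
        }
        if profile["all_numeric"] and non_null:
            nums = [float(v) for v in non_null]
            profile["range"] = (min(nums), max(nums))
        return profile

    # Example: a policy status column with a stray numeric code
    print(profile_column(["A", "A", "C", None, "A", "C", "9"]))

Run against every column of a 750,000-row table, even a simple profile like this surfaces the nulls, outliers and stray codes that random queries miss.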

From there, Migration Architect can build a third normal form schema of your source data. You can then modify the schema and generate the DDL to create your data warehouse database. Finally, Migration Architect can map data from source to target schema. You can hand this mapping to your ETL programmers for coding.
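
As a hypothetical illustration of that last step, here is how inferred column profiles might be turned into DDL. The table name, columns and type-mapping rules below are my assumptions, not output from the product:

    # A hypothetical sketch of generating DDL from inferred column
    # profiles; names and type-mapping rules are assumptions.
    def generate_ddl(table, profiles):
        """Emit a CREATE TABLE statement from per-column profile dicts."""
        cols = []
        for name, p in profiles.items():
            sql_type = "NUMERIC" if p["all_numeric"] else f"VARCHAR({p['max_length']})"
            nullable = "" if p["nulls"] else " NOT NULL"  # no nulls seen -> NOT NULL
            cols.append(f"  {name} {sql_type}{nullable}")
        return f"CREATE TABLE {table} (\n" + ",\n".join(cols) + "\n);"

    profiles = {
        "policy_id": {"all_numeric": True, "max_length": 9, "nulls": 0},
        "status":    {"all_numeric": False, "max_length": 1, "nulls": 3},
    }
    print(generate_ddl("policy", profiles))

The point is that the target schema is derived from what the data actually contains, not from what a copybook claims it contains.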

Benefits

The benefit of Migration Architect is that it helps ensure that your target schema is fully supported by the source data. Migration Architect surfaces missing data elements and identifies which elements need to be cleaned and transformed and how. This prevents the endless rewriting of ETL programs and provides a predictable deployment schedule.

The manager quoted earlier eventually purchased Migration Architect to build a sales report that had been under development for two years but had never been released because of data quality problems. In the first hour of using Migration Architect, the team discovered that the history of 25,000 policies had been inexplicably truncated. Working part-time over six weeks, the team fixed all the data quality problems and rolled out the sales report.

The real beauty of Migration Architect, however, is that it provides a vehicle for business and systems analysts to discuss and understand the data. While looking at the data together, a systems analyst might ask a business analyst, "Every employee is supposed to have a unique identifier, but in 15 percent of the cases, this is not true. What's happening?" Conversely, the business analyst might ask the systems analyst, "Why are social security numbers stuck in a field that is supposed to contain birthdays?" This type of dialogue between business and systems experts is absolutely critical to unearth the true content and structure of source data.
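
Both of those questions fall out of simple mechanical checks. Here is a sketch of the kind of tests that would surface them; the column names and the social security number pattern are assumed for illustration:

    # A sketch of the checks behind the two questions above; column
    # names and the SSN pattern are illustrative assumptions.
    import re
    from collections import Counter

    def duplicate_rate(ids):
        """Fraction of rows whose supposedly unique identifier is duplicated."""
        counts = Counter(ids)
        dupes = sum(n for n in counts.values() if n > 1)
        return dupes / len(ids)

    def looks_like_ssn(value):
        """Flag values shaped like social security numbers (123-45-6789)."""
        return bool(re.fullmatch(r"\d{3}-\d{2}-\d{4}", str(value)))

    employee_ids = ["E01", "E02", "E02", "E03", "E04", "E04", "E05"]
    birthdays = ["1961-04-12", "123-45-6789", "1975-09-30"]

    print(f"duplicate identifiers: {duplicate_rate(employee_ids):.0%}")
    print("SSNs in birthday field:", [v for v in birthdays if looks_like_ssn(v)])

The tool runs the checks; the people decide what the anomalies mean.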

Going through source data row by row is the hard work of building a data warehouse. Because it's physically impossible to do this for large databases, most companies take shortcuts and suffer the consequences later. Although the product is pricey ($200,000+) and not very flashy, Evoke's Migration Architect helps automate the analysis of source data, minimizing the unexpected data quality problems that can sabotage a data warehousing project.
