Building a successful data warehouse depends not only on choosing the right hardware and software, but also on understanding and verifying your data before you begin your project. By adding a data validation step during the design phase of data warehouse projects, companies can actually save time during the coding and testing of warehouse reports. Time savings can be realized in areas such as: discovering data that is missing or the wrong type; discovering differences between data and the documented business rules; and doing "what if" analysis on a sample of the data.

Continental Airlines contracts with EDS to provide support for its data warehouse projects. Various groups within Continental have requested that a Customer Information System (CIS) data warehouse be constructed to help them better understand the needs of their customers. Continental will leverage the data in this warehouse across several projects. This article examines the role of data analysis in enhancing the success of two of those projects.

The data for the data warehouse is being taken from an Oracle production database on a Sequent NUMA-Q 2000. When the data warehouse is completed, it will reside in a Teradata database on an NCR machine with Redundant Array of Independent Disks (RAID) level 5 storage. A Prism product will extract, migrate and scrub the data. All of the On-line Analytical Processing (OLAP) decision support will be done in the Teradata data warehouse.

One of the projects that EDS is working on for Continental is a Customer Operational System (COS) to analyze individual customer travel patterns. Prior to the creation of COS, Continental had access to data organized according to travel destination and origin rather than an individual traveler's itinerary. Continental feels that understanding each customer's travel patterns will help it compete more effectively in the deregulated environment.

An EDS team led by John Brinker, project leader for CIS, started the design of the COS project with a Joint Application Development (JAD) session where the business rules were discussed and the database design and procedures were planned. Before beginning to code the procedures, the team decided to take an added step up front: data validation. For this phase of the project, EDS used Migration Architect from DBStar.

The project required consolidation of data from multiple diverse data sources, including DB2, VSAM, IMS and Oracle. Migration Architect offers automated analysis of large amounts of data that otherwise could not be analyzed economically within the scope of a typical project.

Data was brought into Migration Architect in two ways: as flat files from the mainframe databases, with the copy libraries supplying the meta data, and directly from an Oracle database. Migration Architect has a built-in connection to Oracle databases that allows the user to access tables directly, which greatly reduces data extraction effort. Migration Architect reads the header information and stores the meta data it contains for comparison during the analysis. Once this information is in place, a comparison is made between the meta data and the actual data.
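
As a rough illustration of these two load paths, the following Python sketch uses pandas in place of Migration Architect itself; the copybook field widths, file and table names, and connection string are all hypothetical.

```python
# Minimal sketch of the two load paths: a fixed-width mainframe extract laid
# out per the copy library, and a direct read from the Oracle production
# database. All names and the connection string are illustrative assumptions.
import pandas as pd
import sqlalchemy

# Field layout as documented in the copy library: (field name, width)
copybook_layout = [
    ("ticket_number", 13),
    ("origin", 3),
    ("destination", 3),
    ("fare_amount", 9),
]

# Path 1: flat file from the mainframe
flat = pd.read_fwf(
    "cos_extract.dat",
    widths=[width for _, width in copybook_layout],
    names=[name for name, _ in copybook_layout],
)

# Path 2: direct connection to the Oracle source tables
engine = sqlalchemy.create_engine(
    "oracle+oracledb://user:password@prodhost:1521/?service_name=PROD"
)
oracle_rows = pd.read_sql("SELECT * FROM cos_tickets", engine)
```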

EDS used the domain profile feature of the product to analyze and validate the data. This comparison is called a domain profile attribute summary and it includes:

  • Total rows (records) reviewed
  • The data type observed from the data
  • The null rule observed from the data
  • Number of distinct domain values
  • The percent of distinctness of the domain values for an attribute
  • The minimum and maximum values of each attribute (column)
  • A detailed comparison between the documented meta data and the observed data
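
As a rough illustration of what such a profile contains, the following Python sketch derives similar statistics for a pandas DataFrame; the column names and documented meta data are invented for the example and do not represent Migration Architect output.

```python
# Minimal domain-profile sketch: for each attribute, report row count,
# observed type, observed null rule, distinct count, percent distinct,
# min/max, and whether the observed type matches the documented type.
import pandas as pd

documented = {  # documented (expected) meta data per attribute; assumed values
    "ticket_number": {"type": "int64", "null_rule": "NOT NULL"},
    "refund_amount": {"type": "float64", "null_rule": "NULLABLE"},
}

def domain_profile(df, documented):
    rows = []
    for col in df.columns:
        s = df[col]
        distinct = s.nunique(dropna=True)
        observed_type = str(s.dtype)
        rows.append({
            "attribute": col,
            "total_rows": len(s),
            "observed_type": observed_type,
            "observed_null_rule": "NOT NULL" if s.notna().all() else "NULLABLE",
            "distinct_values": distinct,
            "pct_distinct": round(100.0 * distinct / max(len(s), 1), 2),
            "min": s.min(),
            "max": s.max(),
            "type_matches_documented": documented.get(col, {}).get("type") == observed_type,
        })
    return pd.DataFrame(rows)

tickets = pd.read_csv("tickets_extract.csv")  # hypothetical extract file
print(domain_profile(tickets, documented).to_string(index=False))
```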

If an attribute is found to have 100 percent unique values, it is valid to use that field as a key. The production data has multiple tables, some of which would be consolidated into a single table. Based on the business rules discussed during the JAD session, planners hoped it would be possible to tie refund data back to individual tickets. Two tables both contained the attribute "ticket number," and the team expected to use this field as the primary key of each table and, therefore, to consolidate the tables on it. The analysis showed, however, that this was not possible.
"The analysis uncovered the fact that some of the other data in that table was either bad or not what was expected," said Lori Getz, advanced system engineer for EDS. "This led us to decide that it would have been a mistake to use that particular table in the coded procedures."

Another benefit of the data analysis was the discovery that the planned indexing scheme would not work properly with the actual data. The indexing plan was modified before any coding or testing of report procedures had been done. EDS was also able to give Continental more accurate time estimates. For example, the team could tell how long it would take to build an operational COS based on the types of data Continental wanted to include, and how much effort would be needed to transform the data before it would be useful. This allowed the project planners at Continental to make informed decisions about the design. With the new information in hand, EDS went forward and completed the design of the data warehouse.

"I estimate that we saved at least two weeks on this project by analyzing the data and creating a design based on validated data and business rules," said Getz.

A second project that Continental is working on is revenue management, which is also based on data in the Teradata data warehouse on an NCR platform. The data warehouse has 35 tables with approximately 350 million rows of data and will continue to grow.

Keith Jones, manager of data warehousing for Continental, used the Migration Architect and a sample of the data warehouse data to validate the design of this project. This sample contained nine tables with 156 attributes and over 15,000 rows of data. Jones wanted to validate the physical characteristics of the attributes, the quality of the keys and some complex relationships. See Figure 1 for a breakdown of the time spent on the project.
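
As an illustration of the kind of relationship check involved, the sketch below tests whether every row in one sample table references a valid key in another; the table and column names are assumptions, not Continental's actual schema.

```python
# Relationship (referential integrity) check on a sample: every flight
# segment should reference a ticket that actually exists.
import pandas as pd

tickets = pd.read_csv("tickets_sample.csv")            # hypothetical samples
segments = pd.read_csv("flight_segments_sample.csv")

orphans = segments[~segments["ticket_number"].isin(tickets["ticket_number"])]
print(f"{len(orphans)} of {len(segments)} segment rows reference a missing ticket")
```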

"I already understood the data model, so I didn't learn much about the relationships," said Jones. "However, I was able to validate that the data loaded matched our expectations."

"Additionally, I did some 'what if' analysis on remodeling the key structures using the 'Show Conflicting Data' view," Jones continued. "This provides an easy way to verify key cardinality. Also, the data profile provides value-skew information that is terrific input to the physical modeling process."

During the analysis, several attributes with questionable domains were quickly found; the business experts will address these discrepancies. Most of the keys proved valid for the data being analyzed, and validation was confirmed for some complex keys based on data structures built to accommodate complex flight schedule tracking.

When Jones was asked about the time savings realized on these projects compared to other projects done without Migration Architect, he said, "It's hard to say exactly how much time we saved. It's like comparing apples and oranges. Before doing a project we would have always analyzed the data to understand the structure and find the key fields, but we got information from using Migration Architect that we never would have tried to go after manually because it just would have taken too much time."
