JUN 1, 2007 1:00am ET

Related Links

When Fast is Not Enough
July 18, 2008
TopQuadrant Software Imports Email MetaData into Semantic Applications
March 26, 2008
An Open Challenge to the Open Source Community
November 30, 2007

Web Seminars

Data Replication for Real-time (Big) Data Warehousing
Available On Demand
Improving your Overall Analytical Environment by Migrating to a New Data Warehouse Platform
Available On Demand
The Dynamic Duo of Data Warehousing and Real-Time Streams
Available On Demand

Strategies for Testing Data Warehouse Applications

Print
Reprints
Email

Businesses are increasingly focusing on the collection and organization of data for strategic decision-making. The ability to review historical trends and monitor near real-time operational data has become a key competitive advantage.

This article provides practical recommendations for testing extract, transform and load (ETL) applications based on years of experience testing data warehouses in the financial services and consumer retailing areas. Every attempt has been made to keep this article tool-agnostic so as to be applicable to any organization attempting to build or improve on an existing data warehouse.

Testing Goals

There is an exponentially increasing cost associated with finding software defects later in the development lifecycle. In data warehousing, this is compounded because of the additional business costs of using incorrect data to make critical business decisions. Given the importance of early detection of software defects, let's first review some general goals of testing an ETL application:

  • Data completeness. Ensures that all expected data is loaded.
  • Data transformation. Ensures that all data is transformed correctly according to business rules and/or design specifications.
  • Data quality. Ensures that the ETL application correctly rejects, substitutes default values, corrects or ignores and reports invalid data.
  • Performance and scalability. Ensures that data loads and queries perform within expected time frames and that the technical architecture is scalable.
  • Integration testing. Ensures that the ETL process functions well with other upstream and downstream processes.
  • User-acceptance testing. Ensures the solution meets users' current expectations and anticipates their future expectations.
  • Regression testing. Ensures existing functionality remains intact each time a new release of code is completed.

Data Completeness

One of the most basic tests of data completeness is to verify that all expected data loads into the data warehouse. This includes validating that all records, all fields and the full contents of each field are loaded. Strategies to consider include:

  • Comparing record counts between source data, data loaded to the warehouse and rejected records.
  • Comparing unique values of key fields between source data and data loaded to the warehouse. This is a valuable technique that points out a variety of possible data errors without doing a full validation on all fields.
  • Utilizing a data profiling tool that shows the range and value distributions of fields in a data set. This can be used during testing and in production to compare source and target data sets and point out any data anomalies from source systems that may be missed even when the data movement is correct.
  • Populating the full contents of each field to validate that no truncation occurs at any step in the process. For example, if the source data field is a string(30) make sure to test it with 30 characters.
  • Testing the boundaries of each field to find any database limitations. For example, for a decimal(3) field include values of -99 and 999, and for date fields include the entire range of dates expected. Depending on the type of database and how it is indexed, it is possible that the range of values the database accepts is too small.

Data Transformation

Validating that data is transformed correctly based on business rules can be the most complex part of testing an ETL application with significant transformation logic. One typical method is to pick some sample records and "stare and compare" to validate data transformations manually. This can be useful but requires manual testing steps and testers who understand the ETL logic. A combination of automated data profiling and automated data movement validations is a better long-term strategy. Here are some simple automated data movement techniques:

  • Create a spreadsheet of scenarios of input data and expected results and validate these with the business customer. This is a good requirements elicitation exercise during design and can also be used during testing.
  • Create test data that includes all scenarios. Elicit the help of an ETL developer to automate the process of populating data sets with the scenario spreadsheet to allow for flexibility because scenarios will change.
  • Utilize data profiling results to compare range and distribution of values in each field between source and target data.
  • Validate correct processing of ETL-generated fields such as surrogate keys.
  • Validate that data types in the warehouse are as specified in the design and/or the data model.
  • Set up data scenarios that test referential integrity between tables. For example, what happens when the data contains foreign key values not in the parent table?
  • Validate parent-to-child relationships in the data. Set up data scenarios that test how orphaned child records are handled.

Data Quality

For the purposes of this discussion, data quality is defined as "how the ETL system handles data rejection, substitution, correction and notification without modifying data." To ensure success in testing data quality, include as many data scenarios as possible. Typically, data quality rules are defined during design, for example:

  • Reject the record if a certain decimal field has nonnumeric data.
  • Substitute null if a certain decimal field has nonnumeric data.
  • Validate and correct the state field if necessary based on the ZIP code.
  • Compare product code to values in a lookup table, and if there is no match load anyway but report to users.

Depending on the data quality rules of the application being tested, scenarios to test might include null key values, duplicate records in source data and invalid data types in fields (e.g., alphabetic characters in a decimal field). Review the detailed test scenarios with business users and technical designers to ensure that all are on the same page. Data quality rules applied to the data will usually be invisible to the users once the application is in production; users will only see what's loaded to the database. For this reason, it is important to ensure that what is done with invalid data is reported to the users. These data quality reports present valuable data that sometimes reveals systematic issues with source data. In some cases, it may be beneficial to populate the "before" data in the database for users to view.

Performance and Scalability

Filed under:
ETL

Advertisement

Comments (0)
A very nice article, it provides good insight of ETL testing. One thing, I would like to state that automation in ETL testing for data transformation testing is the key and something, which can save enormous effort; but I am still not able to find any tool or technique, which can do that effectively. One way as mentioned by Jeff by creating a spreadsheet for input data and expected result seems a good start on this. I think it will also be better if business team can provide those excel as part of business requirement itself.
Posted by Vipin C | Monday, December 26 2011 at 5:37AM ET
Add Your Comments:
You must be registered to post a comment.
Not Registered?
You must be registered to post a comment. Click here to register.
Already registered? Log in here
Please note you must now log in with your email address and password.
Twitter
Facebook
LinkedIn
Login  |  My Account  |  White Papers  |  Web Seminars  |  Events |  Newsletters |  eBooks
FOLLOW US
Please note you must now log in with your email address and password.