JUN 1, 2007 1:00am ET

Related Links

Which CDC method is the best to achieve staging database with changed data?
March 7, 2008
Apart from bloated dimension, what are the negatives of using all known attributes in your SCD?
March 7, 2008
When is it better to have normalized data to create data marts and when is it better to have dimensional data?
March 7, 2008

Web Seminars

Go with the Flow – The Game-Changing Impact of Clickstream Analysis
Available On Demand

Can you provide an-depth process/procedure for performance testing with respect to ETL/reports?

Print
Reprints
Email

Question: I am responsible for data warehouse testing and I'm involved in arriving at a process/procedure for performance testing with respect to ETL/reports. Everything I find is at a very high/abstract level. Can you please guide me?

Joe Oates' Answer: This is not the forum for an in-depth discussion of data warehouse testing, but I hope my answer will help you.

There are two major types of testing that should be done for a data warehouse prior to it being put into production. The first is functional testing. The second is performance testing. Since most organizations know how to do performance testing, I will only cover functional testing in this answer.

Functional testing is done to make sure that each component does what it should and provides correct output. In a data warehouse, functional testing is done for the ETL, canned reporting and typical queries.

ETL consists of three components: extraction, transformation and loading. Some things to test for the extraction component included: 1) validate the source-to-target data mapping and ensure that only mapped source fields are extracted from the source systems, 2) ensure that each source field is ETL'd only one time, 3) ensure that the incremental interval (the to and from dates) are loading the correct number of rows from each source system table (the datetime data type may cause you problems), 4)each automated job should start within a correct time window and "time out" if the source file is not available, 5) error messages should be generated and automatically forwarded to a person if a job fails to complete correctly.

Some things to test for the transformation component include: 1) ensure that surrogate key generation for dimensions is being properly done; 2) ensure that the slowly changing dimensions are working properly; 3) ensure that the correct number of rows for a dimension are produced if you are joining more than one source table in the transformation - it is easy to have duplications if the "where clause" is not correct; 4) currency and other similar types of conversions are being done correctly; 5) ensure that source field to data warehouse field populations are handled correctly, e.g. field truncation's, improper padding, etc.; 6) fact table surrogate key transformation is being handled properly; 7) each of the transformation jobs either completes successfully or produces an error message to that indicate something is wrong.

Some things to test for the loading component include: 1) ensure that new data is bulk loaded and not inserted - it can take an unacceptably long time to insert hundreds of thousands or millions of rows into the database; 2) all dimensions are loaded before any fact tables; 3) the unload task completes successfully or produces an error message to indicate that something is wrong.

For canned reports: 1) check that the reports access the proper data warehouse tables and columns and check that all aggregations and formulas are being calculated properly; 2) for aggregations in formulas that are used by several reports or that will be used by ad hoc queries, have a special table(s) that holds the aggregate numbers and/or formula calculations so that each separate producer and ad hoc query producer does not have to know the details of how to aggregate or how to compose the proper formula; 3)compare and the reports that the data warehouse produces against a similar or the same source system production reports.

For ad hoc queries: 1) keep them simple - many levels of nested queries result in errors as these are nigh impossible to verify; 2) used aggregations and the results of formulas from the table described above rather than have the complex aggregations or formulas as part of the ad hoc query.

Unfortunately, I am not aware of any free in-depth tutorials or guides for data warehouse testing. There are some companies that offer courses that include data warehouse testing. I hope that this helps.

Joe Oates is an internationally known speaker, author and consultant on data warehousing. Oates has more than 30 years of experience in the successful management and technical development of business, real-time and data warehouse applications for industry and government clients. He has designed or helped design and implement more than 30 successful data warehouse projects.

Filed under:
ETL

Advertisement

Comments (0)

Be the first to comment on this post using the section below.

Add Your Comments:
You must be registered to post a comment.
Not Registered?
You must be registered to post a comment. Click here to register.
Already registered? Log in here
Please note you must now log in with your email address and password.
Twitter
Facebook
LinkedIn
Login  |  My Account  |  White Papers  |  Web Seminars  |  Events |  Newsletters |  eBooks
FOLLOW US
Please note you must now log in with your email address and password.