Q: What are some best practices for validating data in staging tables after the ETL process has moved the data from the source to the warehouse staging tables?

Chuck Kelley's Answer:

I personally don't like to validate during the move from source to staging tables. I prefer to extract the data as fast as possible so as not to "bother" the source system. From there, I do all the cleansing and transforming.

However, it depends on the type of data you are bringing down. The approach above may mean keeping a lot of data in the staging layer that you could avoid if you did some basic checking against the source system. This is especially true if you are only bringing down changed rows.
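
A minimal sketch of that kind of fast, watermark-based changed-row extract, deferring all cleansing to the staging layer (the table and column names are hypothetical, and a DB-API style connection is assumed):

    # Sketch: pull only rows changed since the last extract; no cleansing
    # or transforming happens on the source side. "orders", "stg_orders"
    # and their columns are hypothetical placeholders.
    def extract_changed_rows(src_conn, stg_conn, last_watermark):
        src = src_conn.cursor()
        # Keep the source-side query trivial so the source system is
        # "bothered" as little as possible.
        src.execute(
            "SELECT order_id, customer_id, amount, last_modified "
            "FROM orders WHERE last_modified > %s",
            (last_watermark,),
        )
        rows = src.fetchall()
        stg = stg_conn.cursor()
        stg.executemany(
            "INSERT INTO stg_orders (order_id, customer_id, amount, last_modified) "
            "VALUES (%s, %s, %s, %s)",
            rows,
        )
        stg_conn.commit()
        return len(rows)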

Evan Levy's Answer:

Some of the practices I recommend include:

  • Conduct initial domain/range checking on source data, including quantifying specific unique key values before and after ETL processing.
  • Establish a business-centric data check, such as comparing daily transaction or dollar volumes against the loaded DW data.
  • Measure and record daily data quantities (assuming a daily ETL) and match them against the data volume levels of the source systems.
  • Track and quantify the number of null or unexpected values, covering any fields that don't correlate to an expected value (valid addresses, valid ZIP codes, valid phone numbers, nulls, etc.). A sketch of a few of these checks follows the list.
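
A minimal sketch of a few of these checks, assuming the staged rows arrive as a pandas DataFrame; the column names, the ZIP pattern and the 1% tolerance are hypothetical placeholders:

    import pandas as pd

    def validate_staged(df: pd.DataFrame, source_row_count: int,
                        source_dollar_total: float) -> dict:
        results = {}

        # Domain/range check: unique key count after ETL, to compare
        # against the count taken on the source before ETL.
        results["unique_keys"] = df["order_id"].nunique()

        # Business-centric check: loaded dollar volume within 1% of
        # the daily total reported by the source system.
        loaded = df["amount"].sum()
        results["dollar_volume_ok"] = (
            abs(loaded - source_dollar_total) / source_dollar_total < 0.01
        )

        # Volume check: staged row count vs. source row count.
        results["row_count_matches"] = len(df) == source_row_count

        # Null / unexpected-value tracking (US ZIP or ZIP+4 pattern).
        results["null_zips"] = int(df["zip"].isna().sum())
        zips = df["zip"].dropna().astype(str)
        results["bad_zips"] = int((~zips.str.match(r"^\d{5}(-\d{4})?$")).sum())
        return results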

Clay Rehm's Answer:

  • Obtain the business rules for this data. If they are already documented, great; otherwise interview your business partners to collect, verify and document them.
  • Once the business rules are collected and documented, create test plans covering every possible scenario for each business rule. You must do this with your business partners.
  • Each test plan must include a section showing what the data looks like before and after a process such as staging (see the sketch after this list).
  • During testing, IT and the business partners perform similar but separate tests and report the results to the entire team for resolution.
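
One lightweight way to capture that before/after evidence is to snapshot a small data profile on each side of the step under test; a sketch, with hypothetical table and column names and a DB-API style connection:

    def profile(conn, table):
        # Capture a small "what the data looks like" snapshot for the
        # test plan: row count and distinct customer count.
        cur = conn.cursor()
        cur.execute(
            f"SELECT COUNT(*), COUNT(DISTINCT customer_id) FROM {table}"
        )
        rows, customers = cur.fetchone()
        return {"table": table, "rows": rows, "distinct_customers": customers}

    # before = profile(conn, "src_orders")
    # ... run the staging process under test ...
    # after = profile(conn, "stg_orders")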

Tom Haughey's Answer:

1. Validate against the tables in the DW itself. This is useful when you need to validate against history but the history is too vast to duplicate in the staging area.
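
A sketch of such a check, comparing newly staged rows directly against history already in the warehouse rather than against a staging copy (table names are hypothetical):

    def find_unknown_customers(conn):
        # Flag staged facts whose customer never appears in DW history,
        # without copying that history into the staging area.
        cur = conn.cursor()
        cur.execute("""
            SELECT s.customer_id
            FROM stg_orders s
            LEFT JOIN dw.dim_customer d
                   ON d.customer_id = s.customer_id
            WHERE d.customer_id IS NULL
        """)
        return [r[0] for r in cur.fetchall()]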

2. Stage shrunken versions of the DW tables. This usually means storing all the rows but only the few necessary columns. These tables can be used for referential integrity checking, for converting natural keys to surrogate keys, and for building some aggregates. Design covering indexes so the queries can run as index-only queries for speed.
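
A sketch of the natural-to-surrogate key conversion against such a shrunken table; the covering index in the comment is what allows the index-only lookup (all names are hypothetical):

    def load_key_map(conn):
        # The shrunken table keeps every row but only the two key columns.
        # A covering index, e.g.
        #   CREATE INDEX ix_cust_keys
        #       ON stg_customer_keys (natural_key, surrogate_key);
        # lets this scan run as an index-only query.
        cur = conn.cursor()
        cur.execute("SELECT natural_key, surrogate_key FROM stg_customer_keys")
        return dict(cur.fetchall())

    def assign_surrogates(rows, key_map):
        # Rows with no surrogate key fail referential integrity and are
        # routed to a reject list instead of the warehouse.
        loaded, rejected = [], []
        for row in rows:
            sk = key_map.get(row["customer_natural_key"])
            target = loaded if sk is not None else rejected
            target.append({**row, "customer_sk": sk})
        return loaded, rejected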

3. Stage full history tables. This means storing in staging tables the past DW records needed so that history, including historical aggregates, can be created. The drawback is that the same, often voluminous, data then lives in two databases: the staging database and the DW database.
