I'm looking for input on building and maintaining a data warehouse development and testing environment. This environment will be used for development, unit testing, some integration testing, and possibly some stress testing. So far, our warehouse has been small enough that we could get away with using a copy of the production database for testing, but as we've grown, we've been running up against space problems. What I'm looking for, in particular, is:

  • How you populate your development database (full copy of production, subset of production, totally fictional, etc.)?
  • If it's a subset, how is it extracted? Do you write an extract with a myriad of rules to ensure data integrity?
  • If it's totally fictional, how is it created?
  • Timing – do you only include the latest x-number of periods? What happens when you have to regression test? What happens when you have reports that look at many years of data?
  • How is it used (backups, restores, individual tester schemas, baselines, regression testing, etc.)?
  • Does every developer need his or her own schema for updates, with the master schema used as read-only?
  • What are the advantages and disadvantages of different approaches?
  • Should we use the same strategy or environment for the integration, verification and stress testing environments?


Les Barbusinski's Answer: Generally speaking, your development/testing environment should be a statistically significant subset of your production environment, plus any new or modified entities that will be in the next release of your data warehouse. Achieving this ideal is never easy. Unit testing is usually performed against test tables created within the developer's own schema and tablespace; that much is simple. The hard part is creating a representative environment for integration and/or stress testing. There are tools on the market that can generate test data, but (at least in my experience) they're rarely able to synthesize complex inter-table, inter-row or intra-row data relationships and business rules. Hence, testing with synthesized data can mask a lot of logic errors.

The best method I've seen is to do full extracts of small tables (< 100,000 rows) and sample extracts of the larger tables (e.g., extracting every 5th or 10th row). These extracts are then loaded into the test database in RI (referential integrity) sequence, with rows that fail FK constraints being discarded. Once initialized, the test database is copied to a "baseline" backup.

Developers coordinate among themselves when using the test database (updates are allowed). Periodically, the test database is restored from the baseline for regression testing or for general cleanup. In addition, the baseline version is periodically restored, and the test database is altered to add, modify and populate tables for the next release of the data warehouse; this is then backed up to a new baseline. The test database can occasionally be augmented with additional data from production to perform limited stress testing (i.e., focused on a subset of tables).
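
The extract-and-load method described above can be sketched in a few lines. This is a minimal illustration only, using SQLite as a stand-in for the warehouse DBMS; the customers/orders schema, the every-5th-row sampling rate, and all table and column names are assumptions for the example, not anything prescribed in the answer.

```python
import sqlite3

# Hypothetical schema: a small parent table (full extract) and a
# large child table (sampled extract).
SCHEMA = """
CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    cust_id  INTEGER NOT NULL REFERENCES customers(cust_id),
    amount   REAL
);
"""

# --- "Production": 20 customers, 100 orders (toy data) --------------
prod = sqlite3.connect(":memory:")
prod.executescript(SCHEMA)
prod.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(i, f"cust{i}") for i in range(1, 21)])
prod.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, (i % 20) + 1, i * 10.0) for i in range(1, 101)])

# --- Test database: load in RI sequence, discarding FK errors -------
test = sqlite3.connect(":memory:")
test.execute("PRAGMA foreign_keys = ON")   # enforce referential integrity
test.executescript(SCHEMA)

# 1. Full extract of the small table, loaded first (parent before child).
test.executemany("INSERT INTO customers VALUES (?, ?)",
                 prod.execute("SELECT * FROM customers"))

# 2. Sample extract of the large table: every 5th row by key.
loaded = discarded = 0
for row in prod.execute("SELECT * FROM orders WHERE order_id % 5 = 0"):
    try:
        test.execute("INSERT INTO orders VALUES (?, ?, ?)", row)
        loaded += 1
    except sqlite3.IntegrityError:
        discarded += 1                     # orphaned row: discard, keep going
test.commit()

# --- Snapshot the initialized test DB as the "baseline" copy --------
baseline = sqlite3.connect(":memory:")
test.backup(baseline)                      # restore from this before regression runs
```

In a real warehouse the "baseline" would be a DBMS-level backup rather than an in-memory copy, but the workflow is the same: initialize once, snapshot, let developers update freely, and restore the snapshot whenever a clean state is needed.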
