Continue in 2 seconds

The Data Warehouse and Testing

  • February 01 2002, 1:00am EST

Take a look at your data warehouse environment sometime and stop to consider an interesting fact: If you are like most people who have data warehouses, you don't have a "test" data warehouse. Where are all of those procedures you used to put new programs and databases through before putting them into production? Where is the "test" environment that every self- respecting production/transaction/operational environment has?

In the classical online transaction processing (OLTP) environment there was/is the practice of putting everything that was newly developed through the "test" environment. Once the new OLTP code was tested, it was put into production. However, as the world has evolved into the decision support system (DSS) data warehouse environment, the "test" environment seems to have dropped by the wayside.

Why? As with most complex questions, there is no single answer. Instead there is a series of answers.

The first reason why the data warehouse environment does not have a mirror test environment is because of cost. Data warehouses cost enough money as it is, with terabytes of data and the processors and software needed to control and manage those terabytes. Imagine saying to management, "By the way, we need to double the size of our data warehouse environment so that we have a place to test." There is not a manager alive that relishes the thought of spending money like that.

The second reason why you don't find test data warehouses is because of the fact that there are no programs to be tested. The data warehouse environment is fundamentally different from the operational environment. In the operational environment when a new system goes live, there are programs and databases to be tested; but when a new subject area is to be added to a data warehouse, the data is simply added. There are no programs needed to operate on the data once the data has reached the data warehouse. The data warehouse environment is fundamentally different from the OLTP/transaction-processing environment in this regard.

A third reason why you don't find test data warehouses is that when built properly, data warehouses are built in small iterations. Once the data is initially loaded into the data warehouse, it becomes a part of the data warehouse. The only thing that needs testing is the new data being added incrementally. If testing is to be done at all, it needs to be done on the newly added data, not on the entire infrastructure.

Does the fact that there is no separate test environment for the data warehouse environment mean that testing does not need to be done at all? The answer is no. There needs to be testing and quality assurance for the data warehouse environment. The first place testing needs to occur is with ETL processing. ETL processing occurs as data is passed from the operational, transaction environment to the data warehouse, informational environment. With ETL processing, data undergoes a profound and usually complex set of transformations. The logic and code that represent that complexity certainly need to be tested. It is during ETL processing that there is the best chance to proactively address the issue of data quality.

The second place that testing needs to occur in the data warehouse environment is on the data that has been loaded into the data warehouse. Because data in the warehouse is loaded over time ­ in many cases, over a significant amount of time ­ it is only natural that changes to the business conditions underlying the data change. As those business conditions change, the data beneath the business conditions also changes. That data needs to be periodically tested, modified and verified inside the data warehouse.

Most organizations do not discover the need for testing in the data warehouse environment until the second or third iteration of the development of the data warehouse. When an organization is building the first iteration, the interests and attentions of the developers are consumed by lots of things other than testing, such as:

  • What should the database design look like?
  • How should every record be time-stamped?
  • How can the costs of the data warehouse be justified?
  • Who should be leading the development effort?
  • What tools should be selected?
  • How big and how fast will the data warehouse grow?

In the excitement and the newness of the development of the first iteration of the data warehouse, no one pays much attention to the quality of data in the data warehouse. It is when users begin accessing the data warehouse that the developers discover the importance of testing data warehouse data. On the next iterations of development, the issue of testing will not be taken lightly.
In the same vein as the issue of testing and data quality is the issue of whether data in the data warehouse should mirror data found in the operational environment. The answer is no. Data found in the data warehouse should not merely mirror or duplicate data found in the operational environment. There are a myriad of reasons, such as:

  • In the transaction-processing environment, there are two systems, system A and system B. System A operates on a transaction- processing calendar that is based on the calendar. System B operates on a transaction- processing calendar that is based on the corporate year. As data is passed to the operational environment, a different calendar year is imposed on the data. The monthly summarizations of activity from the operational environment to the data warehouse environment will not be the same.
  • System A uses one bill of materials. System B uses another bill of materials. When data is placed in the data warehouse, the data is converted to yet another bill of material. The manufacturing summaries for the operational environment and the data warehouse environment will never be in sync.
  • System A uses one territory for salespeople. System B uses another territory for salespeople. When data is passed to the data warehouse, the data is converted to yet another set of sales territories. The transaction data found at the application level will never be in sync with the corporate data found in the data warehouse.

When the user complains that the data found in the data warehouse is different than the data found in the operational application environment, there needs to be a short education process as to the difference between corporate data and application data. Differences in values between the two environments are not necessarily a sign of lack of quality of data.
Remember, testing still needs to be done in the data warehouse environment, but testing takes on a completely different character in the data warehouse environment than it does in the classical OLTP environment.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access