One of the major challenges facing data warehouse designers is determining the data elements to include in the data warehouse and, almost as importantly, the ones to exclude. A factor that makes this decision difficult is that everyone involved--the developers and the users--is certain that once the data warehouse is built, new applications for it will be discovered. With this in mind, the immediate reaction is "when in doubt, put it in." To understand the danger of following this approach, let's look at the impact on the data warehouse project and on the subsequent operation of the data warehouse itself, using a practical example.

Data Warehouse Impact

During information gathering sessions, the marketing department may indicate that it needs to obtain sales information by customer location. Armed with this request, the data warehouse team would analyze the source systems for information that would identify the customer location, and it may discover that three different locations (mailing, usage and billing) are stored. The astute data analyst would discuss the usage of the information with the marketing representatives, and jointly they may conclude that the usage location is the most significant.

This process eliminated two thirds of the data elements but did not go far enough. With multiple source systems, even for the single address, the information could be stored in different ways. Additional analysis of the business needs could uncover the fact that most of the location-related analysis is done at a ZIP-code level. With this knowledge, the data to be retained about the address has been reduced to a single element.

In this example, reducing the address information to a single element had a profound impact on the development project. From a modeling perspective, only a single element needs to be named, placed within an entity and defined. The transformation logic is also simpler. With the focus on the ZIP code, for example, the data warehouse developer does not need to be concerned with the potentially unique representations of the street address within each system.

The operation of the data warehouse is also streamlined. On a regular basis, data is loaded into the warehouse. If three full addresses are captured, the amount of data that would need to be extracted, processed and loaded could easily exceed 300 bytes. With the focus on the ZIP code, only a few bytes need to be extracted, processed and loaded. As the number of customers increases, the impact of the extra data could endanger the timely execution of these data warehouse processes.

The size of the data warehouse is also impacted. A company with three million customers would be storing an extra gigabyte of data initially. If it incurs a churn rate of 25 percent, another gigabyte of data would be stored every four years.

Iterative Methodology

The data warehouse is developed in an iterative manner. The decision being made, therefore, is not whether or not the element will ever be in the data warehouse, just whether or not it will be incorporated as part of a particular iteration. Factors that should be considered include the likelihood that an element would be needed in the future, the impact of not having it, the viability and relative cost of capturing it at a later date, the cost and time of capturing it now, and the delay that capturing it would impose on completing the iteration. Armed with this information, the data warehouse project manager and key business representatives can determine which data elements are captured in which iteration and which data elements are not captured at all.

Corporate Information Factory Architecture

The Corporate Information Factory architecture facilitates the iterative approach. Within this architecture, the atomic data warehouse is typically designed using relational modeling techniques and is the source of data for the data marts. The data marts are usually designed using dimensional modeling techniques to facilitate direct access by end users.

The relational nature of the atomic data warehouse facilitates the subsequent insertion of additional data elements. Based on the logical model from which the data warehouse model is derived, the location of the data elements can quickly be determined. Once they are added to the atomic data warehouse, they become available for use in those data marts that support the business functions that need the new elements. Users of other data marts are not impacted by the change.

Data warehouse designers must resist the temptation to bring data elements into the data warehouse just in case they may ever be needed. The addition of these elements significantly lengthens the development cycle; increases the run time of the extract, transformation and load programs; and generates excess data within the data warehouse itself. The iterative methodology used for building the data warehouse, combined with the Corporate Information Architecture, provides a basis for determining when, if ever, a particular data element should be included within the data warehouse.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access