This is the second in a series of articles on things to look for in evaluating a data warehouse data model. The classic definition of a data warehouse is that it is a "subject-oriented, integrated, non-volatile, and time- variant collection of data in support of management's decisions."1 A good data warehouse data model clearly demonstrates the time-variant characteristics of the data warehouse. Strategic analysis often requires information on events that transpire over a period of time, and the data warehouse should retain the needed history. For some events, the date (and sometimes the time) needs to be part of the key, and for others it can be a non-key attribute. If the entity can only occur at one point in time, then the date may be an attribute. If it can occur at several points in time, then the date and time must be part of the key. A sales transaction, for example, occurs at a specific point in time. If each sales transaction is uniquely identified, then the date of the transaction should not be part of the key. Information retained about a customer, on the other hand, may change over time. Therefore, for the volatile customer information, the date needs to be part of the primary key.
Sometimes, the date is not part of the key, even for volatile information. If no history of the customer data is retained in the data warehouse (e.g., the table is refreshed with each load cycle) then, within the warehouse, there cannot be two instances of the same customer. In this case, the date can be a non-key attribute, but the business users must understand the ramifications of this decision. The history retention is ultimately a business decision; and as long as the users understand the limitations imposed by the decision, the model is valid.
For example, in the case of customer information, the business users must understand that they may lose intelligence about subsidiary structures. Additionally, if the customer moves, historical sales to that customer may appear in the wrong sales region (unless steps are taken to preserve the relationship of the sale to the sales region).
An important characteristic of a data warehouse data model, then, is to ensure that each instantiation of an entity contains information that can identify the point in time at which it occurred.
A second date and time attribute that may be retained within the data warehouse pertains to the data acquisition process. This attribute is optional. The data warehouse administration staff may find it useful to retain data concerning extract dates and load dates. These dates are never part of the primary key, as they relate to the data warehouse process and not the data itself. (The snapshot date, which may be the same as the extract date, may be part of the key.)
If these dates are to be preserved, the data warehouse administration staff needs to determine which ones are useful. The extract date provides information on when the data was actually obtained from the source system. The load date provides information on when the data was actually loaded into the data warehouse. These dates may be different, and other dates pertaining to the acquisition process may also be useful.
Dates pertaining to the data acquisition process are particularly helpful for tables that are refreshed. When tables are refreshed, history is lost, and the possibility exists that the same report, executed at different times, may yield different results. When this occurs, the data warehouse processing date helps to explain the reason for the difference. For situations in which tables are refreshed, particular attention must be paid to the retention of information relating to the data acquisition process.
The data warehouse often contains information that is summarized for a point in time or over a period of time. Data summarized for a point in time needs to contain a single date within the key; data summarized over a period of time needs to contain two dates within the key. Sometimes, as is the case for monthly summaries, the two dates can be represented by a single attribute (e.g., month); but, implicitly, there are two dates (e.g., beginning of the month and end of the month) involved.
A common error in creating the data warehouse data model is to combine the two types of summaries within a single entity. For example, an entity may contain the month within the key and include attributes for sales dollar amount, sales quantity, customer additions and customer count. The first three of these pertain to the entire month; customer count pertains to a point in time within the month and should not be included within the same entity.
There are three major ways in which time can be represented within the data warehouse data model. Within each entity, the date of each instance needs be defined. The attribute defining the point in time does not necessarily need to be part of the key. To assist the data warehouse administration staff, date/time information relating to the data acquisition process may be captured and stored. While this is optional, it becomes particularly significant for tables that are refreshed with each load cycle. Data may also be summarized for a point in time or over a period of time. It is very important to recognize the distinction between these two types of summaries.
1 Inmon, W. H. Building the Data Warehouse, 2nd Edition. John Wiley & Sons, 1996.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access