One of the attributes of the data warehouse environment is that it contains history; thus, the structure of the data warehouse reflects that dimension of the data.
The first implication of this attribute is that data in the warehouse is stored in units of snapshots. Once the snapshot has been taken and recorded properly, there is no need to ever go back and revisit it. The second implication of the data warehouse being a historical structure is that many records are placed in the data warehouse; and unlike a transaction-oriented database where currency of data is placed at a premium, there is no distinction made between older data and newer data in the data warehouse.
Data warehouses then have the attribute of operating on and storing historical data. From a structural standpoint, this means that there is some unit of time in the body of the record of each unit of data in the data warehouse. In fact, the selection of a unit of time is an important element of database design for the data warehouse environment.
The first aspect of design for the data warehouse is that of choosing the right time parameter. The right time parameter refers to the unit of time chosen being consistent with the data being measured. For example, it does not make sense to choose year/month/day/second for data calculated for the annual report. The annual report will be made only once a year. Therefore, keeping track of time down to the day or second is a waste of resources and simply doesn't make sense. By the same token, keeping track of Wall Street transactions by the year makes no sense. In a year's time, there are likely to be billions of transactions. Tagging a year onto each transaction is unlikely to mark the data as useful.
Instead, the data needs to be "just right." Fast moving, frequently occurring transactions need to be finely marked, perhaps down to the minute or even the second. Transactions that occur infrequently need to have a time stamp that is very broad -- down to the quarter or even the week.
However they are specified, the database designer must be aware that the time stamp requires space. It is true that space can be compacted to some extent, but a time stamp of YYYYMMDDHHSS takes up more space than a simple time stamp of YYYY. Additionally, that space is required for every record that resides in the data warehouse. Therefore, in the interest of saving space, the designer must choose as compact a version of time stamping as possible.
For records that will occur multiple times, it is possible to group the records so that the physical time stamp or at least parts of it can be derived from the group of records rather than from each record. For example, suppose there are n million transaction records, and the time stamp contains YEAR. Every record for the year 2004 will have - in one form or another - the value "2004." By grouping all records for the year 2004, the designer can forego the need to place the value 2004 redundantly on each record. In doing so, much space is saved. Of course, the designer does not have to stop at grouping records for 2004. The designer can group records for January of 2004, February of 2004, March of 2004 and so forth. By grouping down to the month rather than the year, even more space can be saved.
However, when the technique of grouping is used, the designer must be aware that accessing the data has a new level of complexity. Without a grouped time stamp reading, accessing each record is very simple. Each record contains fully its own moment in time. However, with a grouped time stamp, the person accessing the data must first access the group time and then access the record. While this is not a terribly large or burdensome task for the person accessing data, it does represent a complexity.
Time stamping is a fact of life in the data warehouse environment. In fact, there are many ways that time stamping can occur. Each technique for time stamping has its own set of considerations. Some of the ways that time stamping can occur are: at the moment the transaction occurred (assuming that the record in the data warehouse measures a transaction), at the moment the snapshot of data is taken (or the moment the data passes into the data warehouse) and at the moment the data enters the warehouse.
Time stamping by transaction date assumes that a transaction has occurred. The business transaction date then becomes an important milestone for the transaction and is entirely appropriate as a time stamp.
Another relevant and useful measurement of time is the moment the snapshot is taken. Some data found in the data warehouse is not transaction driven; it is time driven. For example, at the end of the month, a snapshot is taken of all accounts. The trigger for the snapshot for some data is the regular passage of time, not a transaction. In these cases, the moment of the snapshot becomes the appropriate value to place in the data warehouse record. The snapshot date can be the time the snapshot was taken or the time the data actually passed into the data warehouse.
Another form of data that makes its way into a data warehouse is non-transaction-based, non-regularly occurring data. An example is data that describes a customer. The customer data enters the data warehouse not as a result of a transaction or a snapshot, but merely as a byproduct of processing customer data. In such a case, the data that is placed in the data warehouse is time stamped as of the moment the data actually enters the data warehouse.
The form of time stamps takes two general shapes - discrete and continuous. A discrete time stamp is one that makes the statement of a set of values at a particular moment. For example, the discrete time stamp for a monthly accounting could say: on July 31, 2004, account ABC-990 was $10,870.56. On August 1, the account may have received a deposit that brought it to $769,005.25. However, the record for July 31 is completely unaffected by the change in values that occurred on August 1. For this reason, it is very unwise to average values placed in the data warehouse in a discrete manner. For example, a person could take the discrete value found on July 31 and the discrete value found on August 30 and average them. While this is certainly mathematically possible, such an averaging may produce a terribly inaccurate result.
Discrete time stamps are used where there are many records that are fast changing. Consider the Dow Jones Industrial average. Keeping a record of the changes throughout the day is theoretically possible. However, there are so many changes that occur so quickly that no one would even find such a measurement useful. Instead, a snapshot of the closing values is taken at the end of the day. In this case, only one value is used and stored.
In continuous time stamping, there is the implication that a variable (or variables) is continuously valued the same from one moment in time to another. As a simple example, John Jones lived in Apartment B from January of 1997 to April of 2004. One record is created for John Jones for his continuous residency in a single place.
Continuous time stamp records normally have two time stamps - a beginning time and an ending time. There are variations where time stamp records can be continuous and where there can be only one time stamp value on the record. However, when the continuity of a record is kept by looking at two records, the processing required for the accessing of the data rapidly becomes complex.
Different records can be grouped together to form a sequence of time span records that continuously represent changes in a variable over time. For example, suppose the marital status of June Jones was being tracked. The records for June Jones might look like the following:
- July 20, 1965 to January 31, 1985 - single
- February 1, 1985 to August 19, 1996 - married
- August 20, 1996 to present - divorced and single
These three continuous records tell the story of June Jones' marital status from birth to the present. The different records combine to form a continuous representation.
Continuous time span records can be represented as going into the future. The last of the sequence of records shows the value "to present." This means that the record holds the latest known value for the variable. These records hold data that does not change frequently. If the underlying data were to change frequently, many time span records would be created.
There are many other ways that the data warehouse holds time variant data. The ways discussed here are the most typical; however, by no means are they the only ways. Thus, time variancy is one of the skills of database design for the designer and developer of the data warehouse.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access