For all of the diversity found in the data warehouse/DSS environment across different industries and different departments, the basic structuring of data inside the data warehouse is surprisingly similar. The basic structuring of a record of data within the data warehouse/DSS environment is a key, a unit of time, primary data and secondary data (optional).
The key is an identifier (which may be unique or non unique) for the record of data. The unit of time identifies (usually) the moment the record of data was created and allows the record of data to be accessed according to the dictates of time. Primary, non-key data is the main focus of the record. Secondary data (which may or may not be contained in the record) is related and incidental data that is gathered with the primary data at the moment in time that the record is created.
This simple definition of the contents of the structure of a record in the data warehouse environment can be organized in a number of ways. One way is in terms of a simple snapshot of data.
In a simple snapshot of data, some business related event has occurred that triggered a record to be written, such as placing an order, making a phone call, completing a shipment, and so forth. The key of the record is sales ID. The unit of time is sales data and time. The primary data relates to the sale itself. Any other data, not directly relating to the sale, that is captured and stored in the record is secondary data.
The next record type found in the data warehouse is a continuous record as shown in Figure 1.
The implication of a continuous record is that the values contained in the record were continuously valid for the length of time between the "from" and "to" date.
The continuous record is quite different from a simple snapshot. In a simple snapshot, no implications can be made about a value before or after the moment of the snapshot. The snapshot guarantees accuracy only as of the instant of the snapshot. Any accuracy before or after that moment is purely coincidental.
On the other hand, the continuous record of data guarantees that all variables within the record were valid and had a single value during the life of the record. This implies that only data that is slow to change be captured and measured by a continuous record of data. If data that is volatile is placed within a continuous time span record, then many occurrences of the record will have to be made every time the variable changes values.
The third type of record commonly found in a data warehouse is an "aggregate" or a "profile" record. Figure 2 depicts an aggregate record.
An aggregate record is made from any number of detailed records, usually from the operational environment. In Figure 2, many transactions have been executed and captured in the operational environment. This is typical of a bank where an account holder performs many activities throughout the month or a telephone number where calls throughout the statement cycle are captured for billing.
An aggregate record is written where there are too many detailed occurrences of data to be placed in the data warehouse. The aggregation of data has the effect of condensing data dramatically. Of course, some level of detail is lost where an aggregation record is created. It is up to the designer to make sure the level of detail lost is not important to DSS processing.
Within the category of continuous time span records are several important subclassifications. The first subclassification refers to the grouping of multiple records which are logically continuous. This means that there is no discontinuity of time from one record to the next. The ending date of one record is exactly one unit less than the beginning date of the next record. If record ABC has an ending date of June 10, then the beginning date of the next record is June 11. Or if the ending date is July, the beginning date of the next record is August. Or if the ending date/time is August 15 at 3:52 p.m., the beginning date/time of the next record is August 15 at 3:53 p.m. At no moment are the variables defined by the continuous records undefined.
Another possibility for the grouping of continuous time span records is the case where there are legitimate gaps of discontinuity. Figure 3 illustrates this case.
In Figure 3, when one record ends, the next record does not necessarily begin at the ending unit + 1. In this example, if the ending date of one record is August 16, the next record may begin at August 27 or September 13 or any date greater than August 16. When there are gaps in the continuity of data, the variables contained within the record are undefined for those moments in time where there is no definition provided by the records.
The one case almost never found is an overlap of time. In a time overlap, the ending date for one record may be September 5, and the beginning date for the next record may be September 1. In the case of overlap, a given variable may be assigned two different values at the same moment in time. This is normally a logical conflict and does not make sense under all normal circumstances.
But multiple record continuity of definition is not the only defining characteristic for continuous time span data. Another defining feature is that of the starting and stopping times represented by the time spans. Figure 4 shows that the beginning date is either an actual value or negative infinity. The ending date can either be an actual value or positive infinity.
When the beginning date is negative infinity, the record is interpreted as having a value from all previous time up to the ending date of the record. When the ending date is positive infinity, the interpretation is that the record is valid from the beginning date onward into the foreseeable future.
Of course, any combination of beginning and ending dates can be chosen:
- Beginning date actual value, ending date positive infinity.
- Beginning date actual value, ending date actual value.
- Beginning date negative infinity, ending date positive infinity.
- Beginning date negative infinity, ending date actual value.
Each of these combinations of values indicates something unique regarding the span of time that is represented.
The different types of data found in the data warehouse have an effect on the data relationships that are found there as well. Figure 5 shows that foreign key relationships are found in the data warehouse.
The foreign key relationship that is depicted in Figure 5 is the same as any other foreign key relationship that might be found in the world of databases with two important exceptions:
- The foreign key relationship is valid for only the unit of time that is represented (either a snapshot or a continuous time span record), and
- The foreign key relationship is valid only for the values of data found in the records and no other values.
These two exceptions make data warehouse foreign key relationships (sometimes called "artifacts of a relationship") quite different from the relationships represented in referential integrity.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access