Much discussion has been devoted to slowly changing dimensions in the data warehouse environment. In truth, the discussion has been driven by the advocates of multidimensional technology as they address the issues of star schemas, fact tables and multidimensional tables. However, the notion that data can change over time is a valid one, and it is as applicable to the corporate information factory as it is to other data warehouse environments.
How then does the corporate information factory handle slowly changing dimensions or the more general case of changing data over time?
There are several ways in which data that changes over time is handled in the corporate information factory. The first and most profound way is through the gathering of historical data at a granular level of detail. The very essence of the data warehouse is the collection of granular data over time. This often means collecting transaction data, detailed customer data or detailed production data over time. The very details are kept and faithfully transcribed as they occur. Not only is the data about the transaction placed in the data warehouse, but the time the transaction occurred is also stored (e.g., the sale of a part AX-128U made on April 13 to customer GM for $200 with shipment of the order on April 16).
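As a minimal sketch of the granular record described above, the transaction and the time it occurred might be stored together as a single fact record. The field names and the year are assumptions for illustration; the source gives only the part, customer, amount and dates.

```python
from datetime import date

# Hypothetical granular fact record: the details of the transaction,
# including when it occurred, are stored together. Once recorded
# correctly, this record never changes. The year 2000 is assumed.
sale = {
    "part": "AX-128U",
    "customer": "GM",
    "amount_usd": 200,
    "sale_date": date(2000, 4, 13),
    "ship_date": date(2000, 4, 16),
}
```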
The details of the transaction and its date are stored in the data warehouse. As time passes, the data, if it has been recorded correctly, does not change. Other incidental aspects of the order may change over time, such as the fact that GM referred to General Motors on April 13 and now refers to General Mills. Or, the corporation may have added a new identification on May 14 for orders of more than $100. However, the basic facts of the transaction do not change once recorded properly.
What does change over time is meta data and reference data.
Reference data takes many forms. In some cases, reference data is encoded data (e.g., GM equals General Mills, FMC equals Ford Motor Company). In other cases, reference data is ZIP code data (e.g., 80104 is Castle Rock, Colorado). In yet other cases, "desc" is a shortening of description and "ptno" is a shortening of part number. There are many forms of reference tables.
Perhaps the simplest way to handle changing reference data over time is to periodically (e.g., every six months) take a snapshot of a reference table and store it in an archive. Then, if there is a need to go back in time and find out what a particular value was at any moment, the appropriate archived value can be retrieved. The complication with this approach is that capturing the value of a reference table every six months may not capture every value that has passed through the table. For example, suppose that the reference table is archived every January 1 and every July 1. What happens if a change is made to the table on January 15 and then reversed on May 15? Neither snapshot sees the January 15 value, so the archived contents of the reference table will not be accurate in this case.
In some cases, it will not matter that the archived reference data is imperfect over time. However, if it matters that all changes be captured, then a more complex approach must be taken. A snapshot of the reference table is taken once a year, and the updates that occur throughout the year are saved along with that snapshot. If it is later desired to go back in time and recreate the reference table exactly as it was, the saved updates can be replayed against the snapshot, yielding a perfect replication of the reference data. In doing so, there is no loss of accuracy. It is noteworthy that this approach is complex.
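The snapshot-plus-update-log approach can be sketched as follows. This is a minimal illustration, assuming the reference table is a simple code-to-name mapping and that each logged update carries its effective date; the dates and values are taken from the article's examples, with the year 2000 assumed.

```python
from datetime import date

# Hypothetical yearly snapshot of the reference table, taken January 1.
snapshot = {"GM": "General Mills", "FMC": "Ford Motor Company"}

# Updates logged during the year: (effective_date, key, new_value).
# The January 15 change is reversed on May 15, as in the article's example.
updates = [
    (date(2000, 1, 15), "GM", "General Motors"),
    (date(2000, 5, 15), "GM", "General Mills"),
]

def reference_as_of(snapshot, updates, as_of):
    """Recreate the reference table exactly as it was on a given date
    by replaying the logged updates, in order, onto the snapshot."""
    table = dict(snapshot)
    for effective, key, value in sorted(updates):
        if effective <= as_of:
            table[key] = value
    return table

# On March 1 the January 15 change is in effect ...
print(reference_as_of(snapshot, updates, date(2000, 3, 1))["GM"])  # General Motors
# ... but by July 1 it has been reversed.
print(reference_as_of(snapshot, updates, date(2000, 7, 1))["GM"])  # General Mills
```

Note that a twice-yearly snapshot alone would miss the General Motors value entirely; only replaying the update log recovers it.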
Both approaches assume that reference data is not as voluminous as other forms of data; storing entire reference tables every six months or so is not feasible when reference data consumes a great deal of space.
By periodically taking a snapshot of reference data and by tracking the changes that are made to it, reference data can be changed over time and those changes can be reflected as well.
The third form of data that changes over time is meta data. Meta data is much more problematic when it comes to managing changes over time. The first problem is that current-valued meta data is normally not stored in an easy-to-access manner, much less stored over time. About the best that most shops do is to store current meta data centrally in a repository. Unfortunately, meta data is distributed and is managed and owned in different places throughout the environment; therefore, centralized management of meta data defeats its purpose. Asking that changes to meta data be managed over time is asking a lot, given the crude state of meta data management.
Nevertheless, meta data (such as database structures, database relationships, data definitions, source data, target data, transformations, calculations, conversions, data models, etc.) does change over time.
Each form of meta data needs to be tracked over time. The normal way to track it is to create a series of continuous definitions. For customer, the definitions might look like the following:
Jan 1 to Feb 15: someone who buys a product
Feb 16 to Mar 21: someone who buys a product or who might buy a product
Mar 22 to Apr 6: someone who buys a product, who might buy a product or who has bought a product
Apr 7 to May 8: a person or an organization who buys a product, who might buy a product or who has bought a product
May 9 to Jun 19: a person or an organization who buys a product or a service, who has bought a product or a service or who might buy a product or a service
By tracking the meta data over time, the changes are kept so that at any point in time, it is possible to see what the current definition of customer is (or was). There is no overlap, and there are no points of discontinuity. By keeping track of meta data continuously, the history of meta data can be stored.
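The continuous, non-overlapping definitions described above can be represented as a list of effective-date ranges, with a simple point-in-time lookup. This is a sketch under assumptions: the source gives no years (2000 is assumed), and only the first three definitions are shown.

```python
from datetime import date

# Hypothetical history of the "customer" definition, kept as
# contiguous, non-overlapping (start, end, text) ranges. Years assumed.
customer_definitions = [
    (date(2000, 1, 1), date(2000, 2, 15),
     "someone who buys a product"),
    (date(2000, 2, 16), date(2000, 3, 21),
     "someone who buys a product or who might buy a product"),
    (date(2000, 3, 22), date(2000, 4, 6),
     "someone who buys a product, who might buy a product "
     "or who has bought a product"),
]

def definition_as_of(history, as_of):
    """Return the definition in effect on a given date, or None
    if the date falls outside the tracked history."""
    for start, end, text in history:
        if start <= as_of <= end:
            return text
    return None

# The definition in effect on March 1:
print(definition_as_of(customer_definitions, date(2000, 3, 1)))
```

Because the ranges neither overlap nor leave gaps, every date within the tracked period resolves to exactly one definition.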
It is important to keep meta data because it provides context. As a simple example, consider its use over time. Suppose a manager asks for a new report on U.S. sales. The report is generated. Now the manager asks for the same report covering the past five years, looks at the results and sees that revenues dropped precipitously from 1999 to 2000. The manager simply does not believe the numbers. The analyst then looks back at the meta data describing the data on which the reports are based and finds that in 1999, sales figures included Canada and Mexico, but in 2000, Canada and Mexico were excluded. In fact, sales picked up nicely from 1999 to 2000, but the raw data did not show that to be the case. Only when the meta data from those years was accessed did the context of the numbers become obvious.