The part of the debate we focus on here often brings people to the just-in-case decision to make the ODS also time-variant, so that future analytical needs can be met by keeping a full history, in 3NF, in the ODS. This ODS, then, serves two purposes, the second being the very nonoperational purpose of keeping a full history of the organization's data.
Many would call this ODS a data warehouse and would only generate overlapping summarized dimensional models from the base level. This is the general Bill Inmon/Claudia Imhoff approach. On the other hand, the Ralph Kimball approach differs in the sense that the dimensional models are all on the lowest level of available grain, also being the most dimensional at that level, which makes the content of the time-variant ODS and the content of the dimensional model identical - making the ODS redundant from an historical reporting point of view, but perhaps not from an operational integration point of view. Note that an ODS with a pure operational integration role does not need to be time-variant at all.
The rest of this article assumes that, for whatever reason, the decision has been made to have a 3NF time-variant data store to serve the general role of a point of integration as well as to keep a full history of data in the organization, regardless of whether this is also stored downstream in other repositories.
Consider a simplistic entity relationship diagram that depicts the products of a retail organization:
ProdNo -> ProdCat
ProdNo -> ProdOwn
Tx -> ProdNo
This model is in 3NF because the primary key of each of the entities is unique, and none of the attributes are repeating and are, therefore, fully functionally dependent on the primary key and only the primary key.
An example of a denormalized model containing similar content would have looked like this:
Tx -> ProdNo+ProdCat
Or even worse:
Tx+ProdNo+ProdCat
But the content of such a model can easily be corrupted by updating a single row and neglecting to also update all the repeating dependencies in the corresponding rows. This would violate the integrity of the data. Storing data in 3NF is therefore a great way to ensure data integrity because it is impossible to capture multiple descriptions for the same category, as the functional dependency is inherent to the way the data is stored.
The foreign keys explicitly reference the primary keys of the related tables. The foreign key can be null (if the relationship is optional), but if it contains a value, it must correspond to a valid value in the relevant primary key column. The reference has to be on the entire primary key (often only a single column, but many examples exist where a primary key consists of more than one column).
Adding Time to 3NF
Now Father Time enters stage left and causes havoc with Father Codd's beautifully ordered world.
Imagine a product is recategorized. Smarties used to be in the candy category but have now been placed in the chocolate category. Operationally, all that is required is an update. This update then makes the "now" view of the data 100 percent correct. But by implication (and because of the Codd functional dependency), if we look at a total for candy against a transaction for last year, the revenue will have moved to the chocolate category. If these categories have different managers, now reporting their revenue figures would make them all into liars because of the simple update.
This brings the necessity to add time to the model. Note that this entire dilemma occurs because the database is being used for both operational and historical decision-making/MIS reporting. So, from now on, a product will have an active_start_date and an active_end_date. That way we can distinguish between different versions of the product. But now we have a dilemma, in that the natural key of the product file (prodcode) no longer uniquely identifies a product. The natural key together with the active_start_date is unique, but this is not a good candidate for a primary key. Enter surrogate keys, dragged in by the ears from stage left as well. We have to assign a nonintelligent key to product, and we will have to maintain this key as part of the transformation and load process that populates the ODS. This implies that the foreign keys in the tables referencing the product table must now refer to the surrogate key, and no longer to the natural key. The ETL process that populates the ODS has to look up the equivalent surrogate key value and replace the natural key with a corresponding surrogate key. Now the old transactions will have the surrogate key that corresponds to the old version of a product and the new transactions will have the surrogate key that corresponds to the new version of the product. We also have to do the same for product category, because descriptive elements of the product category can also change over time.










Be the first to comment on this post using the section below.