Assume for this discussion that we have (after long deliberation) worked out that we need a third normal form (3NF) operational data store (ODS) - with a real operational need for cross-source integration and, perhaps, some operational reporting requirements to fulfill. Most people, though probably not all, would agree that 3NF is not the most optimal storage model for large-volume analytical reporting. However, most people would agree that 3NF is a great format as a source for some downstream analytical data store, where the data is usually stored in a more dimensional (implied: denormalized) and query-friendly format. There are very lively and interesting ongoing debates about whether such dimensional models should be on transaction-level grain or only summarized.
The part of the debate we focus on here often brings people to the just-in-case decision to make the ODS also time-variant, so that future analytical needs can be met by keeping a full history, in 3NF, in the ODS. This ODS, then, serves two purposes, the second being the very nonoperational purpose of keeping a full history of the organization's data.
Many would call this ODS a data warehouse and would only generate overlapping summarized dimensional models from the base level. This is the general Bill Inmon/Claudia Imhoff approach. On the other hand, the Ralph Kimball approach differs in the sense that the dimensional models are all on the lowest level of available grain, also being the most dimensional at that level, which makes the content of the time-variant ODS and the content of the dimensional model identical - making the ODS redundant from an historical reporting point of view, but perhaps not from an operational integration point of view. Note that an ODS with a pure operational integration role does not need to be time-variant at all.
The rest of this article assumes that, for whatever reason, the decision has been made to have a 3NF time-variant data store to serve the general role of a point of integration as well as to keep a full history of data in the organization, regardless of whether this is also stored downstream in other repositories.
Consider a simplistic entity relationship diagram that depicts the products of a retail organization:
ProdNo -> ProdCat
ProdNo -> ProdOwn
Tx -> ProdNo
This model is in 3NF because the primary key of each of the entities is unique, and none of the attributes are repeating and are, therefore, fully functionally dependent on the primary key and only the primary key.
An example of a denormalized model containing similar content would have looked like this:
Tx -> ProdNo+ProdCat
Or even worse:
But the content of such a model can easily be corrupted by updating a single row and neglecting to also update all the repeating dependencies in the corresponding rows. This would violate the integrity of the data. Storing data in 3NF is therefore a great way to ensure data integrity because it is impossible to capture multiple descriptions for the same category, as the functional dependency is inherent to the way the data is stored.
The foreign keys explicitly reference the primary keys of the related tables. The foreign key can be null (if the relationship is optional), but if it contains a value, it must correspond to a valid value in the relevant primary key column. The reference has to be on the entire primary key (often only a single column, but many examples exist where a primary key consists of more than one column).
Adding Time to 3NF
Now Father Time enters stage left and causes havoc with Father Codd's beautifully ordered world.
Imagine a product is recategorized. Smarties used to be in the candy category but have now been placed in the chocolate category. Operationally, all that is required is an update. This update then makes the "now" view of the data 100 percent correct. But by implication (and because of the Codd functional dependency), if we look at a total for candy against a transaction for last year, the revenue will have moved to the chocolate category. If these categories have different managers, now reporting their revenue figures would make them all into liars because of the simple update.
This brings the necessity to add time to the model. Note that this entire dilemma occurs because the database is being used for both operational and historical decision-making/MIS reporting. So, from now on, a product will have an active_start_date and an active_end_date. That way we can distinguish between different versions of the product. But now we have a dilemma, in that the natural key of the product file (prodcode) no longer uniquely identifies a product. The natural key together with the active_start_date is unique, but this is not a good candidate for a primary key. Enter surrogate keys, dragged in by the ears from stage left as well. We have to assign a nonintelligent key to product, and we will have to maintain this key as part of the transformation and load process that populates the ODS. This implies that the foreign keys in the tables referencing the product table must now refer to the surrogate key, and no longer to the natural key. The ETL process that populates the ODS has to look up the equivalent surrogate key value and replace the natural key with a corresponding surrogate key. Now the old transactions will have the surrogate key that corresponds to the old version of a product and the new transactions will have the surrogate key that corresponds to the new version of the product. We also have to do the same for product category, because descriptive elements of the product category can also change over time.
This way our model stays pure 3NF in structure. The referential integrity constraints can still be enforced using the declared referential integrity constraints implemented inside the database management system (DBMS).
The Data Explosion
Alas, we may think all is well on stage, but consider, for example, the description of a product category has changed. This means that at a given point in time a new version of the product category is created. In our (very simplistic) example, we now have an orphan product category (as it doesn't contain any products). In reality, we now have to create new versions of the products from this point in time onward, and new versions of everything else that directly or indirectly references the product categories. Say 100 products fall into the chocolate category; all 100 product entries have to be duplicated and given a new surrogate key.
This sounds trivial, but in a low-cardinality versus high-cardinality scenario, this can be devastating. Imagine a client segment that changes description, affecting 10 million client records. Suddenly the pure 3NF model does not seem so practical any more. Specifically note that one of the main reasons why we want to store data in 3NF, namely to save on space, disappears off the stage in a flash. Also note that the underlying rows in the table are no longer normalized because the functional dependency still exists between, say, product code and product name, and now repeating groups are starting to appear. So even though we can still declare the referential integrity constraints on the surrogate keys, we are in fact not enforcing the integrity of the crucial underlying data anymore.
For the time being, let's remove the surrogate keys from the model, and leave the model as having a natural key and an active_start_date and active_end_date. In this scenario, we cannot declare the referential integrity on a database definition level, but we can still enforce it through the ETL process. In doing so, we can reduce the number of rows in the database significantly. However, this puts an enormous responsibility on the people querying the database, as they have to restrict each and every query on the table by the appropriate dates. The row is no longer unique just on natural key, but on the natural key and active_start_date combination. The active_end_date of each row also has to be updated appropriately to ensure that there are no gaps for a given natural key. There are also not supposed to be any overlaps in the active_start_date, active_end_date for any given natural key. So, to query any given table, a point-in-time has to be chosen, and the rows have to be filtered out with a "between" statement. To get the "now" view, the where-clause has to contain "sysdate() between active_start_date and active_end_date" for each table in the FROM clause. A set of views can easily be created to reflect the "now" view of the database. This set of views will be in pure 3NF, both in structure and in content.
Note that if a query does not include a point-in-time filter, duplicate natural keys will be returned (one for each version of a given natural key). If this result is joined to another table, a Cartesian product will result. If many tables are involved in such a query, the result might be orders of magnitude larger than expected because each natural key's duplicates will be combined with all the duplicate natural keys from all the referencing or referenced tables, exponentially growing the number of rows in the result.
Additional benefits include the ability to query two tables as at two different points in time. For example, it creates the ability to restrict the Branch table to how it looked 12 months ago, but join it to the Balances table as it looked three months ago. It's surprising how often this kind of query is required for what-if type analyzes. Note that the structure of the database is still 3NF, and that the content, when filtered on a given point in time (as described above), is also still in 3NF, but across time, the content is no longer in 3NF. The onus now rests on the processes that populate the ODS to ensure that the referential integrity of the data is intact. It would also be possible to code triggers in the database that are time-variant-aware to reject any invalid data, but standard declarative referential integrity constraints cannot be enabled. In this way the data can still be manipulated and joined using the natural keys. Note, however, that this approach violates one of the key premises of an ODS, namely, to make it easy and intuitive to query.
I first heard this term used by Ralph Kimball to refer to a system-assigned natural key that is guaranteed to remain the same for the entire lifetime of an entity. When a very wide natural key is used in a source system, an integer surrogate key may be introduced into the ODS, but this surrogate key is not a traditional dimension-like surrogate key, which increments each time a change occurs to a natural key. It is, rather, a super-natural key, which has a one-to-one relationship with the natural key. At this time, in order to save space, and in keeping with the time-based 3NF, it may be permissible to replace the natural foreign key in any referencing tables with the equivalent super-natural key as a foreign key. The ETL process then has to do a traditional key lookup to replace the natural foreign key with its corresponding super-natural key value.
Note that there is no reason why the original natural keys from the source system cannot be used. Because there is no logical difference between the natural key and the supernatural key, the model is structurally identical but may use up less space as a supernatural key can be implemented as a four-byte integer.
Through a progression of thought-processes, this article has illustrated how one can get to an ODS model that is in a time-variant 3NF. The benefit of a time-variant 3NF structure is that it can contain a full history of data. This can be achieved utilizing either natural keys or super-natural surrogate keys that have a one-to-one value relationship with the natural key.
The overhead involved is storage of the additional versions of each entity as it changes over time. In addition, querying the database is a lot more intricate because each and every table involved in a query has to be restricted to its version that was applicable at a chosen point in time. The danger of this structure is that queries that do not apply the correct filters will result in Cartesian product results (which are both incorrect and processing resource intensive). It also violates the premise of making the ODS easy to use and efficient to query.
An ODS with a pure operational integration role does not need to be time-variant at all, but if, for whatever reason, the requirement exists to have an ODS that is both operational and also must contain a full history of data, this time-variant 3NF method is a very practical implementation alternative.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access