Relational databases usually keep data about the current state of things - about how things are now. Updates do occur, of course; but when they do, they overwrite the data already there. In doing so, history is lost. We can't tell how things used to be. To tell how things used to be, as well as how they are now requires that we introduce time into our databases.
This series of articles on a very technical topic is motivated by the fact that increasingly, businesses need real-time access to historical data. No longer can all historical data be pushed off onto archive files or onto less immediately accessible historical databases. The objective of this series of articles is to show how to provide real-time access to history in particular, and to any kind of noncurrent data in general, and to show how to do so using the relational database management systems (DBMSs) and SQL that are available today.
First, a note on terminology. It will often be convenient to talk about history, because that is the kind of noncurrent data that businesses are primarily concerned with. But the real topic is the representation of time in relational databases, whether past, present or future time, whether real-world or database time, whether actual or hypothetical time, whether linear or branching time.
Managing time in today's relational databases is difficult. With more than 50 years of combined experience in data and database management, we have encountered no database model more complex for its size than the model we will propose, no queries fraught with more possibilities for misstatement and misinterpretation than queries against databases that attempt to manage time. But we know of no simpler way to meet business requirements for real-time access to what we will call "versioned history."
However, very little support for time management has yet been provided by DBMS vendors or is specified in the current SQL standards. The complexity of time management cannot be hidden, as it ultimately must be, inside the DBMS itself, with declarative interfaces provided in extensions to the SQL DDL and DML languages.
But businesses need robust and reliable time management in their databases now, not eventually. So we take a do-it-yourself approach. And the first thing we must do ourselves is find a way to encapsulate both time management data structures and the additional complexities in SQL statements that are required for time management data access.
As for updating temporal data structures, all updates should be mediated by application program interfaces (APIs). If an operational data store (ODS) or data warehouse is the database being updated, the API function will be provided by the extract, transform and load (ETL) process, whether hand coded or implemented with a commercial ETL tool.
For retrievals against temporal data structures, all queries should access a set of view tables from which as much of the technical detail of managing time will have been removed as possible. These tables will satisfy the following two requirements:
- When point-in-time data is wanted, these view tables will look, column for column, like tables that are not temporally managed, i.e., like "typical" or "normal" tables in a database.
- When that desired point in time is now, these view tables will guarantee that existing retrievals will remain completely unaffected as time management is added to the tables they access.
We should think of these temporal tables as no more accessible to the end user, DBA or developer than are the dynamic balanced-tree access method structures that provide the file storage subsystem for many commercial DBMSs. The only database users able to directly manipulate these temporal tables should be:
- The developers who translate source system or end-user-supplied update transactions into updates to these temporal tables, and
- The developers who write the CREATE VIEW statements that encapsulate these tables for retrieval transactions.
Figure 1 illustrates the encapsulation necessary to hide the complexities of temporal database structures from both those who update those structures and those who retrieve data from them.
Figure 1: Encapsulating Temporal Tables
When we IT professionals talk about history in databases, we may have several different things in mind. A taxonomy of the types of history that IT data management professionals talk about is presented in Figure 2.
Figure 2: A Taxonomy of History as Managed in RDBMSs
Database history is historical data kept in relational database tables. The two major categories of historical data are reconstructable history and queryable history.
Reconstructable history is data about the past state of something, obtained by restoring a backup file and then applying update transactions captured in a DBMS logfile, from that point in time forward to the point in time of interest. This requires the intervention of IT personnel and is therefore not real time.
Queryable history is data about the past state of something, obtained with a SQL query, without the need for restoring backups and reapplying logged updates. Because the query may be directly executed, this type of history is possibly real time. It is actually real time if its performance meets the parameters which define real time for the class of query it belongs to. Queryable history may be kept as either a record of events or as a record of states.
Event history is data about events that have altered the states of things. This data is captured as transactions in tables that are best queried with links to the leaf-level entries of a set of dimensional hierarchies. Thus, these tables are often queried as the fact tables in star schema data marts, usually by means of OLAP reporting tools.
State history is historical data, captured as copies of the current state of an object, either on a periodic basis or in response to a specific update event. State history may be kept as either snapshots or versions.
Snapshot history is a coordinated set of copies in a relational database. Snapshots are often taken at the entire database level, but sometimes at the level of a semantically related subset of tables. Snapshots occur at regularly scheduled intervals. Consequently, snapshots will miss any updates that are overlaid by later updates to the same data that happen before the next snapshot is taken. Another drawback to snapshots is that they are an inefficient way to record history, because they create copies of rows whether or not those rows have changed.
Version history is individual row-level logical updates, implemented without overlaying data, by "retiring" the current version of the object and replacing it with a new version that contains the updated data. Versions are created on an as-needed basis, i.e., every time an update important enough to be versioned occurs. Consequently, versions will not miss any updates. They are also an efficient way to record history because they create new rows only when a versionable change has occurred.
Thus, for example, suppose that the current version of customer 123's name is Mary Smith. When she changes her name to Mary Jones, the Mary Smith version of that customer loses its status as the current version. The Mary Jones version is inserted and marked as the new current version. Effectivity dates indicate when each version was the truth about Mary, as the business knew it.
Over the past two decades, many suggestions have been made as to how vendors should implement time management in relational databases, and how SQL DML and DDL should be modified to provide declarative access to those new capabilities. The best access to pre-2000 computer science research is written by the principal computer scientist working in the field of temporal databases.1 A well-known commentator on relational databases and a member of the SQL standards committee have also written on this topic.2
But our focus is not on recommendations to vendors or to the SQL standards committees. Our focus is on the here and now. Given today's relational DBMSs, and the SQL common to them, how can we provide the temporal support that businesses increasingly need, and that they need now?
This series of articles will continue in the DM Direct newsletter and will be published every other issue beginning May 4. The discussion will be at the implementation level of detail, although to emphasize its relevance, we will present this very technical material in terms of a real-world business application. In these articles, academic research will be brought to bear on the problem of providing temporal data management in today's relational databases. However, our emphasis is more on real-world implementation than on theory. These articles will address the problems and issues faced by business IT data architects, data modelers, DBAs, those developers who write the ETL or other interface logic to update temporal tables and those developers who write the views which insulate the rest of the IT and business community from the complexities of temporal data.
It's about time.
- R. T. Snodgrass. Developing Time-Oriented Database Applications in SQL. San Francisco: Morgan-Kaufmann, 2000.
- C. J. Date, Hugh Darwen and Nikos Lorentzos. Temporal Data and the Relational Model. San Francisco: Morgan-Kaufmann, 2002.