A glossary of technical terms used in these articles can be found on the Web sites of the authors. In addition, a listing of earlier articles and/or columns in this series can also be found on DMReview.com, MindfulData.com and InbaseInc.com.
In the last two columns in the series, we looked back at what the Time and Time Again series covered and stated intentions for future columns. Before resuming our journey using the new roadmap, we should remind ourselves:
- Why this is a road worth travelling;
- Why it is important to understand how to create, maintain and make use of what computer scientists call bi-temporal tables; and,
- In particular, why we focus on the kind of bi-temporal tables we call "asserted version" tables.
To understand the special place that asserted versioning has among temporal data management methods, we will make use of an extended version of the taxonomy introduced in Part 1 of the series. To emphasize the relevance of our discussions to real-world data management issues, we will consider four hypothetical questions, asked by a client on the telephone with an insurance company's customer service representative (CSR). The context - a telephone conversation - is important. It takes place in real time, which means the acceptable wait time for a response to a database query is very short. And it is a critical context, because it is a customer on the line.
The customer asks:
- Was I covered for the cost of a diabetes management class that I took last February? (The CSR must run a query to see if the client's policy, as of last February, covered the class.)
- What do you have as my current mailing address?
- When is my copay going to increase?
- Was Actos not a covered medication on my policy last year? When I spoke to another CSR last March 12th, she told me my policy covered it. (The CSR must now run a query to determine whether or not the company's database, on that March 12th, indicated that Actos was covered for that client. Because the database now indicates that it was not covered for that client last year, the database was in error if it ever did indicate that.)
Figure 1 is the extended taxonomy. The lower three levels of the hierarchy did not appear in the original diagram. We have added them because over the course of writing these articles, these distinctions have become increasingly important.
The original diagram referred to history, as in database history, reconstructable history, event history, etc., because historical data is the kind of noncurrent temporal data that businesses are most interested in. As our third question indicates, however, information about the future is also important. So because our topic is temporal data management, and not just the management of historical data, we have replaced history, on the original diagram, with time on this one.
Reconstructable time is a method of retrieving data about the past by restoring a backup copy of that data and, if necessary, applying transactions forward to the point in time you are interested in. Reconstructable time has been around as long as data managed by computers. With periodic backups of files or databases, and a transaction log, it was and is possible to recover the state of files or databases from any past point in time. There is no historical data recoverable by bi-temporal methods, or any other method, that is not also recoverable by means of backups and log files.
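The restore-and-roll-forward process can be sketched in a few lines of code. In this minimal sketch, the backup snapshot, the transaction log entries, and all the field names and dates are invented for illustration; real reconstruction would, of course, work against restored database files and a DBMS transaction log.

```python
from datetime import date

# Hypothetical backup snapshot: the state of the data as of the backup date.
backup = {"policy-123": {"copay": 15}}

# Hypothetical transaction log: (effective date, key, field, new value).
log = [
    (date(2007, 2, 10), "policy-123", "copay", 20),
    (date(2007, 6, 1),  "policy-123", "copay", 25),
]

def reconstruct(backup, log, as_of):
    """Rebuild the state of the data as of a past point in time."""
    state = {k: dict(v) for k, v in backup.items()}   # start from the snapshot
    for when, key, field, value in sorted(log):
        if when > as_of:
            break                                     # stop at the target point
        state.setdefault(key, {})[field] = value      # apply the transaction
    return state

# What did the policy look like on March 15th?
print(reconstruct(backup, log, date(2007, 3, 15)))
# {'policy-123': {'copay': 20}}
```

Note that everything before the final query - locating the right snapshot, selecting and ordering the relevant transactions - is preparatory work, which is exactly the cost this method imposes.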
However, among temporal data management methods, reconstructable time takes the longest and has the highest cost. IT operations must restore the backups if they are not already online. Transaction logs must be accessed. The correct starting point must be found, the transactions affecting the data we are interested in selected, and those selected transactions applied to roll forward to the correct ending point. Then, queries or reports must be run to retrieve the desired data. If they are production queries or reports, they point to current files, current databases and current tables. Before they are used to retrieve the historical data, they must be redirected to the historical copy that was just created - a process which usually involves changing database names, table names and even column names. After it's all over, there is always some amount of clean-up to do.
The cost can easily run into man-weeks of work and take several days to complete. If there is a large amount of data involved, the request is not likely to be made frequently; and if the consumers of that data are willing to wait several days for the results, then this method of temporal data management is perfectly acceptable. An annual audit might be an example in which these circumstances apply. But for a customer on the phone, this method of getting to historical data is not good enough for the CSR.
Concerning future data, there is a parallel to reconstructable history. Just as historical reconstruction starts by selecting a backup point in the past and rolling forward by applying transactions, future preconstruction starts with the current state of the database and rolls forward by applying transactions from a queue of transactions that have not yet been applied. Logically, this works. In actuality, there is seldom a nice queue of not-yet-applied transactions that is complete for the purposes of preconstruction. Nonetheless, it could be done, if enough effort were put into creating complete queues of future transactions.
Here's an even better idea. Eliminate the need to find a starting point and then a set of transactions to roll forward with. Set things up so that any query against current data can be turned into a query that retrieves data as of a specified past or future point in time, simply by supplementing the query with the appropriate date or timestamp. This simplification of access was a design objective of the TSQL2 language developed by computer scientists and proposed as a new SQL standard a decade ago by Dr. Richard Snodgrass. This simplification of access to past or future states also motivates our own asserted versioning framework for temporal data management.
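To make the "supplement the query with a date" idea concrete, here is a minimal sketch using an in-memory SQLite table. The table layout, column names, policy identifiers and dates are all assumptions made up for this illustration; the point is only that the current-state query and the as-of query are the same query, differing only in the date supplied.

```python
import sqlite3

# A versioned table: each row carries the period during which it was in effect.
# '9999-12-31' marks a version still in effect.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE policy_version (
                  policy_id TEXT,
                  copay     INTEGER,
                  eff_begin TEXT,
                  eff_end   TEXT)""")
db.executemany("INSERT INTO policy_version VALUES (?,?,?,?)", [
    ("P123", 15, "2007-01-01", "2007-06-01"),
    ("P123", 25, "2007-06-01", "9999-12-31"),
])

def copay_as_of(policy_id, as_of):
    # One query serves current, past and future states; only the date changes.
    row = db.execute("""SELECT copay FROM policy_version
                         WHERE policy_id = ?
                           AND eff_begin <= ? AND ? < eff_end""",
                     (policy_id, as_of, as_of)).fetchone()
    return row[0] if row else None

print(copay_as_of("P123", "2007-03-15"))   # 15: the copay last March
print(copay_as_of("P123", "2007-07-01"))   # 25: the copay after the increase
```

No restores, no log processing, no preparatory work: the temporal query is barely more complex than its non-temporal counterpart.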
So the constraints on our CSR, as she tries to get answers to the four questions, eliminate reconstructable time as an option. She needs queryable time, a method of accessing data in response to an online query without any preparatory work required. As Figure 1 shows, there are two kinds of queryable time.
The method of using event time starts with the initial state of something and keeps track of all the transactions that update it. It is most useful for managing balances - relationships that are described by identifying related parties and supplying various metrics to describe the state of the relationship. For example, a store might begin each year with the on-hand and on-order quantities for each of its products, and from that point forward keep track of orders to and shipments from its suppliers. A query about the on-hand or on-order quantity of any product can be satisfied by taking the starting balances and rolling the transactions forward for that product. Of course, when large volumes of transactions are involved, it is advisable to create various kinds of summary tables in order to reduce the number of transactions that need to be applied to get the desired answer.
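The balance roll-forward described above can be sketched as follows. The product names, quantities and dates are hypothetical; a real implementation would read the starting balances and transactions from tables, but the logic is the same.

```python
from datetime import date

# Period-start balance for each product (hypothetical figures).
starting_on_hand = {"widget": 100}

# Event log: (date, product, quantity change).
events = [
    (date(2007, 1, 5), "widget", +40),   # shipment received from supplier
    (date(2007, 2, 1), "widget", -25),   # shipped out to a customer
    (date(2007, 3, 9), "widget", +10),
]

def on_hand(product, as_of):
    """Roll the transactions forward from the starting balance."""
    qty = starting_on_hand.get(product, 0)
    for when, prod, delta in sorted(events):
        if when > as_of:
            break
        if prod == product:
            qty += delta
    return qty

print(on_hand("widget", date(2007, 2, 15)))   # 100 + 40 - 25 = 115
```

The summary tables mentioned above would simply precompute these rolled-forward quantities at intermediate dates, so that fewer events need to be applied per query.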
Balances are a special kind of relationship, one in which the salient information about the relationship is metric - consisting of quantities, counts and amounts. It should be apparent by now that the standard implementation of event time is the data mart and the fact/dimension star schema structure that it uses.
Except for balances, the best queryable method of managing temporal data is state time. State time is a method of temporal data management that retains all inserts and deletes and, for every update, both a before image and an after image.
State time applies to those things that have states, meaning those that can change over time. An event, like a withdrawal from a bank account, can't change. But a customer, such as an insurance policy holder, can. Balances can also change over time, but as previously stated, it is usually more efficient to keep track of balances by means of periodic snapshots of beginning balances and an accumulation of all the transactions that alter them.
From a logical point of view, event time and state time are interchangeable. No temporal data is preserved with one method that cannot be preserved with the other. We have these two methods because event time is best suited to keeping data about metric-based relationships, while state time is best suited to tracking changes to objects other than metric balances. But from a practical point of view, none of the four questions in our example can be answered by using event time. We don't build data marts to answer those kinds of questions. What we need is state time.
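The state-time update mechanics - retain the before image, add the after image - can be sketched like this. The customer, address values and dates are invented for illustration; the essential move is that an update never overwrites a row, it closes out the current version and inserts a new one.

```python
from datetime import date

END_OF_TIME = date(9999, 12, 31)   # marks a version still in effect

# Hypothetical versioned customer-address data.
versions = [
    {"cust": "C1", "address": "12 Oak St",
     "begin": date(2006, 5, 1), "end": END_OF_TIME},
]

def update_address(cust, new_address, when):
    """Close out the current version (before image), insert the new one."""
    for v in versions:
        if v["cust"] == cust and v["end"] == END_OF_TIME:
            v["end"] = when                     # the before image is retained
    versions.append({"cust": cust, "address": new_address,
                     "begin": when, "end": END_OF_TIME})   # the after image

def address_as_of(cust, as_of):
    for v in versions:
        if v["cust"] == cust and v["begin"] <= as_of < v["end"]:
            return v["address"]

update_address("C1", "7 Elm Ave", date(2007, 4, 1))
print(address_as_of("C1", date(2007, 1, 1)))   # 12 Oak St (the before image)
print(address_as_of("C1", date(2007, 5, 1)))   # 7 Elm Ave (the current state)
```

With versions kept this way, the CSR's second question ("What do you have as my current mailing address?") and questions about past states are answered by the same lookup, parameterized only by a date.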
Excluding the root node, which simply identifies what the taxonomy is about, we have thus far discussed five of the 12 nodes of the taxonomy. We have not yet encountered a method of managing temporal data that can provide answers to all four of the motivating questions while meeting the requirement of real-time responsiveness, that is, of providing an answer while the customer is still on the line.
I want to emphasize that while all versioning methods retain both before and after images of the rows they update, this does not mean that these methods will result in an explosive growth in data storage requirements. For one thing, it is transactions that consume most of the storage in almost any business. Also, many of the things that persist over time don't change that frequently. Additionally, it is always possible to periodically sweep versioned tables, removing historical data that is unlikely to continue to be needed in real time.
In our next column, we will continue moving through the taxonomy. When it is complete, we will have shown why asserted versioning is the only method of managing temporal data that can answer all four questions while also satisfying the dual constraints of a) real-time responsiveness to queries and b) temporal queries that are simple enough to be written by anyone who can write non-temporal queries.