Free Site RegistrationFree Site Registration

Sign up today and access Information Management on the web!
Your FREE registration entitles you to:

FREE email newsletters

FREE access to all Information Management content

FREE access to web seminars, resource portals, our white paper library and more!

Time and Time Again: Managing Time in Relational Databases, Part 4 - Preliminaries to Version Pattern 3

InfoManagement Direct, June 2007

Tom Johnston, Randall Weis

The first two version patterns, described in Part 3, support a minimal form of queryable history in business databases. But sometimes that isn't enough. For even if we use Version Pattern 2 - updating in place with logical deletion - there are still two important ways in which history can be lost.

First, as soon as the first update takes place, the row, as originally entered, is lost because the update overwrites the row. So we lose the information that was true of the policy on the date it became effective. Next, considering the entire row as a single unit of data, all updates but the most recent one are also lost, including even the knowledge of whether or not there were any previous updates. Sometimes this is acceptable; but sometimes it is not.

Advertisement

Note: this may suggest that we should manage versions at the level of individual row/columns, not at the level of entire rows. Although that would certainly be possible, we believe that the same semantics can be supported by versioning entire rows, and at a lower overall system cost. We will not consider row/column level versioning in the remainder of these articles.

Now we will consider a more stringent business requirement for history.

{BR-3}. Support {BR-2}, and also keep a complete queryable history on policies, i.e., on all inserts, updates and deletes to all policies.

As our history taxonomy indicates, our first choice is to keep this history as a chronological record of either events or states. We could keep an event-based history of changes to policies by adding a row to the Policy table each time a new policy was created, and after that keeping a transaction table in which each transaction was an update or delete to the policy.

Event-based history is most appropriately used to manage changes to metric values of relationships among persistent objects, values such as counts, quantities and amounts. These relationships that contain metrics are often called balances. Each change to a relationship metric is recorded as a transaction, and in the traditional star schema design, each transaction contains a date, the change to each metric, and a foreign key to each object that is affected by the transaction, objects such as customers, products, suppliers and so on.

However, event-based history is not appropriate for managing changes to non-metric values of persistent objects. For them, state-based history is the preferred option. Because we have chosen to keep a state-based history of policies, we must now decide whether to do so using snapshots or versions. Snapshots are the preferred method for recording the coordinated history of many rows, usually across many tables, as of a single point in time. However, if the same column in the same row is updated more than once between snapshots, only the last update will be captured. Therefore, since the business requirement is to track all changes to policies, we must do so using versions.

What Version Tables Version

Before we dive into the details of managing versioned history, we must first understand what is being logically inserted, updated and deleted in versioned tables. It is not versions. It is the objects they are versions of. The physical insertion of a version logically inserts, updates or deletes an object, and does so in the following way.

  • The physical insert of the first version for a policy is a logical insert of that object, i.e., that policy.
  • The physical insert of subsequent versions for that policy, which are not designated as a logical delete of the policy, are logical updates of the policy.
  • The physical insert of a version marked as a logical delete logically deletes the policy.

Note that "object" is being used here in the sense of "anything that persists over time," not in its narrower object-oriented sense. Customers, products, suppliers, contracts, employees, facilities, inventories, packages, invoices, purchase orders - all these are objects. For the most part, we are using insurance policies as illustrative objects in these articles.

A Policy Version Table

Each row in the original Policy table represents a policy (of course). But no row in the Policy Version table represents a policy! Instead, each row represents what we know about what a policy looked like during a designated period of time.

Multiple rows on the Policy Version table may have the same policy-nbr, of course. If this were a Policy table, that would be an error. But on a Policy Version table, it is not an error. On a Policy Version table, those rows, and only those rows, with a given policy number are versions of that policy. Those rows, and only those rows, contain information about that policy during a period of time.

So a version is a timeslice of an object. It is not the object itself. A complete sequenced set of versions is a full timeline for an object.

Since none of these rows in the Policy Version table represent policies, where are the rows that do? If this table is not a table of policies, where is that table?

The answer is that, on our implementation of versioned history, there is no Policy table. The reason is that we don't need one. For any point in time during which a policy was a policy, we can retrieve a row from the Policy Version table that represents the policy as of that point in time. The first row in the Policy Version table for a policy shows what the policy was like when it became effective. Subsequent rows show what the policy was like after each of the changes that happened to it and until the next one happened. If the Policy terminated, there is a version for that event also.

A Simplifying Assumption: One and Only One Episode per Object

Before proceeding, we will introduce two simplifying assumptions into our initial discussion of versions, assumptions which we will drop later on. The first one is this: objects can't recur.

For example, if Mary Jaspers' policy P138 terminated on a given date, but the next day Mary called and said she wanted to continue her policy, then given this assumption, AHIC would have to issue her a new policy. If Mary wanted no changes to her policy, the new policy would be identical to the old one except for its policy ID number, but it would in fact be a new, distinct policy.

If an object could recur, what would that look like, as far as data is concerned? In the case of Mary's policy, it would look like a logical delete of policy P138 followed, after one or more clock ticks, by another version for P138, which is not a logical delete. Normally, persistent objects re-appear after some period of time, i.e., after some possibly large number of clock ticks. So there can be gaps between episodes of the same object, although if there is only one clock tick between a logical delete and the next version of an object, those episodes would not have a gap between them.

This is illustrated in Figure 1.

Figure 1: Episodes of an Object

Since that next version is an initial version of the policy after a prior logical delete of that policy, it constitutes a reappearance of the policy. Let us call the first version for P138 (the Jan 1, 2005 version) the original initial version, and each other version which immediately follows a logical delete of P138 (the Jul 15, 2005 version) a successor initial version. The count of initial and successor versions for an object is a count of what we will call episodes of that object. Gaps may or may not exist between successive episodes. In this case, the gap extends from one clock tick after the first episode is terminated, to the clock tick on which the second episode begins.

If the number of logical deletes of an object is the same as the number of episodes, then, for as long as that is true, the object does not currently exist on our database. That is because every episode, including the most recent one, has been terminated by a logical delete.

The only other case that can arise is that the number of episodes is one greater than the number of logical deletes. That is because, in a series of versions for the same object, what splits them into episodes are the logical delete versions. In this case, the object does currently exist on our database because there is no logical delete version for the most recent (or only) episode.

So our simplifying assumption can also be expressed like this: (for now), each persistent object has one and only one episode. That episode, of course, can consist of any number of versions.

This single-episode constraint means that AHIC (our fictional insurance company) does not permit policies to recur. So if a policy lapsed for even a single day, and the policy holder wanted her policy reinstated immediately, she would be assigned a new policy. The initial version of that new policy would become effective one clock tick after the deletion of her previous policy. This is not a realistic way of doing business, of course, and later on we will consider scenarios in which a lapsed policy may be reinstated.

Another Simplifying Assumption: No Gaps Within Episodes

Our second assumption is that there can be no gaps, within a given episode of an object, between successive versions. One version must follow right after the other. That is, all noninitial versions of an episode of an object must begin on the next tick of the clock after the version they supercede ends. So in Figure 1, each arrow which is labeled "version {n} ends; version {n+1} starts" actually points to two adjacent clock ticks.

A corollary is that each nonterminal version of an episode of a persistent object must end on the tick of the clock immediately before the tick on which the version they precede begins. (See Part 2 for a discussion of clock ticks.)

Let us call a gap within an episode an intra-episodic gap, using the term "episode" as indicated above, to refer to the time span for an object that reaches from the begin date of its initial version to the end date of the first logical delete reached by following that object's versions in chronological sequence. And let us call the first assumption, that objects have only a single episode, the requirement that there be no inter-episodic gaps.

Note: Bear in mind that while we reference an event's commencement and conclusion as begin date and end date, these temporal markers would frequently be a more granular date/timestamp. As always, the key concept here is a "tick of the clock," at whatever level of granularity is appropriate for your specific application.

Semantic Constraints

Our two simplifying assumptions come to this: until further notice, these articles are discussing single-episode objects whose versions have no gaps. The semantic constraints imposed by these assumptions are as follows:

  • {SC 3-1}. If a version for an object is followed by another version for that same object, the end date of the former version must be one tick of the clock prior to the begin date for the latter version.
  • {SC 3-2}. If a version for an object is not followed by another version for that same object, and is not itself a logical delete version, the end date will be assumed to be unspecified. (This is not always the case, of course; and in later patterns, we will drop this simplifying assumption.)
  • {SC 3-3}. If a version for an object is a delete version, the object will be considered deleted as of that delete date.
  • {SC 3-4}. It is semantically illegitimate for any version of an object to follow (to be chronologically later than, based on its version date) a logical delete version of that object.

There are several issues raised by these semantic constraints, all of which we will discuss later on. They include:

  • The semantics of "unspecified" end dates. For some versions which do have an end date, that end date may be unknown at the time the version is entered. For others, which we can think of as "until further notice" versions, there is no end date, known or not, at the time they are entered. In the literature, an unspecified end date is sometimes called "forever." But forever implies that we know the version will never terminate, and there aren't many objects which have versions like that. And one may hope that if an end date for a version should be known, it is known at the time the version is entered. If we can assume that, it leaves the "until further notice" kind of version as the only kind of version with an unknown end date. But its semantics is not the same as the semantics of forever. This issue will be discussed in later installments in this series.
  • The existence of versions with a known end date. We must be able to manage versions of objects with a known end date, and also versions of objects whose end date is not known. As we will see later, this can prove quite challenging.
  • The fact, as already noted, that businesses often need to recognize the re-appearance of the same object after a gap in time. This is the need to permit multiple episodes of an object on the database. It is a common business requirement; customers come and go, products are offered for sale, later retired, and later still reintroduced, etc.
  • The fact that businesses often need to keep two or more versions with overlapping or identical effectivity periods.

This last point is especially important. In normal cases, versions cannot temporally overlap one another. If they did, there would be a period of time during which both were effective. But if they were both effective, how would we know which one was the truth? And what would the other one represent, if not the truth? These are topics for later articles in this series.

Corrections, Gaps and Overlaps

To avoid intra-episodic gaps, we must write code to insure that every new noninitial version begins exactly one tick of the clock after the version it is about to supercede. However, while necessary, this is not sufficient to guarantee that there will be no intra-episodic gaps. A second way in which such a gap can be introduced is with a correction.

Consider a non-initial version entered with an effective begin date that is earlier than it should be. In that case, we need to move its begin date forward. But doing so would introduce a gap between it and its precedessor version. By the same token, moving an effective end date back would introduce a gap between it and its successor version. To prevent such corrections from violating {SC 3-1}, we must always correct two versions when an effective begin date on a non-initial version is moved forward or when an effective end date on a non-terminal version is moved backward. One is the version itself that is being corrected. The other is the adjacent version, which must now be corrected to "fill in the gap" created by the first correction.

We should not think of these two physical updates as two transactions. There is one transaction; it is the movement of the transition point in time between two adjacent versions. This single semantic update requires two physical rows to be updated.

Semantic constraint {SC 3-1} also requires that updating an effectivity date range must not create an overlap. One caveat, of course, is that a correction which does not alter a version's effectivity date range must have exactly the same date range as the version it corrects.

Version Patterns 1 and 2 were both "update in place" patterns. In this installment, we have laid the groundwork for discussions of patterns which do not update in place, i.e., which do not overlay data already on the database. First, we introduced two simplifying assumptions, which we will eventually drop as we discuss increasingly complex patterns. Next, we introduced some concepts which will prove useful in those later discussions. The concepts were:

  • Timeslice
  • Timeline
  • Original initial version
  • Successor initial version
  • Episode
  • Gaps
    • Intra-episodic gaps
    • Inter-episodic gaps

    In Part 5, we will begin our discussion of Version Pattern 3.

Tom Johnston is an independent consultant specializing in enterprise data architecture, and in relational, object-oriented and data warehouse modeling in various industries, including telecommunications, health care, banking, retailing and transportation. He can be reached at tjohnston@acm.org, and his Web site is www.MindfulData.org.

Randall Weis, founder and CEO of InBase, Inc, has more than 24 years of experience in IT and IT management, specializing in enterprise data architecture. Weis' technical expertise is in sophisticated, multitiered systems. He has designed logical and physical data models and implemented several high profile, very large database (VLDB) systems in the financial and insurance industries. These systems have had very stringent performance and real-time history requirements. His software development company, InBase, Inc., has developed software and Web sites used by some of the nations largest companies. Weis has been a presenter at various user groups, including Guide, Share, Midwest Database Users Group and Camp IT Expo. His technique for modeling history, retro activity and future dating has been reviewed and approved for the physical implementation of IBM's Insurance Application Architecture (IAA). He may be reached via email at randyw@inbaseinc.com.

For more information on related topics, visit the following channels:

Advertisement

Advertisement