Welcome back for more bitemporal discussion and reflection. This article is Part 2 of a two-part series on bitemporal data. The first article covered topics such as defining bitemporal data, costs vs benefits of implementing bitemporal, corrections vs business updates, and use cases showing how bitemporal works for inserts, updates, deletes and corrections.

The first article can be found here: http://www.information-management.com/news/big-data-analytics/the-emerging-data-design-bitemporal-data-10028023-1.html or here: https://www.linkedin.com/pulse/emerging-data-design-bitemporal-mike-lapenna.

This article will build upon the first by discussing pre-expiry vs open-ended, slim vs bulky bitemporal, backdated changes, bitemporal foreign keys and a few more interesting use cases. If you are new to bitemporal or want a refresher, it is recommended to read Part 1 first. Let’s start with a brief industry update.

Industry Update

Since Part 1 was written, Microsoft has released SQL Server 2016, which supports temporal tables built in. It uses the term "System-Versioned Temporal Table" and essentially relies on a pair of columns, VALID_FROM and VALID_TO, to track validity. It is a two-table solution, roughly one Kimball Type 1 table (current data) paired with one Kimball Type 2 table (historical data).

MSSQL 2016 doesn't specifically address corrections, but correction handling can be layered on top of its solution. With Microsoft joining the club, we now have Oracle, IBM (DB2), Teradata and Microsoft supporting some portion or all of the bitemporal design.
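As a point of reference, here is a minimal sketch of how a system-versioned temporal table is declared in SQL Server 2016. The table, column and history-table names are my own illustrations, not taken from any particular system:

CREATE TABLE dbo.Customer
(
    CustomerID   INT           NOT NULL PRIMARY KEY CLUSTERED,
    CustomerName NVARCHAR(100) NOT NULL,
    Tier         INT           NOT NULL,
    ValidFrom    DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo      DATETIME2 GENERATED ALWAYS AS ROW END   NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.CustomerHistory));

-- Querying the table as of a point in system time:
SELECT CustomerID, Tier
FROM dbo.Customer
FOR SYSTEM_TIME AS OF '2016-04-30'
WHERE CustomerID = 1;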

At the same time, there are specialists in the industry who feel the SQL:2011 specification falls short of the vision for bitemporal data handling, specifically around querying exactly what you need.

I was recently asked to comment on one paper (headed to the ISO), which will hopefully contribute to making the bitemporal experience easier to adopt and use. Two specific trends will continue to support the bitemporal design:

1) expanding temporal support from the DBMS vendors, and

2) more critical thinking about the SQL specification itself.

The two trends are complementary.

If all the majors are supporting built-in temporal tables, why should you read this article on how to manage bitemporal tables manually? I can offer three reasons:

1) The DBMS implementations are still first generation. In general, they don't implement bitemporal data 100%; the handling of corrections in particular is lacking. Hence you may need to augment the implementation.

2) You will think about and use the new temporal keywords and commands more carefully. This article describes what the DBMSs are doing (or should be doing) behind the scenes. DBMS interpretations of the specification will differ and hence will have their own performance implications.

3) You may be asked to use a DBMS which doesn't plan to implement the SQL:2011 specification, or you may be using an older version of a DBMS which doesn't support bitemporal. I am sure some of you can relate when I say that some companies are literally a decade behind the current version.

Pre-Expiry vs Open-Ended?

In a data warehouse, you will likely have facts and dimensions. Just so we use the same reference, a dimension might be parties (e.g., companies, individuals and associations) and a fact would be measurements on the parties (e.g., annual donations in dollars, carbon footprint in tons of CO2, amount of real estate owned in square meters, and credit rating).

For dimensions, our VALID_TO is usually open-ended (set to the end of time) because we don't know when the party will cease to be a party (e.g., a company closes or is acquired, or an individual passes away), and that event is normally not pre-scheduled.

 

On the other hand, a fact can be open-ended or pre-expired, depending on the type of measurement calculated. In the example above, annual donations are pre-expired; each covers a one-year period. The donation period (VALID_FROM and VALID_TO) can differ among parties, but donations usually cover a past period (one might argue you could have a YTD measurement, but that is really a fixed VALID_FROM date running to now, and it will eventually have an end date).
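To make the distinction concrete, here is a hedged sketch of the two patterns as insert statements; the table names, columns and values are illustrative only:

-- Dimension row: open-ended validity, because we don't know when the party will cease to exist
INSERT INTO PARTY_DIM (PARTY_ID, PARTY_NAME, VALID_FROM, VALID_TO, AUDIT_FROM, AUDIT_TO)
VALUES (1, 'XNF', '2016-01-01', '9999-12-31', '2016-01-01', '9999-12-31');

-- Fact row: pre-expired validity, because the annual donation covers a known one-year period
INSERT INTO DONATION_FACT (PARTY_ID, DONATION_AMT, VALID_FROM, VALID_TO, AUDIT_FROM, AUDIT_TO)
VALUES (1, 25000.00, '2015-01-01', '2016-01-01', '2016-01-15', '9999-12-31');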

Carbon footprint measurement would also normally have a pre-defined range, such as 2016-01-01 to 2016-06-30, much like the annual donations. However, real estate owned and credit rating are more likely to be open-ended. Credit rating calculations are sometimes scheduled but often are not, depending on a party's size. Real estate owned is most likely open-ended because it often stays constant for some time and the change date is unpredictable.

Slim vs Bulky Bitemporal

Slim bitemporal means there is one row for one contiguous period. Bulky means there are several rows for a single period. Both are equivalent in terms of representation and accuracy. Here is an example:
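A hedged sketch of the two shapes, using illustrative values:

Slim (one row covers the whole contiguous period):
  ID=1, CUSTOMER_NAME=XNF, TIER=1, VALID_FROM=2016-01-01, VALID_TO=9999-12-31, AUDIT_FROM=2016-04-01, AUDIT_TO=9999-12-31

Bulky (three rows with adjacent Valid ranges and identical business values):
  ID=1, CUSTOMER_NAME=XNF, TIER=1, VALID_FROM=2016-01-01, VALID_TO=2016-02-01, AUDIT_FROM=2016-04-01, AUDIT_TO=9999-12-31
  ID=1, CUSTOMER_NAME=XNF, TIER=1, VALID_FROM=2016-02-01, VALID_TO=2016-03-01, AUDIT_FROM=2016-04-01, AUDIT_TO=9999-12-31
  ID=1, CUSTOMER_NAME=XNF, TIER=1, VALID_FROM=2016-03-01, VALID_TO=9999-12-31, AUDIT_FROM=2016-04-01, AUDIT_TO=9999-12-31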

In the Bulky approach, even though there are 3 rows for ID=1, it doesn’t matter which date we choose for the Valid period in a query; it will return the same values for Customer Name and Tier. The AUDIT_FROM=2016-04-01 means all the rows were loaded on April 1, 2016.

Most DBMSs apply some form of compression to their data, so the Bulky approach may only result in a marginally larger footprint on disk. So you may ask why I am even describing the Bulky approach. The reason is that I have seen some systems religiously take snapshots of all their tables, regardless of whether the data has changed or not! It is an interesting way of making queries a little easier to form, at the expense of repetitive data.

Backdated Changes

They don't happen too often, but when they do, everyone has an opinion as to how they should be applied. Delete the old record and insert the replacement? In my view, rewriting the past is never a good option. Record the correction in a new shadow table? Then reporting with the correct value in place of the incorrect one requires jumping through SQL hoops.

Backdated changes are corrections to data recorded in the past. Luckily, bitemporal handles them simply and uniformly. It really just requires two steps: 1) expire the old record (in the past), and 2) insert a replacement record (in the past). Consider this example:

 

In the first set, we had put Customer XNF in Tier 1, moved it to Tier 2 for the month of February (for example, because we heard rumors of a financial investigation), and then moved it back to Tier 1 after the claims proved unsubstantiated.

In the second set, our finance department called to explain that XNF was actually put into Tier 3 due to corporate policy; we had recorded the Tier change incorrectly, and Tier 3 is consistent with how we conducted business with XNF during February. Finance would like that reflected in all reports and calculations going forward. This is obviously a fictitious example, but every business has a case where, at some point, it needs to correct the past.

The blue shaded cell in the second row represents step 1 (the update) and the fourth row represents the insert in step 2. The date 2016-05-06 indicates when we applied the correction to the backdated data. Notice that the incorrect and correct data now coexist in the same table.
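A hedged sketch of the two steps as SQL; the table name and key values are assumptions based on the example, and the Valid range for February is taken as 2016-02-01 up to (but not including) 2016-03-01:

-- Step 1: expire the incorrect row by closing its audit period at the correction time
UPDATE CUSTOMER_TIER
   SET AUDIT_TO_TS = '2016-05-06'
 WHERE ID = 1
   AND VALID_FROM_TS = '2016-02-01'
   AND AUDIT_TO_TS = '9999-12-31';

-- Step 2: insert the replacement row for the same Valid period, carrying the corrected Tier
INSERT INTO CUSTOMER_TIER
    (ID, CUSTOMER_NAME, TIER, VALID_FROM_TS, VALID_TO_TS, AUDIT_FROM_TS, AUDIT_TO_TS)
VALUES
    (1, 'XNF', 3, '2016-02-01', '2016-03-01', '2016-05-06', '9999-12-31');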

Now we can retrieve either value, depending on the question to be answered:

What was XNF's Tier at the height of the allegations (in February), when we ran the Customer Tier Report at the end of April 2016?

select TIER
from [table]
where '2016-04-30' between AUDIT_FROM_TS and AUDIT_TO_TS
and '2016-02-15' between VALID_FROM_TS and VALID_TO_TS;

>> returns second row, TIER=2

What was XNF's Tier at the height of the allegations (in February), after the corporate policy was properly applied?

select TIER
from [table]
where AUDIT_TO_TS = '9999-12-31'
and '2016-02-15' between VALID_FROM_TS and VALID_TO_TS;

>> returns fourth row, TIER=3

Although I used dates in the example, all of them could have been timestamps.

Bitemporal Foreign Keys

In a non-bitemporal database, you are certainly familiar with a standard foreign key used in a query's WHERE clause, such as:

WHERE TableA.ColumnName = TableB.ColumnName

If the design kept history in the same table (i.e., it was a temporal table), you might have an additional WHERE clause predicate for the business period, such as:

WHERE TableA.ColumnName = TableB.ColumnName

AND TableA.BusinessDate = [BusinessDate] AND TableB.BusinessDate = [BusinessDate]

Or

WHERE TableA.ColumnName = TableB.ColumnName

AND TableA.BusinessDate = TableB.BusinessDate AND TableB.BusinessDate = [BusinessDate]

The way to describe a foreign key in bitemporal terms is that the entity in both tables must be valid at the same single point in time (in this example, [BusinessDate] is that point in time). Here is how the WHERE clause would look:

WHERE TableA.ColumnName = TableB.ColumnName

AND TableA.ValidFromTS <= [BusinessDate] AND [BusinessDate] < TableA.ValidToTS

AND TableB.ValidFromTS <= [BusinessDate] AND [BusinessDate] < TableB.ValidToTS

The WHERE clause does get lengthier, but the pattern is repeated.
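Putting it together, here is a hedged sketch of a full query joining two bitemporal tables at a single point in business time, taking the latest audit view of each side; the AuditToTS columns are assumed to exist on both tables, analogous to the audit columns used earlier:

SELECT A.*, B.*
FROM TableA A
JOIN TableB B
    ON A.ColumnName = B.ColumnName
WHERE A.ValidFromTS <= [BusinessDate] AND [BusinessDate] < A.ValidToTS
  AND B.ValidFromTS <= [BusinessDate] AND [BusinessDate] < B.ValidToTS
  AND A.AuditToTS = '9999-12-31'
  AND B.AuditToTS = '9999-12-31';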

Some of you may wonder about performance. For any analytical DBMS (i.e. OLAP), I have found negligible difference in query performance. If you are using a DBMS designed for multi-use or OLTP, you *may* see some performance variations but if you create indices on these columns, it should perform well.
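For example, a composite index covering the join key and the temporal columns might look like the following; the index and column names are illustrative, and exact syntax varies by DBMS:

CREATE INDEX IX_TableA_Bitemporal
    ON TableA (ColumnName, ValidFromTS, ValidToTS, AuditToTS);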

More Complicated Use Cases

Finally, the interesting stuff. I put this table together to try to explain all the scenarios we might have to deal with, depending on the complexity of the business.

The first column is the scenario name, and the second is the number of new rows we expect for that scenario. The third column represents the number of rows that are expired (AUDIT_TO is set to the current timestamp). The last column is a visual representing the active VALID_FROM and VALID_TO range, hypothetically for a single ID/entity, before and after conducting the scenario. The first two rows represent the basic cases, while the remaining rows represent corrections.

 

The Lengthen scenarios end up with a wider Valid range. The reason for the 1+ on Updated rows is that a Lengthen could cause several existing adjacent rows to be expired to make way for the new, wider period. The visual shows a simple case where only one valid range exists; in reality you would have a chain of rows with adjacent valid ranges (like the Merge "Before" visual).

The Shorten scenarios are fairly simple, but if you expect to fill the resulting time gap with another row, you essentially have a Split. A Split just allows you to break up one large Valid range into several; each new row is a replica of the original, and each can then be corrected as desired. Finally, a Merge simply replaces several rows with one. In terms of new/updated row counts, a Merge is the reverse of a Split, and if you look closely, operationally a Merge is like converting Bulky bitemporal to Slim bitemporal. When you merge, you may have to make some decisions, because the business data may differ across the original shorter-period rows.

Let’s say the three rows had one business attribute with values 17, 18 and 20. Which value should you store? Is it the max, the min, an average, the first, the last, user selected? A merge is a delicate case, and the easiest decision may be to forego the merge and just keep the individual rows, even if it seems like you are unnecessarily using extra storage.
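To illustrate one of these scenarios, here is a hedged sketch of a Split: the original wide row is audit-expired and two adjacent replicas are inserted that together cover the same Valid range. The table name, key, tier and dates are illustrative only:

-- Expire the original wide row as of now
UPDATE CUSTOMER_TIER
   SET AUDIT_TO_TS = CURRENT_TIMESTAMP
 WHERE ID = 1
   AND VALID_FROM_TS = '2016-01-01' AND VALID_TO_TS = '9999-12-31'
   AND AUDIT_TO_TS = '9999-12-31';

-- Insert adjacent replicas covering the same overall range; each can now be corrected independently
INSERT INTO CUSTOMER_TIER
    (ID, CUSTOMER_NAME, TIER, VALID_FROM_TS, VALID_TO_TS, AUDIT_FROM_TS, AUDIT_TO_TS)
VALUES
    (1, 'XNF', 1, '2016-01-01', '2016-07-01', CURRENT_TIMESTAMP, '9999-12-31'),
    (1, 'XNF', 1, '2016-07-01', '9999-12-31', CURRENT_TIMESTAMP, '9999-12-31');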

Wrap Up

I hope this article provided a little more insight into the bitemporal approach. By no means did I cover every single complexity that can arise.

For example, there are additional topics such as "invisible rows" and "initial vs historical data". But for the most part, you now have the fundamental understanding you need to tackle issues as they come up. My hope is that the SQL:2011 specification is expanded and improved in the next couple of years so that the bitemporal approach becomes more entrenched, more accessible, and more functional. I invite your experiences and comments on this article.

(About the author: Mike Lapenna is a data architect and modeller based in Toronto, Canada. Mike has over 30 years of technical experience and specializes in master data management and the bitemporal approach. Financial services is generally Mike’s focus but the bitemporal approach is equally applicable in all sectors. Contact Mike at 7mlapenna@gmail.com).
