If you are a data architect or a business analyst, chances are you’ve used some type of bitemporal data design during your career. Many of us have been exposed to aspects of the bitemporal design through time series data, temporal data, or historical data. You can find examples on the internet and in newspapers every day (e.g. ycharts.com, census.gov). To me, the approach is too good not to get an in-depth evaluation, hence this article. I will share the concept of bitemporal data and examine patterns for successful implementation.
What is Bitemporal Data?
Simply defined, bitemporal data means storing current and historical data, corrected and adjusted data, all together in the same place. Bitemporal means you are using two time dimensions simultaneously – one to represent business versions and one for corrections. For example, let’s say you have a database table of customers; in a bitemporal world, you would store changes (versions) of the customer’s data, over time, as well as any corrections, as new rows in the same table.
Customer data changes include attributes like the customer’s name, address or buying preferences. Corrections (some people call them adjustments) represent restatements of data that people or systems make to record the right value. Human typing errors or software errors create data that may later be corrected. I will come back to this topic, since it is very important to be able to distinguish when and how a data change occurred.
As I explain how the bitemporal design works, you will begin to see the benefits it provides to the end user – the analyst, the business user. There is a trade-off, no doubt: it takes more effort to store the data in bitemporal format, for the benefit and power of the data consumers. Once it is stored correctly, retrieving (querying) exactly the data you want is straightforward. Implementing bitemporal design is a classic case of spending proportionately more time on inserting data so that comprehensive and performant querying becomes possible.
Our New Regulatory Environment
It is no surprise that government regulations, and more importantly the ability to enforce them, have been swelling (think NSA) in recent years. Regulators have their work cut out for them, and they expect companies to help by digging into their data to provide evidence of events. That can mean proving you didn’t do something you shouldn’t have, as well as proving you did do something you were supposed to do.
In financial services, banks have been struggling since the 2008 financial crisis to improve their business processes and data completeness and quality, to meet new regulations. In many cases, they are reducing redundancy and consolidating – that means unifying many data store designs and styles together. The bitemporal approach has been resilient in dealing with mainstream data issues as well as edge cases.
Is the Cost Worth the Benefit?
From a budgeting perspective, the cost to implement bitemporal design is certainly higher than implementing a simpler, less comprehensive design. However, if it were possible to look into the future and quantify all the time business users will waste trying desperately to gather the data they need to answer the questions they and regulators have, there would be no contest. I am convinced bitemporal would be cheaper by a long shot. It’s a classic comparison of a fixed cost now versus an ongoing operational cost. Unfortunately, IT management changes so quickly that few want to make an investment whose rewards they may never see. If your horizon is more than a couple of years, read on.
For the technical readers, you’ve probably been wondering when I’m actually going to examine what bitemporal entails – now’s the time. The bitemporal approach requires four timestamp columns on each table: VALID_FROM_TS, VALID_TO_TS, AUDIT_FROM_TS and AUDIT_TO_TS. The Valid pair represents the period the row is valid from a business perspective. In a straight temporal design (or Kimball’s Type 2), these columns may be called EFFECTIVE_DATE and EXPIRY_DATE. (In general, the bitemporal design uses timestamps rather than dates in order to support intraday changes where applicable.) For example, a customer’s annual sales may be $1M from Jan 1 to Mar 31, then $1.2M from April 1 to June 30; these would be stored as two rows with differing VALID_FROM_TS and VALID_TO_TS ranges.
The Audit pair represents the period during which we believed the row was correct. The “current” row is denoted by AUDIT_TO_TS equaling the end of time (usually 9999-12-31). When a correction is needed, we set the AUDIT_TO_TS of the old row to now (or to whenever we learned it was incorrect), and add a new row with AUDIT_FROM_TS set to now and AUDIT_TO_TS set to the end of time.
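To make the four columns concrete, here is a minimal sketch of such a table using Python’s built-in sqlite3 module. The table mirrors the Country example used below; the names and the timestamp-as-text storage are illustrative simplifications, not part of any particular product’s design.

```python
import sqlite3

# Illustrative bitemporal table. Timestamps are stored as ISO-8601 text
# here purely for simplicity; a real warehouse would use native types.
EOT = "9999-12-31 00:00:00"  # "end of time" sentinel

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE country (
        country_id    INTEGER,  -- surrogate identifier
        country_code  TEXT,
        country_name  TEXT,
        valid_from_ts TEXT,  -- business validity begins
        valid_to_ts   TEXT,  -- business validity ends (EOT = until further notice)
        audit_from_ts TEXT,  -- when we began believing this row
        audit_to_ts   TEXT   -- when we stopped believing it (EOT = current)
    )
""")
```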
Corrections/Adjustments vs Business Updates
Here is where we get into subtleties. I have seen at least five different ways that applications store their corrections (or adjustments). Before we get into this discussion, let me define a correction (a correction is synonymous with an adjustment): it is the changing of data from a previous value, which is considered to be incorrect, to another value, which is considered to be correct. There can be one or more reasons for a correction, but the reason doesn’t affect how it is stored.
A correction differs from a business update in that a business update is a change in data resulting from normal-course business events, while a correction is a restatement of incorrect, previously entered or loaded data. For example, suppose you needed to store country codes and you chose UA for the United States; if a manager then checked it and advised you it should be US, the revised country code is classified as a correction. However, if the US government applied to ISO to have its country code changed to UX and the request was granted, the change would be classified as a business update.
Here is a summary of the basic CRUD operations and how they are represented bitemporally in this fictitious table which contains country codes.
Use Case 1
A new row was added to our Country table for the United States. The Country ID is a surrogate identifier and can be ignored for this discussion. The US came into existence as a country on July 4, 1776 and that is what we assign to VALID_FROM_TS (Valid From Timestamp). Our system just launched on December 1, 2015 and we loaded the row into our data warehouse at 10:00 on the first day, hence our assignment to AUDIT_FROM_TS. The US is a country (considered correct) until further notice so we assign End of Time (EOT) in the VALID_TO_TS.
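The insert described above might look like this, sketched with Python’s sqlite3 against an illustrative country table (the table layout and the sentinel value are assumptions for the example, not a prescribed schema):

```python
import sqlite3

EOT = "9999-12-31 00:00:00"  # "end of time" sentinel
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE country (
    country_id INTEGER, country_code TEXT, country_name TEXT,
    valid_from_ts TEXT, valid_to_ts TEXT,
    audit_from_ts TEXT, audit_to_ts TEXT)""")

# Use Case 1: a bitemporal insert. Business validity starts at the
# real-world event (July 4, 1776); the audit period starts at load time
# (December 1, 2015 at 10:00). Both "to" columns are open-ended (EOT).
conn.execute(
    "INSERT INTO country VALUES (?, ?, ?, ?, ?, ?, ?)",
    (1, "US", "United States",
     "1776-07-04 00:00:00", EOT,        # VALID_FROM_TS, VALID_TO_TS
     "2015-12-01 10:00:00", EOT))       # AUDIT_FROM_TS, AUDIT_TO_TS
```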
Use Case 2
Let’s say the US decided to cease being a country at the end of 2015 and split apart into individual states (much as the USSR did in 1991). To delete the US as a country, we first set AUDIT_TO_TS on the current row – meaning this was our correct information up to that point. Then we insert a new row which represents the US’s last day of being a country, setting VALID_TO_TS to the end of 2015. The AUDIT_FROM_TS on the new row represents when we loaded it into the database. Note that we had advance warning and hence correct information about a pending future event.
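The two steps of this logical delete can be sketched as follows (illustrative SQLite code; the load timestamp is a made-up value standing in for when we learned of the pending event):

```python
import sqlite3

EOT = "9999-12-31 00:00:00"
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE country (
    country_id INTEGER, country_code TEXT, country_name TEXT,
    valid_from_ts TEXT, valid_to_ts TEXT,
    audit_from_ts TEXT, audit_to_ts TEXT)""")
# Starting point: the current US row from Use Case 1.
conn.execute("INSERT INTO country VALUES (?,?,?,?,?,?,?)",
             (1, "US", "United States",
              "1776-07-04 00:00:00", EOT, "2015-12-01 10:00:00", EOT))

NOW = "2015-12-15 09:00:00"  # when we load news of the pending split

# Step 1: close the audit period of the current row -- it was our
# correct information only up to this point.
conn.execute(
    "UPDATE country SET audit_to_ts = ? "
    "WHERE country_code = 'US' AND audit_to_ts = ?", (NOW, EOT))

# Step 2: insert the replacement row. Business validity now ends at the
# end of 2015; the row is believed correct from NOW until further notice.
conn.execute("INSERT INTO country VALUES (?,?,?,?,?,?,?)",
             (1, "US", "United States",
              "1776-07-04 00:00:00", "2015-12-31 23:59:59", NOW, EOT))
```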
Use Case 3
In this case we are dealing with a correction. Let’s say our data steward accidentally clicked Save before fully spelling out United States. When the correction is made, we expire the incorrect row (like a delete) and then insert a new row with the correct data. In this case, the correction was made the next morning. Note that the correction has no bearing on when the US started or ceased to be a country; hence no changes occur to VALID_FROM_TS and VALID_TO_TS.
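A sketch of this correction (the misspelled name and the correction timestamp are invented for illustration):

```python
import sqlite3

EOT = "9999-12-31 00:00:00"
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE country (
    country_id INTEGER, country_code TEXT, country_name TEXT,
    valid_from_ts TEXT, valid_to_ts TEXT,
    audit_from_ts TEXT, audit_to_ts TEXT)""")
# The steward clicked Save before finishing the name:
conn.execute("INSERT INTO country VALUES (?,?,?,?,?,?,?)",
             (1, "US", "United Stat",
              "1776-07-04 00:00:00", EOT, "2015-12-01 10:00:00", EOT))

FIX_TS = "2015-12-02 09:00:00"  # the correction is made the next morning

# Expire the incorrect row in the audit dimension (like a delete)...
conn.execute("UPDATE country SET audit_to_ts = ? WHERE audit_to_ts = ?",
             (FIX_TS, EOT))
# ...then insert the corrected row. VALID_FROM_TS and VALID_TO_TS are
# untouched: the typo says nothing about when the US existed as a country.
conn.execute("INSERT INTO country VALUES (?,?,?,?,?,?,?)",
             (1, "US", "United States",
              "1776-07-04 00:00:00", EOT, FIX_TS, EOT))
```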
Use Case 4
In this contrived case, let’s say the US officially changed its name from United States to United America on January 1, 2016. This is a business update – all previous information is still relevant and correct. To perform a bitemporal update, we first perform a delete (Use Case 2) and then an insert (Use Case 1). You’ll end up with two new rows – one representing the previous version up to a point in time, and one representing the new version from that point in time onwards.
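Putting the delete-then-insert together in one sketch (again illustrative SQLite; the load timestamp is assumed, since the article only specifies the January 1, 2016 effective date):

```python
import sqlite3

EOT = "9999-12-31 00:00:00"
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE country (
    country_id INTEGER, country_code TEXT, country_name TEXT,
    valid_from_ts TEXT, valid_to_ts TEXT,
    audit_from_ts TEXT, audit_to_ts TEXT)""")
conn.execute("INSERT INTO country VALUES (?,?,?,?,?,?,?)",
             (1, "US", "United States",
              "1776-07-04 00:00:00", EOT, "2015-12-01 10:00:00", EOT))

LOAD = "2015-12-20 08:00:00"     # when we load the announced change
CUTOVER = "2016-01-01 00:00:00"  # when the name change takes effect

# Delete step (Use Case 2): expire the current row's audit period, then
# reinsert it with business validity ending at the cutover.
conn.execute("UPDATE country SET audit_to_ts = ? WHERE audit_to_ts = ?",
             (LOAD, EOT))
conn.execute("INSERT INTO country VALUES (?,?,?,?,?,?,?)",
             (1, "US", "United States",
              "1776-07-04 00:00:00", CUTOVER, LOAD, EOT))

# Insert step (Use Case 1): the new version, valid from the cutover onward.
conn.execute("INSERT INTO country VALUES (?,?,?,?,?,?,?)",
             (1, "US", "United America", CUTOVER, EOT, LOAD, EOT))
```

The result is exactly the two new current rows described above: the old version bounded at the cutover, and the new version open-ended from the cutover.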
The timeline below shows what is happening in the two time dimensions for each Use Case. The blue bars represent business data (VALID_FROM_TS and VALID_TO_TS) and the green bars represent “as-of” data (AUDIT_FROM_TS and AUDIT_TO_TS).
To query your country data, adjust the WHERE clause in your SQL or BI tool:
How do you find the latest version of an entity? Add VALID_TO_TS = '9999-12-31' AND AUDIT_TO_TS = '9999-12-31'.
How do you find the version for a specific business date? Add that date BETWEEN VALID_FROM_TS AND VALID_TO_TS, along with AUDIT_TO_TS = '9999-12-31' to use the current belief.
How do you find the data as it looked in the past? Add the as-of timestamp BETWEEN AUDIT_FROM_TS AND AUDIT_TO_TS.
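The three query patterns can be sketched as follows. The table and its three rows are illustrative, following the United States / United America example, with a superseded row, and date strings standing in for real timestamps:

```python
import sqlite3

EOT = "9999-12-31"
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE country (
    country_code TEXT, country_name TEXT,
    valid_from_ts TEXT, valid_to_ts TEXT,
    audit_from_ts TEXT, audit_to_ts TEXT)""")
conn.executemany("INSERT INTO country VALUES (?,?,?,?,?,?)", [
    # superseded belief: held from 2015-12-01 until 2015-12-20
    ("US", "United States",  "1776-07-04", EOT,          "2015-12-01", "2015-12-20"),
    # current belief: old name valid until 2016, new name from 2016 onward
    ("US", "United States",  "1776-07-04", "2016-01-01", "2015-12-20", EOT),
    ("US", "United America", "2016-01-01", EOT,          "2015-12-20", EOT),
])

# 1. Latest version: open-ended in both time dimensions.
latest = conn.execute(
    "SELECT country_name FROM country "
    "WHERE valid_to_ts = ? AND audit_to_ts = ?", (EOT, EOT)).fetchone()

# 2. Version for a specific business date, per our current belief.
mid_2015 = conn.execute(
    "SELECT country_name FROM country "
    "WHERE ? BETWEEN valid_from_ts AND valid_to_ts AND audit_to_ts = ?",
    ("2015-06-15", EOT)).fetchone()

# 3. The data as it looked in the past: what did we believe on 2015-12-10?
believed = conn.execute(
    "SELECT country_name FROM country "
    "WHERE valid_to_ts = ? AND ? BETWEEN audit_from_ts AND audit_to_ts",
    (EOT, "2015-12-10")).fetchone()
```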
So Who Is Using Bitemporal?
At some point, you’ll get the question: who uses bitemporal? Several big names in the technology world either support or use the bitemporal approach: Oracle (12c), IBM (DB2), Teradata, and SAP’s Finance and Accounting ERP. The bitemporal approach is actually part of the SQL:2011 standard (the best free summary to read is en.wikipedia.org/wiki/SQL:2011). If you are wondering why it isn’t more widely known, it’s because adopting it is a pretty big change to existing applications. It will take some time for companies to bring the bitemporal approach in – as with many new technologies, the first implementation will be challenging.
We’ve Been Doing This for Years
Some of you are reading and thinking, “We’ve already been doing this for years; we just didn’t call it bitemporal”. Agreed. As I acknowledged in the introduction, many of us have been using some aspects of bitemporal. Creation Date, Last Updated Date, Is Current Flag, Load Time (or variations of these) are column names I have seen, which represent the aspects to which I am referring. The bitemporal approach is not revolutionary – it simply puts it all together.
Having one table holding current and historical, correct and previously adjusted data means fewer joins in your queries. Queries against traditional DBMSs often slow down with many joins. Granted, this is less of a problem with today’s analytic platforms like Teradata IDW/Aster, Oracle Exadata and IBM PureData. But as data volumes have grown, the little things matter again – changing the length of a column from 12 to 10 characters can save hundreds of gigabytes of storage.
Bitemporal vs Snapshots
Many systems I have encountered take a snapshot at a point in time, usually end of the calendar or fiscal month, and record the measurements. Once a month was fine a couple of decades ago but the speed of business is accelerating. Waiting once a month to get the data you need to make business decisions is now becoming a competitive disadvantage. And not surprisingly, regulators are asking for information more frequently. Is it possible to convert a monthly system to weekly? Yes of course, with a significant amount of conversion work.
The bitemporal approach handles yearly, quarterly, monthly, weekly, daily or intraday data all in the same design. Snapshots are still valuable because they give report writers and analysts simple queries against discrete points in time. But snapshots do not handle corrections well, and they are “data heavy” – a new record is cut even if nothing changed since the last snapshot. It is possible to simulate snapshots on top of a bitemporal design.
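To illustrate that last point, a month-end snapshot can be reconstructed from a bitemporal table with a single as-of query. This sketch reuses the annual-sales example from earlier; the customer name, figures and helper function are made up for illustration:

```python
import sqlite3

EOT = "9999-12-31"
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sales (
    customer TEXT, annual_sales INTEGER,
    valid_from_ts TEXT, valid_to_ts TEXT,
    audit_from_ts TEXT, audit_to_ts TEXT)""")
conn.executemany("INSERT INTO sales VALUES (?,?,?,?,?,?)", [
    # $1M from Jan 1 to Mar 31, then $1.2M from Apr 1 onward
    ("Acme", 1000000, "2016-01-01", "2016-04-01", "2016-01-01", EOT),
    ("Acme", 1200000, "2016-04-01", EOT,          "2016-04-01", EOT),
])

def snapshot(as_of):
    """Simulate a snapshot: the current belief about a business date."""
    return conn.execute(
        "SELECT customer, annual_sales FROM sales "
        "WHERE ? >= valid_from_ts AND ? < valid_to_ts AND audit_to_ts = ?",
        (as_of, as_of, EOT)).fetchall()

# Month-end "snapshots" fall out of the same rows -- nothing extra is stored.
january = snapshot("2016-01-31")
june = snapshot("2016-06-30")
```

Unlike stored snapshots, no new record is cut when nothing changes; every as-of date is served from the same two rows.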
Having multiple methods and strategies for storing current and historical data, and corrections introduces unnecessary costs and complexity. Multiple approaches make it difficult for report writers to remember how they need to treat each data source. The worst case scenario is that the data repository fails to become “self-serve” and instead business users have to ask the technical staff to write their reports. It is safe to say that the bitemporal approach leans more to higher fixed costs and lower operational costs compared to non-bitemporal approaches. The first time a regulator comes in and gives you 24 hours to answer a set of questions, the bitemporal design will be a relief.
Does Bitemporal data work with Big Data?
Assuming that your Big Data platform allows updates (Hive 0.14 and up supports updates), then you can implement bitemporal on it. How updates are performed under the covers is only important for discussions of performance. Note that the only update that ever occurs in the bitemporal design is to the AUDIT_TO_TS column – all other “updates” are reflected by creating a new row.
Implementation Patterns for Success
This section focuses on my suggestions for successful implementation of the bitemporal design. Firstly: Education. You will need to sell it (not so much to management as to your colleagues – ETL developers, report writers, those who will use it daily). The bigger your organization, the more resistance to change you will encounter. Bitemporal design will be new to most of your team. If they don’t understand it, they won’t support it.
You’ll need to go further than just explaining what it is. You’ll need to provide real working examples, preferably with your real data. You’ll need to show how the consumption side works – how do you form queries to get exactly what you want. And you’ll need to explain/present this material multiple times – new concepts and complex concepts aren’t absorbed in one shot.
Secondly: Support. Developers, QA team, Report Writers will all get stuck at some point. You need to provide assistance when they need it. You may need to write some queries together so that the team member has a template for future situations. Some people stick with what they know and it may not be compatible with the bitemporal design. You’ll need patience.
In some respects, none of these suggestions is specific to bitemporal – any project is more successful when it follows these recommendations; it’s just even more critical when adopting the bitemporal design.
“It’s too hard to use (to query)”
“It’s too much work (to implement)”
“We don’t have time (to implement bitemporal)”
“It won’t perform (queries will run slowly)”
I hear these concerns and complaints every time I introduce bitemporal design. But honestly, these concerns are not specific to the bitemporal approach; almost any new or slightly more advanced concept is met with the same complaints. Hearing them at the project start can actually be beneficial: if developers can compare the concept to their current knowledge base and declare the new way more challenging, it means they at least somewhat understand it. I generally see two groups form when the bitemporal approach starts to sink in. One group “gets it” and becomes believers in the benefits. The other group just sees it as extra work.
Winning over the second group is actually most important to a successful implementation. They tend to be the practical ones – they don’t care much about the design and instead want to successfully complete their work. They want to look good (who doesn’t?). It’s important to educate them (sit with them and walk through working examples), show them how to use the “system”, and show how they can be productive with it.
The bitemporal approach does introduce more joins and WHERE clauses to your SQL. The concern about performance might have been relevant 5-10 years ago when our hardware and software solutions were less mature. Query engines are now much better optimized and can analyze WHERE clauses effectively. I have yet to see the bitemporal design labelled the culprit in a slowly executing system.
This article really only scratches the surface of bitemporal data. The edge cases can be tough to implement, but at least thought has been put into the design for handling them. Luckily the edge cases are rare, and you can fold them into the basic cases if desired. There are some topics I did not discuss, such as how you perform a bitemporal join and how you correct the past – these are best suited for a Part 2 which I hope to write if there is enough interest in this first article. If you like this article and want to learn more, let me know; also, consider reading the following book: Managing Time in Relational Databases by Tom Johnston and Randall Weis. I hope this article gave you a small taste of the power of the bitemporal design.
(About the author: Mike Lapenna is a data architect and modeller, and president of Inti Business Services Corp. based in Toronto, Canada. Mike has over 30 years of technical experience and specializes in master data management and the bitemporal approach. Financial Services is Mike’s main client industry, but the bitemporal approach is equally applicable in all sectors. If you have questions or comments on this article, Mike can be reached at firstname.lastname@example.org.)