Note: A glossary of technical terms used in these articles can be found on the Web sites of the authors. The URLs are MindfulData.com and InbaseInc.com.

 

This glossary is being constructed as a controlled vocabulary. By that we mean that any expression involving temporal concepts which is used in the definition of a glossary entry must itself be a glossary entry.

 

One reason for being this careful with definitions is that this kind of rigor must be applied to definitions before they can be "ontologized," i.e., expressed in a formal notation like first-order logic. That, in turn, is what is needed to make definitions available for manipulation by software-realized inferencing engines.

 

Another reason for being this careful with definitions is that this is the kind of rigor that must be applied to any set of technical definitions before we can claim not only that we say what we mean, but also that we know what we mean when we say it.

 

As this series continues, context becomes increasingly important so that the thread of the discussion is not lost. Here is what we've done so far.

 

(See Figure 1: Chart of Installments to Date in this Series.)

 

Beginning with Part 19, we interrupted our presentation of the physical implementation of the versioning patterns, which we described earlier in the series, and began a discussion of temporal integrity constraints. On completing Part 24, we had introduced and defined temporal RI and related concepts, and discussed these constraints as they apply to delete and to update transactions. In this and the following article, we turn to temporal integrity constraints as they apply to insert and to upsert transactions.

 

The "ultimate" versioning pattern we are working toward is what we, following the computer science community, call a "bi-temporal" pattern. In Parts 19 through 25, we have been discussing both temporal entity integrity and temporal referential integrity constraints. But it is important to note that these discussions have taken place in the context of a "uni-temporal" pattern. The only time period considered has been an effective time period, and the only dates an effective begin and effective end date.

 

As we have seen, integrity constraints in a uni-temporal versioning context have been far from simple. As we will see, integrity constraints in a bi-temporal versioning context are considerably more complex than that.

 

Inserting a Versioned Object: the Original Insert Transaction

 

When the business issues an insert for an object, it doesn't care whether the object is versioned, and indeed may not know. It is directing us to insert a row representing an object that is not already represented by a row in the target table. Thus, an original insert transaction doesn't need to specify a version, just the object being inserted, its business key, whether or not match logic should be applied, and the date the insert is to take effect. This is illustrated in Figure 2.

 

As we have pointed out before, we are using dates throughout these articles only for convenience. The approach to versioning which we are presenting here applies no matter what the granularity of the clock ticks that measure out time. The granularity used here would be the correct granularity for transactions taking place in batch mode and being applied at most once per day. As soon as online individually applied transactions are involved, however, we would need a finer granularity, such as a full timestamp.
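For concreteness, here is a minimal sketch of the shape such a transaction might take. The field names are hypothetical, chosen only to mirror the elements listed above; they are not taken from Figure 2.

from dataclasses import dataclass
from datetime import date
from typing import Any, Dict

@dataclass
class OriginalInsert:
    # An original insert transaction names only the object, never a version;
    # whether the target table is versioned stays hidden from the source system.
    business_key: Dict[str, str]   # source system business key columns and values
    attributes: Dict[str, Any]     # the non-key columns to be inserted
    apply_match_logic: bool        # the "Match? (Y/N)" flag
    effective_begin: date          # a date suits once-a-day batch application;
                                   # online transactions would need a full timestamp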

 

 

Before we begin to examine an original insert transaction in detail, let's look specifically at the "Match? (Y/N)" flag in Figure 2.

 

Inserting a Versioned Object: Haven't We Met Before?

Objects of interest to an enterprise frequently appear, go away, and then reappear later on. One response to the fact of recurrence is to ignore it. When the object reappears, it is treated as a new object, and no attempt is made to match it up with the history we have for it.

 

But enterprises generally prefer to recognize a recurring object rather than treat it as a new object. For example, if the object is a customer, then if we don't attempt to determine whether or not she used to be a customer of ours, we won't be able to greet her and make her feel special, like a long-lost friend. We won't be able to say, "Welcome back. Glad to have you as a customer once again." We will also be unable to use the history of transactions we have with this customer to manage the reestablished relationship more intelligently. We know the kinds of things she likes to purchase from us. We may even know the kinds of advertising she is most responsive to. But unless we can recognize her as a returning customer, we will lose the competitive advantage that this past history can give us.

 

So the preferred response, if our target table is a versioned table, would be to look for a match of the object being inserted with a past version. Assuming that the primary keys of rows in the target table have not been back-propagated to their populating source systems, we must look for a match using the source system business keys of the objects. If we find a match, we will use the target table EID we found as the EID for the insert. This means, for a versioned target table, that it will be used as the EID for the first version of a new episode of the object. With this EID firmly attached to all the versions of all the episodes of the object, we retain the information we can glean from past episodes. We can say "Hi, and welcome back" to our returning customer. We will already know what she likes, and what she doesn't like.
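As a sketch of how that match step might work, assuming a simple in-memory picture of the versioned target table and the hypothetical transaction structure sketched earlier (the names resolve_eid, eid, and business_key are illustrative, not a prescribed schema):

from uuid import uuid4

def resolve_eid(txn, target_rows):
    # target_rows is assumed to be an iterable of existing version rows, each
    # carrying the object's EID and the source system business key it was
    # inserted under.
    if txn.apply_match_logic:
        for row in target_rows:
            if row["business_key"] == txn.business_key:
                # A past episode of this object exists: reuse its EID so the
                # new episode's versions attach to the object's full history.
                return row["eid"]
    # No match found, or matching suppressed: treat this as a brand-new object.
    return str(uuid4())

If a match is found, the returned EID ties the first version of the new episode to every version of every past episode of that object.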

 

But notice the "Match? (Y/N)" flag on the original transaction. Why do we need it? If we keep the business key on our versioned tables, and a business key is provided on the transaction, why can't we automatically check for a match?

 

In a perfect world, with objects that are assigned unique business keys, we would not need this match flag. But in the imperfect world of often dirty data that we IT professionals must manage, sometimes we can't rely on the business key. Sometimes we must even allow duplicate business keys, or keys with default values or {null}s in one or more columns, into our target tables.

 

For example, two independent source systems might provide insert, update, and delete transactions to the same target table. The same business key might be used in both systems, and the two rows thus designated might or might not represent the same object. One way we could find ourselves in this situation is if the two source systems were, up to now, managed rather "loosely," as systems for quasi-independent organizations. Mergers and acquisitions often start off with this "loosely coupled" approach because it is much quicker and much easier to bring online than full integration. But when the customer bases of the acquiring and the acquired companies overlap, as they often will, it is inefficient to manage the customers in common as different customers. It is often an inefficiency visible to those customers as well, whose impressions of our company will certainly not be enhanced by this annoying inability of ours to recognize them as one and the same customer.

 

Another example would be overlapping product lines across the manufacturing plants of an acquiring and an acquired company. Here, too, there is a loosely coupled and a tightly coupled approach. Here, too, moving from the former to the latter will likely uncover latent business key collisions and make them fully explicit.

 

These M&A scenarios are often exacerbated by the not infrequent practice of using a system-generated sequence number as the last column of a business key. When this happens, the odds of identical cross-system business keys representing different objects quickly escalate. For example, suppose each of two systems has three business keys, identical in their non-sequence-number portions and differing only in the sequence number, and suppose there are indeed exactly three customers involved. There are then 3! = 6 equally likely ways of pairing the keys across the two systems, and only one of them is entirely correct, so the odds of matching them correctly are five to one, against!
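The combinatorics can be checked by brute force. This snippet, with made-up key labels, simply enumerates the possible cross-system pairings:

from itertools import permutations

# Three keys in system A and three in system B, identical except for their
# system-generated sequence numbers. Pairing them by sequence number alone
# amounts to picking one of the possible pairings blindly.
a_keys = ["A-001", "A-002", "A-003"]
pairings = [list(zip(a_keys, p)) for p in permutations(["B-001", "B-002", "B-003"])]

print(len(pairings))   # 6 possible pairings; only one assigns every key correctly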

 

Another way business keys can be compromised is when a source system uses a surrogate key, one which, for example, recycles every couple of years. By the time it recycles, all reference to the object it originally represented will have been archived out of that system. But if our target is a versioned table with a longer online lifecycle than the source table, then when the reused surrogate appears, it may still exist in the target table as an identifier for the previous object it designated.

 

In these ways, or in any of the other ways in which dirty data can make its way into the very business keys that identify the objects we are interested in, if we automatically search for a match on original insert transactions, we may inadvertently create homonyms, in which one row comes to represent more than one object. For this reason, original inserts must indicate whether or not match logic should be applied to them. But because the same semantics often need to be provided even when the target table is not versioned, requiring the match flag does not "tip our hand" and indicate that the target is a versioned table. The distinction between versioned and conventional tables remains hidden from the users and IT personnel who provide the source transactions.

 

Unreliable Business Keys: Why Assume a No-Match?

 

If we always assume a no-match when dealing with unreliable business keys, the effect is to turn every source update transaction into an insert. The result quickly becomes a situation in which there are a large number of rows for each object that is frequently updated. In business IT, we look for ways to keep these numbers down, but it's not our purpose to describe those heuristics here.
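Continuing the hypothetical sketches above, the effect of always suppressing the match is easy to see: a returning customer is never recognized, and every original insert mints a fresh EID.

# Usage sketch, reusing the OriginalInsert and resolve_eid sketches above.
existing_rows = [{"eid": "E100", "business_key": {"cust_nbr": "1234"}}]

returning = OriginalInsert(
    business_key={"cust_nbr": "1234"},
    attributes={"name": "A. Smith"},
    apply_match_logic=False,            # always assume a no-match
    effective_begin=date(2008, 6, 1),
)

print(resolve_eid(returning, existing_rows))   # a brand-new EID, not "E100"

Each such unrecognized reappearance is another synonym row for the same customer.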

 

But one could ask why the assumption must be that a match of unreliable business keys is a no-match. Perhaps application-specific circumstances mean that the odds are nine out of ten that such matching keys do in fact represent the same object. In that case, why not assume that they do?

 

The reason, as all reasons are in business IT, is a negative impact on the bottom line. In many cases, the cost of creating one mistaken homonym is higher than the cost of creating nine, or even a hundred, synonyms. In this case, "homonym" refers to one row representing multiple objects, and "synonym" refers to multiple rows representing the same object.

 

In these terms, the question, "Why assume a no-match?" can be rephrased as, "Why assume that homonyms are more costly than synonyms?" And the answer is that in general, they are. Here's why.

 

In the case of synonyms, when we discover one, we have to do what is called a "file merge." If two Client table rows are discovered to represent the same client, for example, we must replace them with a single row. As for source transactions, we must change the match logic so that pairs of transactions hitherto directed to the two different targets are pointed to the one new target that replaced them. As for RI dependencies, we must take all the children RI-dependent on either of the two original rows and change their foreign keys to point to the one new row that replaced them. But notice that all these changes can be done with code. There is no need for human judgment in the revised match logic, or in the new RI dependencies.
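A sketch of such a fully automated merge, with hypothetical row shapes and names (client_rows, child_rows, match_map), might look like this:

def merge_synonyms(surviving_eid, duplicate_eid, client_rows, child_rows, match_map):
    # 1. Collapse the two parent rows into one: retire the duplicate.
    client_rows = [row for row in client_rows if row["eid"] != duplicate_eid]

    # 2. Repoint RI-dependent children from the retired row to the survivor.
    for child in child_rows:
        if child["client_fk"] == duplicate_eid:
            child["client_fk"] = surviving_eid

    # 3. Redirect match logic: source transactions that used to resolve to the
    #    retired row now resolve to the survivor.
    for source_key, eid in match_map.items():
        if eid == duplicate_eid:
            match_map[source_key] = surviving_eid

    return client_rows, child_rows, match_map

Every step is mechanical; nothing in it calls for human judgment, which is exactly why merges are so much cheaper than splits.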

 

But such is not the case when we are dealing with homonyms. When we discover a homonym, we must do what is called a "file split." Say we split one client table row into two rows. The next time update transactions appear that had both been applied to the original row, how do we determine which of the two new clients to direct each one to? By the very fact that the homonym existed to begin with, we know that source systems have thus far failed to make the distinction. So the match logic must be updated to make a discrimination it previously did not, or could not, make. As for RI dependencies, with a single parent client row replaced by two (or more) client rows, how are we to know which new parent to redirect each RI child to?

 

Occasionally, the means to make these new discriminations will exist in the data, and so the new logic can be fully automated. But far more frequently, those means do not exist in the data. In that case, file splits can only be done if we can devote human resources to the one-time task of fixing foreign keys, and to the recurring task of augmenting match logic to make discriminations that cannot be made on the basis of the data alone.

 

In short, doing less discriminatory work is easy; doing more is hard. That is why homonyms cost more than synonyms. And that, finally, is why, in the absence of cost/benefit data to the contrary, we should handle unreliable source system business keys by assuming a no-match, and so permit duplicate rows into the target tables.

 

Sometimes, of course, a cost/benefit analysis in a specific situation will indicate that it is better to create the occasional homonym than it is to create a flurry of synonyms. But the default assumption always has to be that homonyms are to be avoided, and synonyms reluctantly tolerated.

 

We have spent this much time on the business key match problem because it is so important. In the past, the problem was often ignored, and each doubtful situation treated as an insert, as a new object situation. The reason was always the difficulty in producing reliable matches in the presence of match criteria data of poor quality, data which often came from outside our enterprise and so was beyond our ability to clean up, or in the absence of common match criteria across different source systems.

 

But throwing up our hands and populating target tables with an unknown and possibly significant number of duplicate rows, rows which represent the same object, is increasingly unacceptable. Increasingly, businesses believe that the high cost of solving the problem, or at least of reducing its impact, is less than the cost of living with it. We have several times used the example of merging customer files in the presence of unreliable source system keys. We chose this example because the high cost of living with un-merged customer data is most apparent in customer-facing systems.

 

We will have more to say about a general approach to keeping the cost of duplicate-containing tables to a minimum in a later set of articles. We think there is something of value to describe that will apply to nearly all situations in which unreliable business keys must be matched. While this is not a problem exclusive to versioned tables, neither are versioned tables exempt from it.

 

In the meantime, we will turn our attention back to inserts and upserts to temporal tables in our next article.
