Many enterprises are undertaking primary key reengineering projects for their core business tables. These tables keep information on an enterprise's suppliers, partners, customers, orders, locations, employees, and so forth. The purpose of these projects is to replace the "intelligent keys" of these tables with "unintelligent" keys.
The cost of these reengineering projects is often very high, and for one "key" reason: when the primary keys of a table are changed, all their foreign key instances must be changed, too. And the foreign keys of an enterprise's core tables are everywhere! They occur in other tables in the same database and also in tables in other databases such as the enterprise data warehouse, downstream data marts, operational data stores and online or batch transaction processing systems. They may even occur in the databases of other companies with which the enterprise does business.
Referential integrity is usually enforced only within the physical database that contains the core table itself. But the instances in tables in other databases are also foreign keys, and they must be changed, too. Moreover, foreign keys exist not just in these online databases, but also in off-line electronic archives, not to mention microfiche and hard copy, where, of course, they cannot be changed.
Consider the effort to propagate changes in just one table's primary keys to their foreign keys in that table's own database and to their foreign keys in physically distinct databases. Ideally, a change to a primary key value should be done as one atomic transaction; but this usually isn't possible when large, physically distinct databases are involved. So there is a period of time, often weeks or months if many databases are involved, over which the changes take place. During this time, joins will fail for the foreign key instances that haven't yet been changed. Consequently, the databases will report incorrect information when queried.
If we want to avoid reporting incorrect information on important business objects, we must use some kind of cross-reference table until all foreign key instances have been changed to the new values. Cross-reference tables track which foreign keys have been changed to the new values and which ones have not. They are useful only if all queries that contain joins based on the foreign keys are altered to add an intermediate join to the cross-reference table; and this can be a very expensive process. What's more, after all of the foreign key instances have been changed, the cross-reference table must be discarded. Consequently, the queries have to be changed back to make the direct join and not use the cross-reference intermediary. Again and again, with one database after another, this lengthy and expensive process must take place until all the foreign key instances of all the primary keys of that core table have been changed.
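The mechanics of the intermediate join can be sketched with an in-memory database. This is my illustration, not the article's; the table and column names (`customer`, `orders`, `cust_xref`) are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Customer table already migrated to the new primary keys.
cur.execute("CREATE TABLE customer (cust_id TEXT PRIMARY KEY, name TEXT)")
# Order table still carries the old foreign key values.
cur.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, old_cust_key TEXT)")
# Cross-reference table records the old-key -> new-key mapping.
cur.execute("CREATE TABLE cust_xref (old_key TEXT PRIMARY KEY, new_key TEXT)")

cur.execute("INSERT INTO customer VALUES ('K1001', 'Acme')")
cur.execute("INSERT INTO orders VALUES (1, 'ATL0001')")
cur.execute("INSERT INTO cust_xref VALUES ('ATL0001', 'K1001')")

# During the migration window, a direct join on the old key finds nothing:
direct = cur.execute(
    "SELECT c.name FROM orders o JOIN customer c ON c.cust_id = o.old_cust_key"
).fetchall()

# The query rewritten with an intermediate join through the
# cross-reference table still returns the right customer:
via_xref = cur.execute(
    """SELECT c.name
       FROM orders o
       JOIN cust_xref x ON x.old_key = o.old_cust_key
       JOIN customer c ON c.cust_id = x.new_key"""
).fetchall()
print(direct, via_xref)  # [] [('Acme',)]
```

Once every `orders.old_cust_key` value has been replaced, the extra join must be removed again from every such query, which is exactly the round trip the article describes.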
Thus, reengineering the primary keys of just one core table, for a large enterprise, can take a year or more and millions of dollars. I'll call the factor that accounts for this tremendous time and cost "the foreign key ripple effect." On the bottom line, it outweighs all other costs of reengineering primary keys by perhaps an order of magnitude.
With this kind of cost for these reengineering efforts, we'd better do it right the first time! However, every primary key reengineering project that I have seen or heard of is doing it the wrong way, by reengineering to unintelligent keys only. These reengineering projects will only solve part of the problem because they are based on an incomplete understanding of the semantic requirements for unique identifiers in databases. If these projects, whose objective is to reengineer from intelligent keys to unintelligent keys, are allowed to continue, they will have to be followed by a second round of equally costly projects. The second time, the objective will be to reengineer from unintelligent keys to what I call "enterprise keys."
The good news is that we can reengineer from intelligent keys directly to enterprise keys. We don't have to first reengineer to unintelligent keys and then reengineer a second time to enterprise keys. In this series, I will show that the problem is indeed as serious as I have said and that it is indeed avoidable. I will also show how to avoid it.
Entity and Object Integrity
Basically, the problem is that both intelligent and unintelligent keys satisfy the entity integrity constraint, but neither satisfies the constraint of "object integrity." Entity integrity is the constraint that the primary key values in a relational table must be unique across all of its rows. Object integrity, as this article will explain, subsumes entity integrity under a stronger requirement. It requires that every primary key instance must be different from the other primary key instances of its own table and also from all primary key instances of an entire set of other tables.
That entire set of tables, across which a primary key value must be unique, forms a "namespace." For our purposes, a namespace is simply a set of tables across which the identifier for every row must be unique. This namespace might be a single physical database; but the larger the namespace, the greater the benefits of object integrity. So an enterprise would do well to include all of its principal databases in its initial namespace. Later, a namespace shared with the enterprise's principal suppliers, partners and customers could be created at little incremental cost.
Today's RDBMSs do not enforce object integrity. Consequently, a database can contain two or more rows (each in a different table) which happen to have the same value for their primary keys. It happens this way. Sometimes, the primary keys of two or more tables in a database have the same "syntax" (i.e., the same data type and length) and also identical or overlapping domains (permissible values). When we have these two conditions, the primary key of a row from any of these tables may have the same value as a primary key from a row of any of the other tables. For example, if a purchase order table and an invoice table both have primary keys consisting of a date plus a sequence number, then every day there will be n pairs of orders and invoices with identical primary keys, where n is the smaller of the number of orders or of the number of invoices produced that day.
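The purchase order/invoice collision, and a namespace-wide check for it, can be sketched in a few lines. The key format (date plus sequence number) follows the article's example; the check function is my illustration:

```python
# Two tables whose keys share the same syntax (date + sequence number)
# and overlapping domains can collide: each key is unique within its
# own table (entity integrity holds), yet identical values appear
# across tables (object integrity does not).
purchase_orders = {"20000301-0001", "20000301-0002"}
invoices = {"20000301-0001", "20000301-0003"}

def object_integrity_violations(*key_sets):
    """Return key values that appear in more than one table's key set."""
    seen, dupes = set(), set()
    for keys in key_sets:
        for k in keys:
            if k in seen:
                dupes.add(k)
        seen |= keys
    return dupes

violations = object_integrity_violations(purchase_orders, invoices)
print(violations)  # {'20000301-0001'}
```

A relational DBMS enforcing only entity integrity would accept both tables without complaint; the duplicate surfaces only when the whole namespace is examined at once.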
Enforcing object integrity instead of just entity integrity is one of the ways in which relational databases must become more object-like. The object integrity constraint for primary keys is included in the draft SQL3 standard, although the standard does not require it to be used. It will be several years before its implementation is both ubiquitous and standard. In the meantime, expensive reengineering projects are in planning or are already underway.
Let's see, first, why unintelligent keys are an improvement over intelligent keys and then why enterprise keys are an improvement over unintelligent keys. After that, we'll see how to implement enterprise keys.
For many years, intelligent keys seemed like a good idea. An example is a customer number. See Figure 1.
|cust-nbr    key             |
|other attributes            |

Figure 1: Intelligent Key Customer Numbers
Let's suppose that the first three characters of cust-nbr indicate the city of the customer's bill-to address and the last four are a sequence number to make the key unique. By now, data modelers realize that intelligent keys like this aren't such a good idea. The problem in this example is that when you try to create the 10,000th customer for any city, you can't do it because there is no more room in the sequence number range. The only solution is to do some kind of "split."
One kind of split lets the intelligent key get "dumber." For example, "ATL" may remain "ATL" for the existing 9,999 customers in Atlanta, and "AT2" can be used for the next 9,999 customers. But now the key isn't so intelligent anymore, because the distinction between "ATL" and "AT2" doesn't mean anything. These are just two synonyms for Atlanta.
Synonyms in data always create problems. Here, for example, all queries and reports that look for or that count the number of customers in Atlanta must now use "'ATL' OR 'AT2'" in their WHERE clauses, instead of just "ATL." As customers in other major cities reach the 10,000 mark, the problem only gets worse. When a city reaches 20,000 customers, the problem occurs again.
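The query burden that synonyms impose can be sketched concretely. The key values here are hypothetical, in the article's "ATL"/"AT2" format:

```python
# After a "dumbing" split, every query over Atlanta customers must
# enumerate all synonym prefixes, and the list grows with each split.
ATLANTA_SYNONYMS = ("ATL", "AT2")  # a third synonym arrives at 20,000 customers

customers = ["ATL0001", "AT20001", "CHI0001", "ATL9999"]

# Before the split, a single prefix test sufficed ("ATL"); now every
# query and report must OR together all known synonyms:
atlanta_count = sum(
    1 for c in customers if any(c.startswith(p) for p in ATLANTA_SYNONYMS)
)
print(atlanta_count)  # 3
```

The equivalent SQL predicate grows the same way: `WHERE city_code = 'ATL'` becomes `WHERE city_code = 'ATL' OR city_code = 'AT2'`, in every query that touches the table.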
A second kind of split keeps the intelligence in the intelligent key. For example, "ATL" can change its meaning to designate just the city of Atlanta, and "ATS" can be added to handle customers who are in the Atlanta metropolitan statistical area, but outside the city limits.
One problem with this kind of split is that the meaning of "ATL" has changed. For all queries and reports that ask for customers in Atlanta, we must now decide whether we want just customers inside the city of Atlanta, just the non-city metropolitan area customers or both.
Another problem is that by now the customer numbers for those first 9,999 Atlanta customers are embedded in the enterprise's databases, and a large number of them are going to have to change. All existing Atlanta customers who live outside the city limits will have to change to an "ATS" customer number. They will have to change wherever they occur; and this foreign key ripple effect is often a tidal wave, capable of swamping a project well before it reaches its destination.
For these reasons, most data modelers now realize that we need "unintelligent" keys; and we persuade our companies to fund projects to create a new key, a new customer key in this case.
In large enterprises, these projects can be very expensive. The result is a key that is "impervious to change" because it doesn't describe the customer in any way. Because it contains no descriptive information, the user has no reason to request a change to it. If the systems which access customer information are reengineered to use the new unintelligent key correctly, that new key can even be completely hidden from the user.
The user, however, will still want his "business key." He'll say, "I don't know anything about these fancy 'unintelligent keys,' and I don't want to know. What I do know is that I'm not going to give up my customer number! I just want to enlarge the sequence number part to eight positions. That should give me plenty of room to grow."
The information technology (IT) department isn't going to win this battle. Nor should it. Instead, as part of the project to create an unintelligent key for customers, IT should keep the original customer number and just take away its role as a primary key. However, the database administrator (DBA) should define a unique index on the non-key customer number column, because uniqueness is still a requirement for it.
Now we have a business key and a primary key. The business key is how the user identifies a customer. The primary key is how the DBMS identifies the customer and also how it implements joins to customer-related tables. The business key is not used as a foreign key, but it can continue to be used as a business-perspective unique identifier. The result is a customer table as shown in Figure 2.
|cust-id     id                        |
|cust-nbr    business key              |
|            (non-key but unique)      |
|other attributes                      |

Figure 2: Customer Table with a Business Key and an Unintelligent Primary Key
Our original 9,999 customers in Atlanta can now keep their original customer numbers, as far as what the user can see. Internally, however, the customer number business key is defined as a char(11), and the original numbers are padded with four spaces. Our 10,000th Atlanta customer would receive business key "ATL10000 " (padded with three spaces).
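The schema and the padding rule can be sketched together. This is my illustration of the arrangement the article describes; the column names (`cust_id`, `cust_nbr`) are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Unintelligent primary key plus a non-key, uniquely indexed business key.
cur.execute("""CREATE TABLE customer (
    cust_id   INTEGER PRIMARY KEY,   -- unintelligent key, hidden from users
    cust_nbr  CHAR(11) NOT NULL,     -- the business key the user keeps
    name      TEXT
)""")
# Uniqueness is still a requirement for the business key, so the DBA
# defines a unique index on the non-key column:
cur.execute("CREATE UNIQUE INDEX cust_nbr_ux ON customer (cust_nbr)")

def pad_business_key(city, seq):
    """Pad an old-format key out to char(11): 3-char city + up to 8 positions."""
    return (city + seq).ljust(11)

cur.execute("INSERT INTO customer VALUES (1, ?, 'First Atlanta customer')",
            (pad_business_key("ATL", "0001"),))      # padded with four spaces
cur.execute("INSERT INTO customer VALUES (2, ?, '10,000th Atlanta customer')",
            (pad_business_key("ATL", "10000"),))     # padded with three spaces

key = cur.execute("SELECT cust_nbr FROM customer WHERE cust_id = 2").fetchone()[0]
print(repr(key))  # 'ATL10000   '
```

Foreign keys in related tables would now carry `cust_id` values, which never change, while `cust_nbr` remains the identifier users see and type.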
This solves our problems with the intelligent customer number key. The user gets to keep his intelligent customer number and gets the sequence number part of it enlarged, just as he requested. The database and the SQL statements which access it, on the other hand, now have an identification mechanism (primary keys) and also a relationship mechanism (foreign keys) which are impervious to change. Because they do not use the unintelligent primary and foreign keys, the users have no interest in changing them. Indeed, since they are merely implementation mechanisms of the DBMS, the users shouldn't even see them or know of their existence.
These days, most modelers would say that unintelligent keys are the kind of keys we need for our relational tables. The reengineering costs to convert to unintelligent keys will be high, but our predecessors made a mistake by using intelligent keys. We know that it must be corrected; and large enterprises are undertaking multimillion dollar reengineering projects for each of their major entity types: suppliers, partners, customers, products, personnel, organizational structures, locations, purchase orders and invoices.
What they really need, as the primary keys of their core tables, are enterprise keys. These are primary keys which have all the semantic features of object identifiers (OIDs). As such, these enterprise keys will fulfill what I call the "object integrity" constraint. This constraint must replace the entity integrity constraint which all today's relational DBMSs support.
What are these features of relational primary keys that bestow object integrity on them? There are three of them.
- There must be no descriptive information in the key.
- The key must provide unique identification across an entire database, or even across a large number of databases, not just across the rows of a single table.
- There must be no type information in the key. In object contexts, this means that the key must not indicate the class of the object. In relational contexts, as we shall see, this means that it must be possible to create supertype/subtype hierarchies without requiring changes to the syntax of the primary key of any table or changes to the primary key values of any of the rows in those tables.
Unintelligent keys fulfill the first criterion, but not the other two; and it is that failure that makes them unacceptable as semantically complete identifiers.
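To make the second and third criteria concrete, here is a minimal sketch of a key allocator that satisfies them. This is my illustration only, not the article's enterprise-key design, which Part 2 describes; it assumes a single shared sequence for the whole namespace:

```python
import itertools

# One sequence shared by every table in the namespace: a key value can
# never repeat across tables (criterion 2), and nothing in the key
# reveals which table, or type, a row belongs to (criterion 3).
_namespace_sequence = itertools.count(1)

def next_key():
    """Allocate the next namespace-wide unique, type-free key."""
    return next(_namespace_sequence)

customer_key = next_key()
order_key = next_key()
invoice_key = next_key()

# The three keys are distinct even though they identify rows in three
# different tables; contrast the purchase order/invoice collision earlier.
print(customer_key, order_key, invoice_key)  # 1 2 3
```

Because the key carries no type information, a row could later be moved into a subtype table, or the table split into a supertype/subtype hierarchy, without changing any key value.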
In Part 2 of this series in the March 2000 issue of DM Review, we'll see why this is so and how enterprise keys solve this problem.