There is an increasing realization that master data is something enterprises must take control of if they are to share data effectively in applications, ranging from data warehouses to straight-through processing. A fundamental problem of master data is what identifies it. This is a major headache that places limits on how an enterprise can use master data as a corporate asset.

Before looking more closely at this problem, it is necessary to define master data. Master data is the data that represents the parties to the transactions that record the operations of an enterprise. Common examples include customer and product. Customer and product data must be established before a transaction to sell a product to a customer can be fired. IT has traditionally focused on the transactions and the data in them rather than on the data at higher levels of abstraction, such as master data, reference data and metadata. This is fine for stovepipe systems whose data is never shared, but it is a problem when any kind of data sharing or integration is attempted. It is even more of a problem when nobody knows what to call a particular item of master data.

What is a Widget?

Let us consider product as an example. You might think that every enterprise would uniquely identify each product in just one way. For a variety of reasons, this is rarely the case. A product typically goes through a lifecycle and is called different things at each stage in this lifecycle. It can be an idea, then a prototype. Perhaps the next stage is production of the product. Eventually, manufacture can cease, although warranty support is still offered. Finally, even this may end, but the enterprise may still have liability responsibilities for the product. During each of these steps in its lifecycle, the product may be represented by a different identifier. This sort of thing drives data administrators crazy. Their view is typically that a product should be uniquely identified by a single data value, and someone just needs to make a decision on what this will be.

It is never going to happen, and there are all kinds of reasons why. Often, product is identified by an intelligent key that simply must change with the lifecycle. I worked for an organization whose product was projects that they managed, and year was part of the identifier of the project number. By the time a project got into production, so many years had passed that they needed to revise the project number to refer to the current year. This looked good from a marketing and public relations perspective, but we were never able to match the projects prior to production with those in production because the project numbers had changed.

The Curse of Surrogate Keys

IT staff may blame business users for doing inconvenient things such as changing product identifiers, but IT staff often add to the overall confusion. Surrogate keys are one good way of creating chaos. More technical IT folks, especially programmers, love single column identifiers for master data. They think that surrogate keys make uniqueness easier to control as well as avoid problems with multicolumn keys. At least in theory, this solution also means that there is no need to change the identifier to conform with progression through product lifecycles. In reality, surrogate keys identify records in a database table, not instances of business entities such as individual products. They have no business meaning. Hence, when information leaves the boundaries of a given application, surrogate keys become useless, and we have to rely on other columns in our product table to figure out if a product in application X is the same as a product in application Y.

There is another problem with surrogate keys: they do not work well for human beings. They are wonderful things for computers and databases, but business users need to have more information attached to identifiers. Intelligent keys are identifiers that contain more than one piece of information. They can uniquely identify a product, but they can also convey additional meaning to anyone using them outside of a computer. Trying to work with surrogate keys is more difficult for business users because they look like random numbers. For instance, U.S. driver's license numbers have the license holder's birth date encoded in the license ID, which can be a useful reality check for anyone accepting a driver's license as valid identification. When intelligent keys are not used, there can be problems. I recently spoke with the staff of a major home builder that was having problems tracking customers' requests for upgrades. The codes that identified the upgrades for ceramic floors, faucets and paint colors were surrogate keys. The result was a high rate of transcription errors because the average salesperson could not tell from the code itself if they were dealing with a ceramic product, faucet hardware or paints. Had the codes been intelligent, the sales staff would have found it easier to perform reality checks, and there would have been fewer data quality problems. These kinds of issues are rarely recognized in technical debates over keys.

Competing Standards

Having standards is yet another way of identifying products. For instance, stocks, bonds and other financial instruments have an identifier called a CUSIP assigned to them. CUSIP stands for Committee on Uniform Securities Identification Procedures. The standard is owned by the American Bankers Association and operated by Standard & Poor's. The product master tables of brokerage houses are filled with information about different securities that are identified by their CUSIP numbers. Unfortunately, the CUSIP standard is not the only one. ISIN (International Securities Identification Number) and SEDOL (Stock Exchange Daily Official List) are other standards which include securities that do not have CUSIPs but also cover many securities that do have CUSIPs. This creates a big problem for any organization that wants to deal with a range of securities and interact with clients who may identify securities in ways that do not use CUSIPs. Once again, it seems that efforts to have one identifier for any product are thwarted.

There may even be mandated business requirements for multiple identifiers. Retailers often generate written sales quotations. A natural instinct for a customer in receipt of such a quotation is to visit other retailers in hopes of getting a better price. To preempt such behavior, retailers often take the product identifier of a supplier and give it an additional identifier that is internal to the retailer. The retailer's internal identifier plus a sufficiently vague description of the product is printed on the written quotation. This is bad news for any customer looking to do some easy comparison shopping.

Multiple Identifiers

The fact is, there are always going to be multiple ways to identify products. It is better for database designers, data administrators and enterprise data architects to embrace this reality rather than fight an unending war for purity of identification. Furthermore, it is not a huge leap to be able to implement designs that accommodate multiple identifiers for master data. Figure 1 shows a database design fragment that attempts to do this.

Figure 1: Handling Multiple Product Identifiers


In Figure 1, the Product table has a surrogate key as its identifier. The Product Identification Scheme table has one entry for each different scheme or standard for identifying a product. In the case of our security example, there could be one record for CUSIP, another for SEDOL and another for ISIN. To cover the product lifecycle, there could be one record for a laboratory ID when the product is in its conceptual stage, another record for when it is a prototype, another for when it is in production, another when it is no longer in production but still under warranty, and another when it is obsolete. The Product Identification table can have multiple records per Product. It associates the surrogate key used to identify records in the Product table with the different Product Identification Schemes used by the business. Each Product Identification Scheme must have a Product Identification Scheme Number that uniquely identifies the Product in question.

A nice aspect of this design is that there does not have to be a record in the Product Identification table if there is not one in the business. If a security has a CUSIP but not an ISIN, there is no record in the Product Identification table for the ISIN. If a product has a Laboratory ID but not a Prototype ID, there is no record on the Product Identification table for the prototype phase of the product lifecycle.  

Multiple identification of master data, like human nature, is an imperfect thing, but we are not going to be able to change it any time soon. The question, therefore, is how can we come to terms with it? The first step is to realize that multiple identification schemes do not exist just to make the lives of IT personnel difficult. There may be all kinds of business reasons why multiple identification exists.  

Multiple identification for master data exists and must be handled. Database designs can be created to implement this requirement. However, there are many more issues with master data management than multiple identification, and these other issues need to be tackled successfully if overall master data management is to be achieved. The good news is that the exploding interest in master data management is focusing attention on the entire spectrum of these issues, and we can look forward to some genuine advances in the field after many years of relative stagnation.


Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access