Before looking more closely at this problem, it is necessary to define master data. Master data is the data that represents the parties to the transactions that record the operations of an enterprise. Common examples include customer and product. Customer and product data must be established before a transaction to sell a product to a customer can be fired. IT has traditionally focused on the transactions and the data in them rather than on the data at higher levels of abstraction, such as master data, reference data and metadata. This is fine for stovepipe systems whose data is never shared, but it is a problem when any kind of data sharing or integration is attempted. It is even more of a problem when nobody knows what to call a particular item of master data.
What is a Widget?
Let us consider product as an example. You might think that every enterprise would uniquely identify each product in just one way. For a variety of reasons, this is rarely the case. A product typically goes through a lifecycle and is called different things at each stage in this lifecycle. It can be an idea, then a prototype. Perhaps the next stage is production of the product. Eventually, manufacture can cease, although warranty support is still offered. Finally, even this may end, but the enterprise may still have liability responsibilities for the product. During each of these steps in its lifecycle, the product may be represented by a different identifier. This sort of thing drives data administrators crazy. Their view is typically that a product should be uniquely identified by a single data value, and someone just needs to make a decision on what this will be.
It is never going to happen, and there are all kinds of reasons why. Often, product is identified by an intelligent key that simply must change with the lifecycle. I worked for an organization whose product was projects that they managed, and year was part of the identifier of the project number. By the time a project got into production, so many years had passed that they needed to revise the project number to refer to the current year. This looked good from a marketing and public relations perspective, but we were never able to match the projects prior to production with those in production because the project numbers had changed.
The Curse of Surrogate Keys
IT staff may blame business users for doing inconvenient things such as changing product identifiers, but IT staff often add to the overall confusion. Surrogate keys are one good way of creating chaos. More technical IT folks, especially programmers, love single column identifiers for master data. They think that surrogate keys make uniqueness easier to control as well as avoid problems with multicolumn keys. At least in theory, this solution also means that there is no need to change the identifier to conform with progression through product lifecycles. In reality, surrogate keys identify records in a database table, not instances of business entities such as individual products. They have no business meaning. Hence, when information leaves the boundaries of a given application, surrogate keys become useless, and we have to rely on other columns in our product table to figure out if a product in application X is the same as a product in application Y.
There is another problem with surrogate keys: they do not work well for human beings. They are wonderful things for computers and databases, but business users need to have more information attached to identifiers. Intelligent keys are identifiers that contain more than one piece of information. They can uniquely identify a product, but they can also convey additional meaning to anyone using them outside of a computer. Trying to work with surrogate keys is more difficult for business users because they look like random numbers. For instance, U.S. driver's license numbers have the license holder's birth date encoded in the license ID, which can be a useful reality check for anyone accepting a driver's license as valid identification. When intelligent keys are not used, there can be problems. I recently spoke with the staff of a major home builder that was having problems tracking customers' requests for upgrades. The codes that identified the upgrades for ceramic floors, faucets and paint colors were surrogate keys. The result was a high rate of transcription errors because the average salesperson could not tell from the code itself if they were dealing with a ceramic product, faucet hardware or paints. Had the codes been intelligent, the sales staff would have found it easier to perform reality checks, and there would have been fewer data quality problems. These kinds of issues are rarely recognized in technical debates over keys.
Having standards is yet another way of identifying products. For instance, stocks, bonds and other financial instruments have an identifier called a CUSIP assigned to them. CUSIP stands for Committee on Uniform Securities Identification Procedures. The standard is owned by the American Bankers Association and operated by Standard & Poor's. The product master tables of brokerage houses are filled with information about different securities that are identified by their CUSIP numbers. Unfortunately, the CUSIP standard is not the only one. ISIN (International Securities Identification Number) and SEDOL (Stock Exchange Daily Official List) are other standards which include securities that do not have CUSIPs but also cover many securities that do have CUSIPs. This creates a big problem for any organization that wants to deal with a range of securities and interact with clients who may identify securities in ways that do not use CUSIPs. Once again, it seems that efforts to have one identifier for any product are thwarted.