Master Data versus Reference Data

  • April 01 2006, 1:00am EST

There is a tendency in IT to use generic approaches when managing the different aspects of the infrastructure we have to deal with. This is at odds with the increasing specialization we see all around us. Whether we are dealing with data, software, hardware, projects or anything else, the probability is that we will be working with something much narrower in scope than was usual just a few years ago. One-size-fits-all approaches, or very general advice that amounts to little more than "do the right thing," are unlikely to yield the desired results in such situations. Data is now being recognized not as a homogeneous domain but as a set of distinct categories, each with its own unique characteristics. If these characteristics are not properly understood, practitioners will only be able to use generic approaches that are unlikely to deliver the desired results and may even be doomed to failure from the start.

Master data and reference data are two major data categories that are often thought of as the same. The reality is that they are quite different, even though they have strong dependencies on each other. Failure to recognize these differences is risky, particularly given the current explosion of interest in master data. Projects that approach master data as just "data" and fail to address its unique needs are likely to encounter problems. This is particularly true if such projects experience scope creep that leads them into the very different realm of reference data management.

Basic Categories of Data

Figure 1 shows how data can be separated into a number of different categories, arranged as layers. Perhaps the most important category is the transaction activity data. This represents the transactions that operational systems are designed to automate. It is the traditional focus of IT, including things such as orders, sales and trades. Below it is transaction audit data, which is data that tracks the progress of an individual transaction, such as Web logs and database logs. Just above transaction activity data is enterprise structure data. This is the data that represents the structure of the enterprise, particularly for reporting business activity by responsibility. It includes things such as organizational structure and charts of accounts. Enterprise structure data is often a problem because when it changes it becomes difficult to do historical reporting. (For example, when a unit splits into two, each responsible for a distinct set of products, how do we compare their current product sales performance to their performance prior to the split?) Enterprise structure data is a subject in its own right, which, alas, there is not enough space to discuss here.

Then we come to master data. Master data represents the parties to the transactions of the enterprise. It describes the things that interact when a transaction occurs. For instance, master data that represents product and customer must be present before the transaction is fired to sell a product to a customer. Reference data is any kind of data that is used solely to categorize other data found in a database, or solely for relating data in a database to information beyond the boundaries of the enterprise.

Figure 1: Categories of Data

Know Your Data

If we accept these definitions, there is a big difference between master data and reference data. Definitions, however, are a cloudy issue. Some people tend to regard reference data as any data used in an application, but not created in that application. Thus a sales application that gets product data from some other application can view the product data as reference data. Such loose usage is a big problem. If we do not have precise definitions for what we are talking about, then it is difficult to even exchange ideas, let alone implement solutions to address problems of either reference data or master data management. I have witnessed this in my work, and I have been particularly disappointed by projects where people with different and rather fuzzy ideas of what master data is try to work together. Unless everyone involved in such projects has a clear idea of what they are dealing with, they cannot understand where the boundaries of the projects lie, and they are forced back to very general and, frankly, fruitless approaches to whatever problems of reference or master data they are trying to solve.

The Difference of Identification

Let us look at some of the specific differences between reference and master data. Identification is a major one. In master data, the same entity instance, such as a product or customer, can be known by different names or IDs. For example, a product typically follows a lifecycle from a concept to a laboratory project to a prototype to a production run to a phase where it is supported under warranty and perhaps to a phase of obsolescence where it may no longer be produced or supported but is still covered by product liability responsibilities. In each of these phases, the name of the product may change, and its product identifier may, too. For instance, Microsoft's Cairo project was eventually named Windows 2000. I worked at an organization that funded special projects, and the year was part of the project number. When a project took a long time to be formulated, management usually changed the year node in the project number to give a more up-to-date impression. Beyond product, we are all aware that customers can change their names, or have identical names, and how difficult it is for enterprises that interact with a large customer base to know which individual they are dealing with.
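One way to picture the master data identification problem is as a registry that lets many names or IDs resolve to a single entity. The Python sketch below is only an illustration under stated assumptions: the class names, the internal `entity_key` and the project numbers are all hypothetical, and a real master data system would need far more (matching rules, history, merge and split handling).

```python
from dataclasses import dataclass, field


@dataclass
class MasterRecord:
    """One real-world entity with every identifier it has carried."""
    entity_key: int            # hypothetical internal key, never shown to users
    current_name: str
    aliases: set[str] = field(default_factory=set)


class AliasRegistry:
    """Resolves any known name or ID to the single master record behind it."""

    def __init__(self) -> None:
        self._records: dict[int, MasterRecord] = {}
        self._by_alias: dict[str, int] = {}

    def register(self, entity_key: int, name: str) -> MasterRecord:
        record = MasterRecord(entity_key, name, {name})
        self._records[entity_key] = record
        self._by_alias[name] = entity_key
        return record

    def rename(self, old_alias: str, new_name: str) -> None:
        # The old identifier stays resolvable; only the current name changes.
        key = self._by_alias[old_alias]
        record = self._records[key]
        record.aliases.add(new_name)
        record.current_name = new_name
        self._by_alias[new_name] = key

    def resolve(self, alias: str) -> MasterRecord:
        return self._records[self._by_alias[alias]]


# Hypothetical project whose number had its year node updated.
registry = AliasRegistry()
registry.register(1, "PRJ-2004-017")
registry.rename("PRJ-2004-017", "PRJ-2005-017")
```

Here both `registry.resolve("PRJ-2004-017")` and `registry.resolve("PRJ-2005-017")` return the same record, which is the property an enterprise needs when a product or customer is known by several identifiers over its lifecycle.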

By contrast, reference data typically has much less of a problem with identification. This is partly because the volumes of reference data are much lower than what is involved in master data and because reference data changes more slowly. Existing issues tend to revolve around the use of acronyms as codes. Reference data, such as product line, gender, country or customer type, often consists of a code, a description and little else. The code is usually an acronym, which is actually very useful, because acronyms can be used in system outputs, even views of data, and still be recognizable to users. Thus the acronym USA can be used instead of United States of America. Some IT staff try to replace acronyms in reference data with meaningless surrogate keys, and think they are buying stability by such an approach. In reality, they are causing more problems because reference data is even more widely shared than master data, and when surrogate keys pass across system boundaries, their values must be changed to whatever identification scheme is used in the receiving system.
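The cost of surrogate keys crossing system boundaries can be shown with a small Python sketch. It assumes two hypothetical systems that each assigned their own surrogate keys to the same country rows; the key values and record layouts are invented for illustration.

```python
# Each system assigned its own surrogate keys to the same country rows.
system_a_countries = {1: "USA", 2: "CAN", 3: "MEX"}    # surrogate key -> code
system_b_countries = {10: "CAN", 20: "MEX", 30: "USA"}


def translate_surrogate(key_in_a: int) -> int:
    """Map a surrogate key from system A to system B via the shared code."""
    code = system_a_countries[key_in_a]
    code_to_key_b = {c: k for k, c in system_b_countries.items()}
    return code_to_key_b[code]


# A record keyed by surrogate must be rewritten when it moves to system B.
customer_in_a = {"name": "Customer A", "country_key": 1}
customer_in_b = {
    **customer_in_a,
    "country_key": translate_surrogate(customer_in_a["country_key"]),
}

# With the acronym itself as the key, the record crosses unchanged
# and remains readable in any output.
customer_with_code = {"name": "Customer A", "country_code": "USA"}
```

Note that the translation step itself has to go through the acronym code, which suggests the code was the stable identifier all along.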

Thus we can see that in the area of identification, quite dissimilar problems exist if we compare master data to reference data. A single approach is never going to adequately address identification problems in both categories of data.

The Problem of Meaning

Reference data has one unique property that it shares with metadata but which is totally lacking in master data. This is semantic meaning at the row level. We are all accustomed to the idea that metadata items, such as an attribute of an entity in a logical data model (or column of a table in a physical database) have definitions. It is a little less obvious that items of reference data also have definitions. For instance, what is the definition of USA in a country code table? Does it include Puerto Rico, Guam or the U.S. Virgin Islands? For some enterprises, it may only be the lower 48 states. Consider a database table of customer credit category. It may have rows for platinum, gold, silver, bronze and plutonium. The definitions of these rows are very important for interpretation of reports that are organized by customer credit category and for understanding what business rules may be triggered when a customer is assigned a particular customer credit category.
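A minimal way to capture row-level meaning is to store a definition alongside each code and description. The Python sketch below assumes a hypothetical `ReferenceRow` structure, and the definition text is invented for illustration; each enterprise would have to write its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceRow:
    code: str
    description: str
    definition: str   # what this row means to this enterprise


# Hypothetical country code table with an explicit, enterprise-specific definition.
country_codes = {
    "USA": ReferenceRow(
        code="USA",
        description="United States of America",
        definition=(
            "The 50 states and the District of Columbia. "
            "Puerto Rico, Guam and the U.S. Virgin Islands are excluded."
        ),
    ),
}


def define(code: str) -> str:
    """Return the stored definition for a reference data code."""
    return country_codes[code].definition
```

Anyone reading a report organized by country can then call `define("USA")` rather than guess what the code covers.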

By contrast, definitions are meaningless for individual rows of master data. Customer A is just Customer A, and Product X is just Product X. Rows of master data do not have meanings. On the other hand, there can be huge disputes about meaning when it comes to the entity level in master data. What is a customer? What is a product? I would love to know how many millions of dollars have been wasted trying to get single enterprise-wide answers to these questions. It is a little like chasing rainbows. The reality is that the definition of master data entities depends on context. A marketing department may view prospects as customers, whereas for accounts receivable, a customer may only be somebody who has paid for a purchase. Understanding and managing these contexts and the various definitions that go with them is a major challenge in master data management.
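Context-dependent entity definitions can be sketched as one rule per context rather than a single enterprise-wide rule. In the Python example below, the `Party` fields and the two rules are assumptions chosen to mirror the marketing and accounts receivable views described above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Party:
    name: str
    is_prospect: bool
    paid_purchases: int


# One definition of "customer" per context (hypothetical rules).
customer_definitions: dict[str, Callable[[Party], bool]] = {
    "marketing": lambda p: p.is_prospect or p.paid_purchases > 0,
    "accounts_receivable": lambda p: p.paid_purchases > 0,
}


def is_customer(party: Party, context: str) -> bool:
    """Apply the definition of customer that holds in the given context."""
    return customer_definitions[context](party)


lead = Party(name="Customer A", is_prospect=True, paid_purchases=0)
```

The same party, `lead`, counts as a customer in the `"marketing"` context but not in `"accounts_receivable"`, and neither answer is wrong.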

Therefore, semantic issues are yet another significant difference between master data and reference data. The problem of getting, storing and making available definitions for individual rows of reference data is not the same as the need to understand the contexts and related definitions at the entity level in master data. These diverse challenges require very different solutions.

Links between Master and Reference Data

There are many other detailed differences between master and reference data, but there are also important linkages that complicate management approaches. Perhaps the most critical is the integration of the update cycle. When a new product, customer or other item of master data is introduced, there is always a possibility that new reference data will be required. Perhaps a new product will require a new product line or product category. This is particularly true in enterprises where the operationalization of the business is not well integrated with information systems. In such cases, nobody really understands the impact on data of introducing a new product as part of an entirely new product line. Perhaps it is not fair to blame business personnel for this, because they typically have enough problems to deal with in such a situation. Notwithstanding, in such cases, the addition of a new product record now requires the coordinated addition of a new product line record. If there is complete separation of master and reference data management, this can be a nightmare. It is particularly true if master and reference data are so separated that updates occur in different databases at different times. The result can be "orphan" product line and product records floating around the databases of the enterprise for a period of time.
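When master and reference data do live in the same database, the coordinated update can be expressed as a single transaction. The sketch below uses Python's standard `sqlite3` module; the table names, column names and sample values are hypothetical, and it does not address the harder case of updates spread across separate databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces foreign keys only when enabled
conn.execute(
    "CREATE TABLE product_line (code TEXT PRIMARY KEY, description TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE product ("
    " product_id TEXT PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " product_line_code TEXT NOT NULL REFERENCES product_line(code))"
)


def add_product_with_line(product_id, name, line_code, line_description):
    """Insert the product and, if needed, its product line in one transaction."""
    with conn:  # commits on success, rolls back if any statement fails
        conn.execute(
            "INSERT OR IGNORE INTO product_line (code, description) VALUES (?, ?)",
            (line_code, line_description),
        )
        conn.execute(
            "INSERT INTO product (product_id, name, product_line_code)"
            " VALUES (?, ?, ?)",
            (product_id, name, line_code),
        )


add_product_with_line("P-100", "Product X", "NEWLINE", "New product line")
```

Because both inserts share one transaction, a failure in either leaves neither row behind, and the foreign key prevents a product from referring to a product line that does not exist.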

What this shows is that while it is important to understand the differences between reference and master data, we must still think carefully about enterprise information architecture as a whole. The specific approaches that can solve the specific problems of master and reference data management must be set within a strategy of the overall management of enterprise information. None of this is particularly easy. What is exciting about the present moment in information management is that closer attention is being paid to these difficult problems, and the will and resources exist to try to solve them. Hopefully in the next few years we will see conceptual and practical innovations in master and reference data management that will be of enormous benefit to the enterprises that adopt them.
