Master data and reference data are two major data categories that are often thought of as the same. The reality is that they are quite different, even though they have strong dependencies on each other. Failure to recognize these differences is risky, particularly given the current explosion of interest in master data. Projects that approach master data as just "data" and fail to address its unique needs are likely to encounter problems. This is particularly true if such projects experience scope creep that leads them into the very different realm of reference data management.
Basic Categories of Data
Figure 1 shows how data can be separated into a number of different categories, arranged as layers. Perhaps the most important category is transaction activity data. This represents the transactions that operational systems are designed to automate. It is the traditional focus of IT and includes things such as orders, sales and trades. Below it is transaction audit data, which tracks the progress of an individual transaction; Web logs and database logs are examples. Just above transaction activity data is enterprise structure data. This is the data that represents the structure of the enterprise, particularly for reporting business activity by responsibility. It includes things such as organizational structure and charts of accounts. Enterprise structure data is often a problem because, when it changes, historical reporting becomes difficult. (For example, when a unit splits into two, each responsible for a distinct set of products, how do we compare their current product sales performance to their performance prior to the split?) Enterprise structure data is a subject in its own right, and, alas, there is not enough space to discuss it here.
Then we come to master data. Master data represents the parties to the transactions of the enterprise. It describes the things that interact when a transaction occurs. For instance, master data that represents product and customer must be present before a transaction can be executed to sell a product to a customer.
Reference data, by contrast, is any kind of data that is used solely to categorize other data found in a database, or solely for relating data in a database to information beyond the boundaries of the enterprise.
Figure 1: Categories of Data
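The distinction between these definitions can be made concrete with a small sketch. The Python below is only an illustration under assumed names: the classes, field names and sample values (CustomerType, "RET", "Widget" and so on) are hypothetical and do not come from any particular system. It shows reference data as a code and description, master data as the parties to a transaction, and a sale that cannot be recorded until the master data exists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomerType:
    """Reference data: a code and a description, used only to categorize."""
    code: str
    description: str


@dataclass(frozen=True)
class Customer:
    """Master data: a party to a transaction."""
    customer_id: str
    name: str
    type_code: str  # points at a CustomerType code (reference data)


@dataclass(frozen=True)
class Product:
    """Master data: the thing being sold."""
    product_id: str
    name: str


@dataclass(frozen=True)
class Sale:
    """Transaction activity data: one sale of a product to a customer."""
    customer_id: str
    product_id: str
    quantity: int


def record_sale(customers, products, customer_id, product_id, quantity):
    """Create a Sale only if both master data records already exist."""
    if customer_id not in customers:
        raise KeyError(f"unknown customer: {customer_id}")
    if product_id not in products:
        raise KeyError(f"unknown product: {product_id}")
    return Sale(customer_id, product_id, quantity)


# Hypothetical sample data, for illustration only.
customer_types = {"RET": CustomerType("RET", "Retail")}
customers = {"C1": Customer("C1", "Jane Doe", "RET")}
products = {"P1": Product("P1", "Widget")}

sale = record_sale(customers, products, "C1", "P1", 3)
```

Note that the customer type only categorizes the customer; it takes no part in the sale itself, which is the sense in which reference data is used "solely to categorize other data."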
Know Your Data
If we accept these definitions, there is a big difference between master data and reference data. Definitions, however, are a cloudy issue. Some people regard reference data as any data that is used in an application but not created in that application. Thus a sales application that gets product data from some other application can view the product data as reference data. Such loose usage is a big problem. If we do not have precise definitions for what we are talking about, then it is difficult even to exchange ideas, let alone implement solutions to the problems of either reference data or master data management. I have witnessed this in my work, and I have been particularly disappointed by projects where people with different and rather fuzzy ideas of what master data is try to work together. Unless everyone involved in such projects has a clear idea of what they are dealing with, they cannot understand where the boundaries of the projects lie, and they are forced back to very general and, frankly, fruitless approaches to whatever problems of reference or master data they are trying to solve.
The Difference of Identification
Let us look at some of the specific differences between reference and master data. Identification is a major one. In master data, the same entity instance, such as a product or customer, can be known by different names or IDs. For example, a product typically follows a lifecycle from a concept, to a laboratory project, to a prototype, to a production run, to a phase where it is supported under warranty, and perhaps to a phase of obsolescence where it may no longer be produced or supported but is still covered by product liability responsibilities. In each of these phases, the name of the product may change, and its product identifier may, too. For instance, Microsoft's Cairo project was eventually named Windows 2000. I worked at an organization that funded special projects, and the year was part of the project number. When a project took a long time to be formulated, management usually changed the year segment of the project number to give a more up-to-date impression. Beyond product, we are all aware that customers can change their names or share identical names, and we know how difficult it is for enterprises that interact with a large customer base to know which individual they are dealing with.
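One common way to cope with this kind of identification problem is to keep a cross-reference from every name or ID an entity has ever carried to a single stable internal key. The following Python is a minimal sketch of that idea, not a description of any particular master data product; the class name IdentityRegistry, its methods, and the sample project numbers and product name are all hypothetical.

```python
class IdentityRegistry:
    """Maps every known identifier of an entity to one stable internal key."""

    def __init__(self):
        self._key_by_identifier = {}
        self._next_key = 1

    def _claim(self, identifier, key):
        # An identifier may point at only one entity.
        if identifier in self._key_by_identifier:
            raise ValueError(f"identifier already in use: {identifier}")
        self._key_by_identifier[identifier] = key

    def register(self, identifier):
        """Create a new entity under its first identifier; return its key."""
        key = self._next_key
        self._claim(identifier, key)
        self._next_key += 1
        return key

    def add_identifier(self, key, identifier):
        """Record a later name or ID for an entity that already exists."""
        if key not in self._key_by_identifier.values():
            raise KeyError(f"unknown entity key: {key}")
        self._claim(identifier, key)

    def resolve(self, identifier):
        """Return the stable key for any identifier the entity has carried."""
        return self._key_by_identifier[identifier]

    def identifiers(self, key):
        """List all identifiers recorded for an entity, oldest first."""
        return [i for i, k in self._key_by_identifier.items() if k == key]


# Hypothetical example: a project number whose year segment was changed,
# followed by the name the product was eventually sold under.
registry = IdentityRegistry()
project = registry.register("2003-017")
registry.add_identifier(project, "2005-017")
registry.add_identifier(project, "Widget Pro")
```

With this in place, a lookup by the original project number, the renumbered one, or the eventual product name all lead to the same entity. The sketch deliberately ignores the harder half of the problem, which is deciding whether two records with similar or identical names refer to the same customer in the first place.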
By contrast, reference data typically has much less of a problem with identification. This is partly because the volumes of reference data are much lower than those of master data, and partly because reference data changes more slowly. The issues that do exist tend to revolve around the use of acronyms as codes. Reference data, such as product line, gender, country or customer type, often consists of a code, a description and little else. The code is usually an acronym, which is actually very useful, because acronyms can be used in system outputs, even views of data, and still be recognizable to users. Thus the acronym USA can be used instead of United States of America. Some IT staff try to replace acronyms in reference data with meaningless surrogate keys and think they are buying stability by doing so. In reality, they are causing more problems, because reference data is even more widely shared than master data, and when surrogate keys pass across system boundaries, their values must be translated into whatever identification scheme is used in the receiving system.
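The cost of surrogate keys at system boundaries can be illustrated with a short sketch. In the Python below, the two systems and their key assignments are invented for the example, and CAN is included only as a second sample code. An acronym code such as USA can be handed from one system to the other unchanged, whereas a surrogate key has to be translated through the code it stands for.

```python
# Reference data as the article describes it: a code and a description.
COUNTRIES = {
    "USA": "United States of America",
    "CAN": "Canada",
}

# Hypothetical surrogate keys assigned independently by two systems.
# The same number means a different country in each system.
SYSTEM_A_KEYS = {"USA": 1, "CAN": 2}
SYSTEM_B_KEYS = {"CAN": 1, "USA": 2}


def translate_key(key, source_keys, target_keys):
    """Convert a surrogate key from the source system to the target system."""
    code_by_key = {k: code for code, k in source_keys.items()}
    return target_keys[code_by_key[key]]


def describe(code):
    """An acronym code is readable on its own and needs no translation."""
    return f"{code}: {COUNTRIES[code]}"


key_in_b = translate_key(1, SYSTEM_A_KEYS, SYSTEM_B_KEYS)
label = describe("USA")
```

Passing the value 1 from the first system to the second without translation would silently turn the United States into Canada, which is the kind of problem the surrogate key was supposed to prevent.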