
Normalizing Reference Data

  • April 01 2004, 1:00am EST

For many years, the principles of normalizing database design have been well understood and are increasingly practiced. Yet many databases are created in third normal form, or even higher normal forms, only to be populated with reference data in a chaotic and nonstandard manner. This can have a negative impact that offsets the main advantage of a normalized database design, which is that the database design only needs to change if there are changes in the subject area that it supports. A strong case can be made that reference data needs to be normalized in its own special way if a database is to provide the best possible return on the investment in it.

The term reference data means different things to different people. The definition used here is: Reference data is any kind of data that is used solely to categorize other data found in a database, or solely for relating data in a database to information beyond the boundaries of the enterprise.

This definition means that reference data tables are what are commonly called "lookup tables" or "domain values." These are typically tables with a key column containing a code and another column containing a description. Sometimes reference data tables can have additional columns, and these may include foreign keys from other reference data tables. Examples of reference data tables include country, currency, customer type and product category. Reference data tables rarely contain more than 200 records and often contain fewer than 20.
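As a minimal sketch of the shape such tables take (the table, column and code values here are illustrative, and SQLite is used only as a convenient medium), a typical reference data table pairs a code with its description:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE currency (
        currency_code TEXT PRIMARY KEY,   -- e.g. 'USD'
        description   TEXT NOT NULL       -- e.g. 'US Dollar'
    )
""")
conn.executemany(
    "INSERT INTO currency VALUES (?, ?)",
    [("USD", "US Dollar"), ("EUR", "Euro"), ("GBP", "Pound Sterling")],
)
# Other tables carry only the code; the description lives once, in the lookup.
row = conn.execute(
    "SELECT description FROM currency WHERE currency_code = ?", ("EUR",)
).fetchone()
print(row[0])  # Euro
```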

Reference data has an importance that approaches meta data in terms of the significance of individual data values. In part, this is because reference data is used to control business rules. If a business rule includes an actual data value, this will nearly always be reference data (e.g., a code). Another reason that reference data is important is that it defines both formal subtypes and hidden subtypes. Formal subtypes use category discriminators to identify the subtype to which records in the supertype belong. The category discriminators' values are housed in reference data tables. However, hidden subtypes are much more frequent in databases. These are subsets of records in a database table that are processed in ways that are different from the other records in a table. Perhaps a unique set of business rules applies to them or the records are reported separately from the other records in the tables. Hidden subtypes are identified by reference data values. Many important tables, such as customer or product, are surrounded by clusters of many reference data tables so that the hidden subtypes in product or customer can be identified and processed.
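A hidden subtype, in these terms, is simply the subset of rows selected by a particular reference data value. A small sketch (table names and codes are illustrative) shows how such a subset is carved out of a product table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_category (code TEXT PRIMARY KEY, description TEXT);
    INSERT INTO product_category VALUES ('STD', 'Standard'), ('HAZ', 'Hazardous');
    CREATE TABLE product (
        product_id    INTEGER PRIMARY KEY,
        name          TEXT NOT NULL,
        category_code TEXT REFERENCES product_category(code)
    );
    INSERT INTO product VALUES (1, 'Solvent', 'HAZ'), (2, 'Rope', 'STD');
""")
# The hidden subtype is the subset picked out by the reference data value;
# special business rules or separate reports would apply to these rows.
hazardous = conn.execute(
    "SELECT name FROM product WHERE category_code = 'HAZ'"
).fetchall()
print(hazardous)  # [('Solvent',)]
```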

Reference data, therefore, acts like meta data in certain respects, but it is not treated like meta data. IT professionals seem to view defining a database as an IT task, but populating it with data as the job of the business users. From this viewpoint, reference data should be the responsibility of the users. In reality, however, reference data tables must be populated long before an application goes live - even prior to testing; therefore, this task is often performed by IT staff. Unfortunately, there is sufficient ambiguity and such a desire to differentiate between the roles of IT and the business users that reference data tables are often populated with little thought. The result is that the reference data is rarely "normalized," which is to say that the reference data values are not standardized in terms of the domains they are supposed to describe. IT professionals pay great attention to the normalization of database design, but not to normalization of the contents of databases. Reference data tables typically constitute anywhere from 20 to 50 percent of a database in terms of numbers of tables. Therefore, a database whose design is in third normal form can still have up to half of its tables populated in such a chaotic manner that it is difficult to build an application on it or to utilize the data that it contains. If at least part of the effort that is typically spent on normalizing the database design were spent on normalizing the reference data, such situations could be avoided.

There are a number of ways in which reference data can be normalized. If these rules are applied, reference data tables can be populated in a clean and consistent manner that parallels and supports the normalization of the database design. Let us look at these rules one by one.

Definition of All Required Reference Data Tables

It is not uncommon to find database tables such as product or customer that contain code columns which should be foreign keys but for which no parent reference data table exists. The code values, therefore, exist only in the database table where they are found. If a new code must be added, it can only be added as part of a new record. If a list of the existing set of codes is required, it can only be made by extracting the codes from the table where they are found. Because no parent table exists, no descriptions are available for the codes, and no meta data is available for them (e.g., who added or updated the codes and when this was done).

Lack of parent reference data tables also means that a reporting dimension is missing from the database. This may not be a problem for transaction processing, but it is a problem for any reporting - especially if data must be reused in a data mart or warehouse.

Lack of defined reference tables appears to occur mostly in operational systems where programmers add a code column to an existing table for some newly discovered requirement. Adding the corresponding reference table is typically unnecessary for the functionality that the programmers are creating at this point in time. Unless the codes are directly meaningful to users, a major consequence is that the programmers must put the corresponding descriptions in program code. Defining data values in program source code is an unacceptable practice but, unfortunately, one that is all too common.

Therefore, it is important to define individual reference data tables to support all reference data used in a database. There should not be any "orphan" code columns in database tables that have no parent reference data tables.
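The simplest safeguard is a declared foreign key: once every code column has a parent reference data table, the database itself rejects orphan codes. A minimal sketch (table and code names are illustrative; SQLite requires foreign key enforcement to be switched on per connection):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute(
    "CREATE TABLE customer_type (code TEXT PRIMARY KEY, description TEXT NOT NULL)"
)
conn.execute("""
    CREATE TABLE customer (
        customer_id   INTEGER PRIMARY KEY,
        customer_type TEXT NOT NULL REFERENCES customer_type(code)
    )
""")
conn.execute("INSERT INTO customer_type VALUES ('RET', 'Retail')")
conn.execute("INSERT INTO customer VALUES (1, 'RET')")      # parent exists: accepted
try:
    conn.execute("INSERT INTO customer VALUES (2, 'XXX')")  # orphan code: rejected
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

With the parent table in place, the codes also gain a home for descriptions and meta data (who added a code and when) that an orphan column can never provide.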

Correct Level of Abstraction of Reference Data

Over-abstraction of reference data occurs where a reference data table is defined but contains data values that should really belong to different reference data tables or that represent different subtypes. These reference data tables typically lack any formal definition and have names such as product type, customer category or even record type. Because the tables have no formal definitions and their logical names are so vague, they can contain records that represent nearly anything. New records are added to such reference data tables for anything that can remotely be considered to represent a product type, customer category or record type. The result is that the tables contain sets of data that should really be broken out into many different reference data tables. For instance, the well-known currency code standard, ISO 4217, contains not only currencies but also precious metals, special financial instruments and even one code ("XXX") that is reserved for testing purposes.

As with the lack of defined reference data tables, this problem seems to be caused by a reluctance to add more reference data tables to a database. The lack of well-written, formal, strict definitions of tables (represented by entities in data models) adds to this problem. Negative consequences typically appear when a single entity can legitimately hold more than one of the values in the table at the same time. For instance, a customer status reference data table may contain records for "Gold," "Silver," "Bronze" and "Suspended." If "Gold" customers do not pay their bills on time, they are set to "Suspended." In reality, these customers are both "Gold" and "Suspended," not just "Suspended" as this design implies. When the customers settle their accounts, it is no longer possible to tell what their preexisting status was. The obvious solution is to have two tables, one called customer preference status with records for "Gold," "Silver" and "Bronze," and another table called customer preference status activity with records for "Active" and "Suspended."
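The two-table split described above can be sketched as follows (table names follow the example; column names and codes are illustrative, with SQLite used only as a convenient medium):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_preference_status (
        code TEXT PRIMARY KEY          -- 'Gold', 'Silver', 'Bronze'
    );
    CREATE TABLE customer_preference_status_activity (
        code TEXT PRIMARY KEY          -- 'Active', 'Suspended'
    );
    INSERT INTO customer_preference_status VALUES ('Gold'), ('Silver'), ('Bronze');
    INSERT INTO customer_preference_status_activity VALUES ('Active'), ('Suspended');
    CREATE TABLE customer (
        customer_id       INTEGER PRIMARY KEY,
        preference_status TEXT REFERENCES customer_preference_status(code),
        activity_status   TEXT REFERENCES customer_preference_status_activity(code)
    );
    -- A suspended Gold customer keeps both facts at once...
    INSERT INTO customer VALUES (1, 'Gold', 'Suspended');
    -- ...so reinstatement loses nothing: the preference status survives.
    UPDATE customer SET activity_status = 'Active' WHERE customer_id = 1;
""")
row = conn.execute(
    "SELECT preference_status, activity_status FROM customer WHERE customer_id = 1"
).fetchone()
print(row)  # ('Gold', 'Active')
```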

Over-abstraction can also be solved by having subtypes in the reference data table. These can either be traditional formal subtypes or hidden subtypes where records are identified as belonging to a particular subtype by foreign keys from other reference data tables. A reference data table that has hidden subtypes will typically be surrounded by a cluster of other reference data tables with relationships to it.

Over-abstraction of reference data can sometimes amount to no more than a database table having contents that do not match its definition, a situation known as "semantic disintegrity." What is special about reference data tables is that they are so often defined in such vague (i.e., over-abstracted) terms that almost any data value in them can be justified. This problem can be avoided by a stricter analysis of reference data tables and a readiness to introduce new reference data tables when necessary. Every effort must be made to make the contents of a reference data table match a reasonable level of abstraction that is represented by the definition of the table.

Resolution of Overlapping Reference Data Values

Reference data differs from other types of data in that individual values typically require definitions. Unfortunately, these are rarely present, and one of the problems that can arise is that data values overlap. For instance, a country code table may contain one value for "People's Republic of China" and another for "Hong Kong." Because Hong Kong is politically part of the People's Republic of China, there should be only one record, unless the enterprise that uses this table wishes to have separate reporting for its activities in Hong Kong. If this is the case, then the definition associated with the record for "People's Republic of China" should state that it excludes all activities in Hong Kong.

Overlapping values in reference data are usually much more subtle than this example. They often arise from poor analysis or the reference data not changing to keep up with changes in the underlying business. The result is that business rules are fired incorrectly in operational systems and data is reported in the wrong categories in informational systems.

The problem of overlapping data values can be mitigated by detailed analysis of reference data that specifically searches for overlapping data values. The problem can also be mitigated by periodic reviews of reference data to ensure that it correctly reflects recent changes in the business. Reference data changes slowly, but it does change; and if 50 percent of a database consists of reference data tables, the aggregate rate of change is greater than most IT professionals suppose. Periodic review of reference data is therefore necessary.
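Automated checks can catch only the crude cases of overlap, such as the same description appearing under more than one code; subtler semantic overlaps still need the human review described above. A small sketch of such a screen (the table extract and codes are hypothetical):

```python
from collections import Counter

# Hypothetical extract of a country reference table as (code, description) pairs.
country = [
    ("CN", "People's Republic of China"),
    ("HK", "Hong Kong"),
    ("PRC", "People's Republic of China"),  # duplicate description: likely overlap
]

# Flag descriptions (case/whitespace-normalized) that appear under more than
# one code. This is only a first-pass screen, not a semantic analysis.
counts = Counter(desc.strip().lower() for _, desc in country)
suspects = [(code, desc) for code, desc in country
            if counts[desc.strip().lower()] > 1]
print(suspects)  # [('CN', "People's Republic of China"), ('PRC', "People's Republic of China")]
```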

Elimination of Gaps in Reference Data Tables

Reference data values are often called "domains," meaning all the possible data values that can occur for the data element that the reference data table in question is supposed to define. For reference data tables that follow industry, national or international standards, the tables usually contain all existing values. However, for domains that are specific to a particular enterprise, often some values are missing.

The traditional way this problem is mitigated is to have one or more values in a reference data table called something such as "Other," "Unspecified," "Miscellaneous" or "Unknown." All the instances that do not fit with the well-defined values in the table are then put in one of these categories. This is what happened with the SIC (Standard Industrial Classification) codes used by the U.S. Census Bureau, among others. The SIC codes were created in the 1930s; however, by the 1990s, the economy had changed so much that entire sectors were being classified using "catch-all" codes that essentially meant that no "real" SIC code matched them. As a result, the SIC codes have been completely superseded by the new NAICS codes.

The occurrence of gaps in the values within reference data tables can be mitigated through detailed analysis that specifically looks for gaps and periodic reviews that match the contents of reference data tables to the underlying business.
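One simple periodic check is to monitor how heavily the catch-all values are used: a rising share of "Other" records signals a gap opening up between the code set and the business. A sketch (the usage counts and code values are hypothetical):

```python
# Hypothetical usage counts per code; 'OTH' is a catch-all value.
usage = {"1111": 520, "2362": 310, "OTH": 410}

CATCH_ALLS = {"OTH", "UNK", "MISC"}  # illustrative catch-all codes
total = sum(usage.values())
catch_all_share = sum(n for code, n in usage.items() if code in CATCH_ALLS) / total
print(f"{catch_all_share:.0%}")  # a high or growing share signals gaps in the code set
```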

Elimination of Different Levels of Reference Data

When the contents of a reference data table are examined, it is sometimes quite obvious that some records represent a lower level of detail than others. For instance, a country code table that represents real countries will have a single record for "United Kingdom." However, a country code table that is used to represent tax jurisdictions may have "United Kingdom," "Isle of Man," "Jersey," "Sark" and "Alderney." The latter four of these values represent islands that are governed by the United Kingdom but which have their own special tax laws. If a single country code table that includes tax havens is also used for purposes that require only "real" countries, such as reporting on business activities, then many kinds of problems can arise.

This issue has been formally recognized in the realm of data warehousing with the requirement for "conformed dimensions." In terms of reference data, this means that all reference data values in a table should be at the same level of detail (or "granularity"). Unfortunately, reference data tables in operational systems are often used to drive business rules, and it is simply too easy to introduce a new value at a lower level of detail to drive a specific rule or set of rules.

Again, vigilance against this specific problem is the best way to mitigate it. This should happen at the time of initial data population, and reference data tables should be periodically reviewed for data at different levels of detail. The solution may be to introduce additional independent reference data tables or to build a hierarchy of related reference data tables. A more advanced way to mitigate this problem, and the problems of overlapping values and gaps in values, is to have a change control process for all reference data values that specifically detects these errors in the normalization of reference data.

Reference data controls much of the behavior of systems and how the data in databases is interpreted. It has specific ways in which it should be normalized. The most important of these are:

  • All required reference data tables should be present in a database.
  • Each reference data table should have a detailed definition and should contain data that is not abstracted to levels above this definition.
  • Reference data values within a table should not overlap.
  • There should not be gaps among the values in a reference data table.
  • All values in a reference data table should be at the same level of detail.

If reference data values are not normalized according to these rules, then even if a database design is in third normal form, the database will have serious problems in terms of the applications it supports and the usability of its data. By contrast, if these rules are followed, then the database as a whole will provide a much higher return on investment, with a corresponding reduction in risk that could arise from data quality problems associated with reference data.
