If we look at a data model, is it just a tangle of entities and relationships, or is there a structure that is not immediately apparent? Of course, we know that different entities can be grouped together because of relationships between them or because they participate in the same business processes. Even so, is there a structure within databases in general that is independent of a particular business domain and that does not rely on relationships between entities? This article argues that such a structure does exist, and that any database can be viewed as a successive series of layers of different kinds of data. Each layer has its own particular properties and constraints for the layers below it.
Before describing this general structure, it is worth questioning whether attempting to categorize entities in a data model is just an academic exercise, or if it has any real significance for data administrators and other IT professionals. Categorization can be helpful if it allows us to target specific data administration tasks only to those entities for which these tasks are relevant. No data administrator wants to blindly execute the same tasks for every entity in a data model if this is not necessary. In addition, there are other IT professionals who work with database designs and physically implemented databases who may also be able to benefit from an approach that lets them distinguish between different classes of entities or tables.
Introducing a Structure
A database records related facts of interest to human beings. Occasionally, these facts are about a business domain that never changes, but it is much more common that a database records facts about some kind of activity. Where there is activity, "things" happen, and these "things" are usually called transactions. The instances of each transaction are sufficiently similar that we can generalize the facts we need to know about them and design a database ahead of time to store these facts. If we accept that the basic role of a database is to record transactional data, then it is possible to generalize the design of the database as shown in Figure 1. Figure 1 shows that a database can be thought of as consisting of six layers of different kinds of data. Let us look at each one of these layers in turn.
The first layer is meta data, specifically the meta data that describes the design of the database. This layer of data must be created before any other layer. That is, we must design the database and then create the actual physical tables before we can record any facts about transactions. It may be argued that meta data does not constitute actual data in the database and therefore is not a real layer. However, the meta data does usually exist as data within the database of a data modeling tool. At least some of it must be implemented in the system catalog, or repository, of the RDBMS (relational database management system) in which the physically implemented tables reside. Also, many applications are actually built with tables that hold meta data that has been extracted from the data modeling tool. This can be very useful for supplying definitions of entities and attributes to users of the application and can be used to do very powerful things such as automating business rule definition.
Figure 1: The Six Layers of Data in a Database
The meta data constrains the layers of data that lie below it. It controls their design (and thus their structure), and through things like referential integrity it can also control their behavior. None of the layers that lie below the meta data influence the meta data in any way.
The next layer of data is reference data. This is often called "lookup tables" or "code tables." This data is used to categorize other data in the database, or to relate this data to the world outside the enterprise in which the database is implemented. Examples of reference data are country codes and currency codes. Very few enterprises create or manage countries and currencies. These are things which the enterprise does not process as transactions, but which exist in the world around us, and create a framework within which the enterprise's data must be reported for a general understanding of the data. Reference data can also include classification schemes such as SIC codes or credit ratings. There is also some reference data that is connected to database design. This includes entities that contain type codes and status codes. Type codes control the number of subtypes of a particular entity. For example, the Person entity may be subtyped into a Male Person entity and a Female Person entity with an associated Person Type table containing two records, one for Male Person and one for Female Person. Status codes are similarly placed in tables that control the life histories of entities.
Reference data tables are always parents of other tables in the layers below them in a database. There are no tables in the layers beneath reference data that are parents of reference data tables. Reference data can also constrain behavior of tables in the lower layers as is the case the type codes and status codes. Additionally, it is usual that reference data tables are fully populated before a database becomes operational.
Lying below the reference data layer is the layer of data that describes the organizational structure of the enterprise. This "enterprise structure" layer is often poorly represented, or entirely missing, in operational databases. However, individual units of an enterprise always have special roles and responsibilities in operational activities, carrying out the policies of the enterprise, and striving to attain particular goals. It is absolutely vital to describe the structure of the enterprise within the enterprise's information architecture to do things such as transaction workflow, financial accounting and monitoring business metrics that reveal the efficiency and effectiveness of individual units. Like reference data, the tables that hold enterprise structure data are parents of tables in the layers below them and have to be fully populated before any database that contains them becomes operational.
The next layer in the database is the transaction structure data. This is the data that represents the parties to any transaction. Common examples are Customer and Product, although Customer is often abstracted into a higher-level entity along with "Vendor," "Employee" and any other person or organization that plays a role with the enterprise. The parties to any transaction – that is, the things that interact to make the transaction happen – must be present before the transaction occurs. Of course, some databases are so poorly designed that this data is not captured; however, but the parties are certainly present in real life before the transaction occurs. Transaction structure data can represent some things that exist outside of the enterprise, such as people and organizations, a trait that is also true of reference data. However, unlike reference data, the things represented by transaction structure data have a significant interaction with the enterprise, and require that many more facts have to be collected about them. There is growing interest in this layer of data as enterprises seek to better understand the parties to their transactions. Customer relationship management (CRM) tries in part to provide a better picture of the behavior of customers, sometimes with facts gathered outside of the enterprise's transactions. Similarly, part of supply chain analysis includes the goal of better understanding vendors.
The entities that hold transaction structure data are parents of the entities in the layers below them and constrain their behavior. For example, an order cannot be processed for a customer or product that does not exist.
The next layer holds the transaction activity data. From the earliest days of IT, this has been the layer that has been the real target of data processing. In fact, this layer is so familiar it needs little introduction or description. It has a high volume of data that is of very significant interest to an enterprise. Transaction activity consists of instances of transactions such as orders, purchases, payments, trades, renewals and claims, all of which are very familiar to IT professionals. The volume of instances of these transactions and the speed with which they need to be completed make it impossible for most enterprises to use human beings to process them; therefore, computerization is essential.
Normally, individual transactions are thought to be represented by the most detailed data structures in any business domain. However, there is one other layer of data below individual transactions. This is the data that records changes to the individual transactions. It may simply be there to keep track of when and which operator last updated a transaction. Sometimes it is more elaborate, tracking more information about who changed transaction data, why this was done, when it was carried out, where it happened, etc. Certain transactions are so important to the enterprise that this kind of information is maintained with great care. For some transactions it may even be a legal or regulatory requirement. There is also another aspect to this layer of data – it is recorded in logs that servers maintain to track what they are processing. Hence, database servers are able to roll back transactions and Web servers can manage their resources. In general, it is rare that this kind of data is of interest to the enterprise, but it can be. It tends to be more relevant to the technology, whether it is server software or custom-coded applications that need information to help securely process transactions or provide assistance in debugging. Strangely, this layer of data is not really about business facts, but rather about the way they are processed. As such, it is a kind of meta data. We began at the highest layer with design meta data and have drilled down to the lowest layer, to once again encounter meta data.
Going from Layer to Layer
If we look at a database as consisting of these layers of data, we realize that there are certain progressions as we move from layer to layer. The first is that there is a radical increase in data volume. Regrettably, there is usually very little design meta data for databases. In fact, many people just work with the physical names of tables and columns, and datatypes of the columns – the minimum required to physically implement a database. There is a higher volume of reference data, with "code tables" describing things like country, currency, industrial classification, "record" type, and "record" status. These can constitute a few hundred or even a few thousand rows. Enterprise structure data tends to resemble reference data in terms of volume, though it can be quite extensive for large and complex organizations. However, there is a big increase in data volume when we move to transaction structure data. There can be hundreds of products, thousands of employees, and tens of thousands of customers. Each of these parties to the enterprise's transactions can have an enormous number of different kinds of facts recorded for them. Next, we come to the transaction activity, and here records can be in the millions. One of the problems with transaction activity data is that its volume can easily exceed a database's maximum capacity, and it needs to be moved out or archived. Lastly, the transaction audit data is typically so large that logs are cleaned out at regular intervals, and the data on them is not even archived.
Unlike volume, scope decreases as we progress through the layers. A single item of design meta data has enormous influence over the other layers. The scope of design meta data can range from a single column to relationships among several tables and affects every item of data stored in these structures. Design problems that require a restructuring of an operational database are universally feared. Reference data can also have a wide scope. If we look at a data model, there are often many relationships from each little reference data table to the other, more business-relevant tables in the model. A currency code could be used in every table where a monetary amount is stored. Thus, if an incorrect code is used, data in many tables may be affected. Scope declines when we consider transaction structure data. If we store incorrect facts about one customer, that may only affect the transactions in which that particular customer participates. Scope declines again when we move to the transaction activity data layer. Here, if something goes wrong with one transaction, it is only that transaction which is affected. Finally, at the layer of transaction audit data, the scope is limited to a single change event that happens in the life cycle of an individual transaction. If something goes wrong, we may have to repeat a piece of work that moves the transaction from one state to another, but it is rare that the entire transaction is lost.
There are many other properties besides scope and volume that experience quantum leaps when we move from layer to layer, but there is not space to discuss them all here. They include – but are not limited to – the importance of data quality, the rate of update activity, the points in time at which data values can be defined, the semantic content of data, and the life span of data.
The Usefulness of Thinking in Layers
By assigning entities to these different layers it becomes possible to be more efficient in performing data administration tasks. For instance, we know that the reference data layer has, in general, a small volume of data with a slow rate of update. Therefore, it is probably not worthwhile to conduct volumetric analysis on the tables in this layer in order to plan for database sizing and growth. Yet reference data tables can be represented by up to 50 percent of the entities in a data model. Without being able to categorize these entities as belonging to the reference data layer, we may actually try to gather information on the size and growth rates of all these tables, which may require a considerable effort. Conversely, we know that the reference data layer has a wide scope, and that data quality is very important in this layer. Therefore, if we are to conduct a rapid data quality assessment of a database, a useful task may be to examine all the tables of this layer to look at the data values they contain to determine if they are incorrect in any way. We could not do this quickly for one of the lower layers.
There are also certain unique properties of each layer. For instance, we know that the reference data layer acts as a bridge to the world outside the enterprise, and describes things that do not enter into the transactions of the enterprise. Thus, the data in this layer is very likely to have values defined in industry-wide or international standards, like Moody's credit ratings, or ISO country codes. This kind of consideration simply does not apply to the other layers, so we should not expect to be able to think of such standards for the entities they contain.
The categorization of the general structure of a database cannot solve all issues, but it can help us direct our efforts to the part of the database from which we can obtain the greatest return. This can help data administration functions that are often understaffed yet face an increasing workload as enterprises understand the strategic value of information, and as increasingly complex business models are reflected in information architectures.
Malcolm Chisholm, Ph.D. has over 25 years of experience in enterprise information management and data management and has worked in a wide range of sectors. He specializes in setting up and developing enterprise information management units, master data management, and business rules. His experience includes the financial, manufacturing, government, and pharmaceutical industries. He is the author of How to Build a Business Rules Engine and Managing Reference Data in Enterprise Databases and Definition in Information Management. He writes numerous articles and is a frequent presenter on these topics at industry events. Chisholm runs the websites http://www.bizrulesengine.com, http://www.refdataportal.com and http://www.data-definition.com. Chisholm is the winner of the 2011 DAMA International Achievement Award.