Data administrators routinely work on tasks such as data modeling and maintaining corporate data resources. Often these tasks involve creating clear and accurate definitions, usually for entities and attributes. This can be a complex and even controversial task, especially when business users have conflicting ideas about important definitions. Exactly what, for example, is a "customer"?
Once definitions have been finalized, there is a temptation to think that the job of dealing with the semantic content of data is finished. Unfortunately, this is not necessarily the case, particularly where business users need to identify and classify the things they work with so they can process them or report on them as groups.
The issues surrounding the identification and classification of things are going to become more relevant as data administrators deal with more complex business problems, the volume of data increases, and the types of data become increasingly different. This article examines these problems in detail and provides a framework for handling them.
What Is the Problem?
Figure 1 presents a data model fragment that illustrates a typical situation for a hypothetical online brokerage.
Figure 1: Customer Subtypes in an Online Brokerage
In this data model, the Customer category is subtyped into Institutional Customer and Individual Customer. The Customer Type table has two records — one for Individual Customer and one for Institutional Customer. Each record in the Customer table has a Customer Type ID that categorizes it as an Institutional Customer or Individual Customer.
The brokerage may need to further categorize customers in some other way. For instance, there could be a "Preferred Customer" scheme of Bronze, Silver, Gold and Platinum Customers, who are categorized based on their number of online trades over the past year. Customers in these categories get certain additional services, regardless of whether they are individuals or institutions. We cannot introduce another subtype of Customer, so the usual solution is to create another table that contains the new classification scheme, and then associate this table with Customer. Figure 2 shows how this has been done.
Figure 2: Introduction of a Classification Scheme
The table called "Preferred Customer Classification" contains the set of possible values in the scheme (Bronze, Silver, Gold and Platinum), while the table called "Customer Preferred Customer Ranking" indicates which value a Customer has at a particular time. However, even within this simplified example, two problems tend to go unnoticed:
- How do we construct a complete classification scheme? Are the rules for assigning customers to each of our categories constructed so that we cannot place some customers in any category (a bad thing)? Or are there some instances in which the categories overlap (also a bad thing)?
- If we look at a customer, can we, through our computerized system or via a human operator, reliably classify that customer as Bronze, Silver, Gold or Platinum? If we cannot, the whole value of the scheme and the business need behind it may be put in question.
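The first question can be checked mechanically once the rules are written down. The sketch below assumes, purely for illustration, that each tier is defined by a range of online trades over the past year. The thresholds and the function name `find_gaps_and_overlaps` are hypothetical and are not part of the brokerage example.

```python
import math
from typing import Dict, List, Optional, Tuple

# Each tier covers trade counts from `low` (inclusive) to `high` (exclusive).
# A `high` of None means "no upper limit". All thresholds are invented.
Tier = Tuple[int, Optional[int]]


def find_gaps_and_overlaps(
    tiers: Dict[str, Tier],
) -> Tuple[List[Tuple[int, Optional[int]]], List[Tuple[str, str]]]:
    """Return (gaps, overlaps) for a set of trade-count tiers.

    gaps: inclusive ranges of trade counts that no tier covers
          (an upper bound of None means "and everything above").
    overlaps: pairs of tier names whose ranges intersect.
    """
    gaps: List[Tuple[int, Optional[int]]] = []
    overlaps: List[Tuple[str, str]] = []
    covered_to = 0      # every trade count below this is covered
    covered_by = None   # name of the tier that reaches furthest so far
    for name, (low, high) in sorted(tiers.items(), key=lambda kv: kv[1][0]):
        upper = math.inf if high is None else high
        if low > covered_to:
            gaps.append((covered_to, low - 1))
        elif low < covered_to:
            overlaps.append((covered_by, name))
        if upper > covered_to:
            covered_to, covered_by = upper, name
    if covered_to != math.inf:
        gaps.append((covered_to, None))  # nothing covers the top end
    return gaps, overlaps


# A scheme with no gaps and no overlaps (hypothetical thresholds).
sound_scheme = {
    "Bronze": (0, 50),
    "Silver": (50, 100),
    "Gold": (100, 250),
    "Platinum": (250, None),
}

# A scheme with a hole between Bronze and Silver and an overlap
# between Silver and Gold.
flawed_scheme = {
    "Bronze": (0, 50),
    "Silver": (60, 100),
    "Gold": (90, 250),
    "Platinum": (250, None),
}
```

Running `find_gaps_and_overlaps(flawed_scheme)` reports that customers with 50 to 59 trades fall into no tier and that Silver and Gold overlap. The same check on `sound_scheme` returns two empty lists.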
A Quick Tour of Taxonomy
Before considering these problems as they relate to data administration, it is useful to look at what another group of professionals has tried to do in this area. Biologists have been working with the issues of identifying and classifying living things for more than 200 years. A branch of their science, called "taxonomy," is devoted to this purpose. While data administrators can learn from taxonomists' successes, their failures are even more instructive.
The job of biological taxonomists is to help other people identify (put names to) the plants, animals and microorganisms that live on the earth. In the 18th century, taxonomists invented a binomial classification for naming living species. This consists of a genus name and a species name, for example, Homo (genus) sapiens (species) for man. Species were grouped into higher categories in a hierarchy, like order and phylum.
The key insight of taxonomists was that living things all have objective, observable characteristics. We can use these characteristics to identify a living thing as belonging to a particular species or to one of the higher categories. In Figure 3, we see that any animal with a backbone belongs to the category "Vertebrates." Certain vertebrates that have feathers are called "birds," while others with a different set of characteristics are called "mammals."
Figure 3: Key Characteristics of Certain Groups of Animals
|Characteristic||Vertebrates||Birds||Mammals|
|Has a backbone||X||X||X|
|Has sweat glands|| || ||X|
|Suckles its young|| || ||X|
The most important point is that the characteristics are objective and usually cannot be argued about. However, the classification scheme is a human invention, existing only in the minds of people.
Biological taxonomists were enormously successful in bringing an ordered approach to nature through the classification schemes they created and their descriptions of characteristics of living things. For instance, their schemes could be used to predict facts about an animal based on knowing facts about closely related animals. Unfortunately, problems soon sprang up. One problem was that when the theory of evolution came along, taxonomists decided their classification schemes were going to reflect the evolutionary lineages between species, forming a sort of family tree.
Ever since, there have been conflicting opinions about whether Species X is more closely related to Species Y or Species Z. All too often, each taxonomist with a different opinion gives Species X a different name and places it in a different place in the classification hierarchy. The result has been an increasing number of different names for the same species, which has then been assigned to different higher categories.
Users of taxonomic information can get very confused. For instance, a researcher who has found a fungus with antibiotic properties may want to know what species it is and what else is known about it. Different taxonomic works may lead him to many different species names, and he may not be able to figure out which one really applies. He also may have to search the scientific literature for each of these names to find more information on the fungus, a time-consuming task compared with dealing with just one name. Taxonomists have turned the construction of a useful classification scheme into a search for "absolute truth." This may be interesting to other taxonomists, but it makes life difficult for users of the classification scheme.
Lessons to Learn
What can data administration learn from this experience? First, characteristics have to be separated from any classification scheme. Data administrators often try to bury many characteristics in a single definition of each entry in a classification scheme. Thus, the same characteristics are mentioned repeatedly in definitions for different entries in the scheme — a kind of first normal form problem.
Another problem is that the definitions may not be readily available. Consider Figure 4, which describes Security Type for a brokerage application. The plan may be to implement Figure 4 as a reference data table in a database. If this is done, Figure 4 will probably not have a column for Definition, which nearly always is placed in documentation. Sometimes there is no documentation, but even when there is, it may be difficult to locate and tends to become obsolete.
Figure 4: Security Type
|S||Stock||A security that has no maturity or expiration date and which conveys ownership in a legal entity|
|B||Bond||A security that has a maturity date|
|O||Option||A security that has an expiration date and is based on a stock|
|X||Other||Anything that is not a Stock, Bond, or Option|
Data administrators are much better at keeping attribute definitions up to date — something their data modeling tools capture — than definitions for actual values that occur in physically implemented database tables. Usually no thought is given to separating characteristics into separate reference data tables.
One major advantage of separating characteristics from a classification scheme is that it makes the classification scheme easier to construct. Instead of trying to grapple with complex definitions, the developer can see a list of characteristics. Each entry in the scheme has a unique set of characteristics, and no two entries can have identical sets. It also becomes easier to spot gaps in the classification scheme — valid combinations of characteristics that have no corresponding entries.
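A minimal sketch of this idea follows. The characteristic names are paraphrased from the Security Type definitions in Figure 4, and the helper names `duplicate_entries` and `unclassified_combinations` are hypothetical. Whether a reported combination is actually valid for the business is a judgment the code cannot make. It only lists candidates for review.

```python
from itertools import combinations, product
from typing import Dict, FrozenSet, List, Sequence, Tuple

# Characteristics paraphrased from the definitions in Figure 4.
CHARACTERISTICS = (
    "has_maturity_date",
    "has_expiration_date",
    "conveys_ownership",
    "based_on_stock",
)

# Each entry in the scheme is described only by the characteristics it has.
SCHEME: Dict[str, FrozenSet[str]] = {
    "Stock": frozenset({"conveys_ownership"}),
    "Bond": frozenset({"has_maturity_date"}),
    "Option": frozenset({"has_expiration_date", "based_on_stock"}),
}


def duplicate_entries(scheme: Dict[str, FrozenSet[str]]) -> List[Tuple[str, str]]:
    """Pairs of entries that share an identical set of characteristics."""
    return [
        (name_a, name_b)
        for (name_a, set_a), (name_b, set_b) in combinations(scheme.items(), 2)
        if set_a == set_b
    ]


def unclassified_combinations(
    scheme: Dict[str, FrozenSet[str]], characteristics: Sequence[str]
) -> List[FrozenSet[str]]:
    """Combinations of characteristics that correspond to no entry."""
    defined = set(scheme.values())
    missing = []
    for flags in product((False, True), repeat=len(characteristics)):
        combo = frozenset(c for c, on in zip(characteristics, flags) if on)
        if combo not in defined:
            missing.append(combo)
    return missing
```

With four yes/no characteristics there are 16 possible combinations, so `unclassified_combinations(SCHEME, CHARACTERISTICS)` returns the 13 that the three entries do not cover. In Figure 4 these all fall under "Other," which is a useful prompt to ask whether any of them deserve an entry of their own.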
A natural place to look for characteristics is in the attributes defined for entities. Even so, many data administrators would agree that some characteristics represented in the definitions of classification schemes do not correspond to any attributes placed in the entities being classified. On the other hand, characteristics may be represented by attributes, and the question can be asked: Why should these characteristics be listed separately if they already exist in a data model?
For instance, a Security entity may have an attribute of Maturity Date. We may classify securities as "Bonds" if they have a maturity date. However, in a physically implemented security table, a record with a Maturity Date that is null may mean that we have not updated the record or that we are having problems finding the real value. Or perhaps the security represented by this record is not a bond and so will never have a non-null value in Maturity Date. In other words, the columns in a physically implemented database are for holding measurements of attributes, and a null value cannot always be used reliably to say that the characteristic represented by the attribute does not apply. Again, therefore, it is best to maintain a separate list of characteristics used by a classification scheme.
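The distinction between a characteristic and its measurement can be illustrated with a small sketch. The record layout, the field names and the `maturity_status` function are assumptions made for this example and do not describe any particular brokerage schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class SecurityRecord:
    security_id: str
    # Characteristic: does the concept of a maturity date apply at all?
    has_maturity_date: bool
    # Measurement: the actual value, where None means "not yet recorded".
    maturity_date: Optional[date] = None


def maturity_status(record: SecurityRecord) -> str:
    """Interpret a record without guessing what a missing date means."""
    if not record.has_maturity_date:
        if record.maturity_date is not None:
            return "inconsistent"  # a value exists where none should apply
        return "not applicable"
    if record.maturity_date is None:
        return "applicable, value missing"
    return "applicable, value known"
```

Here a missing date on a record flagged `has_maturity_date=True` is clearly a data quality issue, while the same missing date on a record flagged `False` is simply expected.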
When biological taxonomists create lists of characteristics, they also create what they call "keys." A taxonomic "key" is like a decision tree. It is a series of "couplets" — questions with two choices, usually "yes" or "no." By answering the questions, someone can identify a plant or an animal down to the species level. Here is an example for identifying families of flowers:
1. Does the flower have six petals?
Yes: Go to 2
No: Go to 3
2. Are the petals arranged with three on the inside overlapped by three on the outside?
Yes: The flower is a lily
No: Go to 3
3. The taxonomic key continues for other kinds of flowers.
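The couplets above translate naturally into a small data structure. The following is a sketch only: the names `KEY` and `run_key` are hypothetical, and because the article does not give the third couplet, the walk simply stops with no identification when it reaches it.

```python
from typing import Callable, Dict, Optional, Tuple, Union

# The outcome of an answer is either the number of the next couplet (int)
# or an identification (str).
Outcome = Union[int, str]

# couplet number -> (question, outcome if yes, outcome if no)
KEY: Dict[int, Tuple[str, Outcome, Outcome]] = {
    1: ("Does the flower have six petals?", 2, 3),
    2: (
        "Are the petals arranged with three on the inside "
        "overlapped by three on the outside?",
        "lily",
        3,
    ),
    # Couplet 3 onward is not given in the example, so it is not defined here.
}


def run_key(
    key: Dict[int, Tuple[str, Outcome, Outcome]],
    answer: Callable[[str], bool],
    start: int = 1,
) -> Optional[str]:
    """Walk the key, asking `answer` each question, until something is identified.

    Returns None if the walk reaches a couplet the key does not define.
    """
    step = start
    while step in key:
        question, if_yes, if_no = key[step]
        outcome = if_yes if answer(question) else if_no
        if isinstance(outcome, str):
            return outcome
        step = outcome
    return None
```

In practice `answer` could prompt a data entry operator or look up recorded characteristics. Keeping the key as data rather than as nested conditionals means it can be published and reviewed alongside the list of characteristics.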
Taxonomic keys often contain very technical language and are usually accompanied by a glossary of terms, illustrations of characteristics, and descriptions of the form, behavior and localities of the species they "key out." Some of this may be overkill for data administration, but it is a much better approach than assuming that a data entry operator simply "knows" how to classify something or at best can do side-by-side comparisons of several long and intricate definitions to make a judgement.
Data administrators probably can accept that lessons can be learned from biological taxonomists. However, modern information technology differs from biological taxonomy in some important ways. One is that there may be many different ways of classifying the same things, and in fact this may often be necessary. Unfortunately, this is not always recognized. Typically, once a classification scheme is created, it is thought to be capable of serving any number of needs.
Going back to our example of Security Type, we may be able to classify securities as stocks, bonds, options and others. These categories may work for one requirement, like trade settlement, but not another, like tax calculations. The same characteristics can be grouped one way to categorize things for one purpose but may need to be grouped in a totally different way to be categorized for a different purpose. Of course, some characteristics are only relevant for one classification scheme.
Unless data administrators recognize this need to construct classification schemes for a purpose, people will tend to think that a given set of things, like securities, can only be described by one classification. They then get frustrated when this version of "absolute truth" does not work for the different processes and business rules that need to be applied to the things being classified. The most successful approach is to recognize that characteristics are objective — and that any number of classification schemes can be built upon them.
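To illustrate, the sketch below builds two schemes on the same set of characteristics. The first mirrors the Security Type definitions in Figure 4. The second, which groups securities by whether they end on a known date, is invented purely for this example, as are the names `SECURITY_TYPE`, `TERM_SCHEME` and `classify`.

```python
from typing import Callable, Dict, FrozenSet, List

Characteristics = FrozenSet[str]
Scheme = Dict[str, Callable[[Characteristics], bool]]

# Scheme 1: paraphrases the Security Type definitions in Figure 4.
SECURITY_TYPE: Scheme = {
    "Stock": lambda c: (
        "conveys_ownership" in c
        and "has_maturity_date" not in c
        and "has_expiration_date" not in c
    ),
    "Bond": lambda c: "has_maturity_date" in c,
    "Option": lambda c: "has_expiration_date" in c and "based_on_stock" in c,
}

# Scheme 2: a hypothetical grouping over the same characteristics.
TERM_SCHEME: Scheme = {
    "Dated": lambda c: "has_maturity_date" in c or "has_expiration_date" in c,
    "Undated": lambda c: (
        "has_maturity_date" not in c and "has_expiration_date" not in c
    ),
}


def classify(scheme: Scheme, characteristics: Characteristics) -> List[str]:
    """Return every entry whose rule matches the observed characteristics.

    Exactly one match is the healthy case. No match indicates a gap in
    the scheme, and more than one indicates an overlap.
    """
    return [name for name, rule in scheme.items() if rule(characteristics)]


bond = frozenset({"has_maturity_date"})
stock = frozenset({"conveys_ownership"})
```

The same `bond` characteristics yield `["Bond"]` under one scheme and `["Dated"]` under the other. The characteristics are recorded once, and each scheme is just a different set of rules over them.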
In conclusion, data administrators can be drawn into situations where they need to build a classification scheme. They also can face situations in which a classification scheme is difficult to apply. They can take several measures to overcome these problems:
- Build a list of all the characteristics that are important to the classification scheme.
- Map each of these characteristics to each proposed entry in the classification scheme to make sure there are no duplicates or unwanted gaps.
- Publish the list of characteristics and mapping so that everyone knows how to categorize the things being classified.
- Build a glossary of the terms to accompany the list of characteristics, and construct a decision tree.
- Make a list of the goals that the classification scheme is designed to meet. There should be only one or a few.
- Construct additional classification schemes if the goals are too divergent.
The tools to accomplish this do not really exist. However, understanding the problem is more important than the tools required to manage the solutions. In fact, much of this can be done within the confines of existing data modeling tools used in combination with a word processor.
Malcolm Chisholm, Ph.D., has over 25 years of experience in enterprise information management and data management and has worked in a wide range of sectors. He specializes in setting up and developing enterprise information management units, master data management, and business rules. His experience includes the financial, manufacturing, government, and pharmaceutical industries. He is the author of How to Build a Business Rules Engine, Managing Reference Data in Enterprise Databases, and Definition in Information Management. He writes numerous articles and is a frequent presenter on these topics at industry events. Chisholm runs the websites http://www.bizrulesengine.com, http://www.refdataportal.com and http://www.data-definition.com. Chisholm is the winner of the 2011 DAMA International Achievement Award.