We use classifications in many aspects of our daily lives. In the data world, we use classifications to categorize our business data so we can effectively identify a group of similar items. A well-formed classification enables business users to locate and perform analysis on exactly the subset of data they want. A well thought-out classification enables incorporation of new categories as the environment changes.
Before diving into some guidelines for constructing and testing classifications, let’s start with terminology.
Taxonomy: The science or technique of naming, identifying and classifying. This article explains aspects of a general taxonomy.
Domain: A group of items or objects characterized by a specific feature. The two domains used as examples in this article are financial securities and songs.
Classification (noun): A collection of ordered categories (delineated along one or more features) for a specific domain.
Category: A subset of similar domain objects. A category can be a group of more detailed categories.
Level: All the categories with the same parent category (as in multilevel classifications).
As a simple example, consider a classification of music – you might identify the categories (or genres) as classical, rock, country, pop, jazz, or hip hop. Consider food: Vegetables, legumes, fruit, meat, fish and dairy are all categories of the classification. If we wish to have more specificity of vegetables, we might have root vegetables (e.g., carrots, beets), plant vegetables (e.g., collards, lettuce, peas) and so on. When we include subcategories of vegetables, we are now talking about a multilevel classification.
The benefits of classifications are numerous. We refer to classifications when we communicate with colleagues. We name classifications so that systems can label them and attach qualities to them.
Well-documented classifications become part of our language. Classifications improve our communication: they reduce the probability of interpretation or reporting errors, and they provide the ability to differentiate, compare and analyze. From a purely utilitarian standpoint, they enable us to locate the single data item we want more quickly. A classification enables a tree-based search: with each level we traverse, we get closer to our desired target domain object.
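As a rough illustration of the tree-based search, here is a minimal Python sketch. The nested dictionary and the `items_under` helper are hypothetical, built only from the food example above; a real classification would normally live in a database.

```python
# Hypothetical multilevel classification built from the food example above.
FOOD = {
    "Vegetables": {
        "Root vegetables": ["carrots", "beets"],
        "Plant vegetables": ["collards", "lettuce", "peas"],
    },
}

def items_under(tree, path):
    """Follow one category per level and return whatever sits at the end."""
    node = tree
    for category in path:
        node = node[category]  # each step discards every sibling branch
    return node

print(items_under(FOOD, ["Vegetables", "Root vegetables"]))
```

Each category chosen narrows the candidates to a single branch, which is why deeper classifications pay off as the universe of domain objects grows.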
The larger the universe of data items being classified, the more valuable the classification. A large universe of domain objects lends itself well to a multilevel classification. For a large universe, a single-level classification likely has too many domain objects associated with each category, reducing its ability to isolate a small, specific group. In financial services, would an asset management firm be happy with classifying all its securities into just one level such as Equities, Fixed Income, Commodities, Currencies, and Derivatives? It is unlikely the firm would be content with this, because a good-sized asset management firm may keep track of more than 20,000 securities.
Constructing the Classification
When building classifications, ensure all categories apply to the same domain. Getting back to music, categories such as ‘70s or ‘80s songs would not be consistent with genres. However, you could have a separate time-based classification. Both genre and decade would be fair - but separate - classifications.
Here are a few points to consider when building a new classification:
- Give the classification a name so it clearly represents its domain.
- When considering a category, ask yourself whether it can apply to all objects. If so, it is probably an attribute of the domain object (like the recording technique of a song or whether a security’s income is taxable).
- If your new category is Boolean (usually yes or no), it may or may not be a good choice for a classification category. For example, in financial services, IS_TAXABLE is not a good category since it is really an attribute (every domain object has this attribute). However, the Boolean category IS_CONVERTIBLE is acceptable because it only applies to a subset of categories (e.g., fixed income).
- Do not define a category so narrowly that very few domain objects fall into it; such a category may rarely (if ever) be assigned to domain objects and, even more importantly, will rarely be used in reporting.
- Similar to the previous point, ensure some dispersion. If the expected time frame before a category is first assigned is relatively long (for instance, more than one year), it may be better to keep “unused/future” categories on the drawing board. With several categories having very few or no domain objects, you can expect business users to complain that a search using the category is not working. Seeing the category implies there are members.
- Stay brutally focused on the domain. Consider another example from financial services: Many firms tend to consider “municipals” as a type of security, putting it alongside other fixed income categories like convertibles, mortgage-backed, floating rate notes, etc. But the term “municipal” actually describes the issuer, not the security. When you test the classification (see below), you’ll see it fails the test. However, “municipal” is a very reasonable issuer classification.
- Socialize it, because feedback will foster consensus. Encourage multiple, frequent reviews to achieve incremental agreement and have several sample domain objects for each category in the classification. This aids understanding by the users of the classification.
- As a rule of thumb, define no more than 10 to 12 categories under the level above. More than that suggests a higher-level grouping or level in between may be beneficial. For example, you may introduce a level in between, with three categories, where each spans out to four (which equals the original 12). Consider how often manual assignment of the classification will be made.
- Try to select one or two words to name a category if possible, because they will likely be displayed in an application GUI tree or displayed on a report. If you make the category names too long, they’ll be harder to refer to, and users will abbreviate them with acronyms.
- In a multilevel classification, create as few “Other” categories as possible and place them at the leaf (lowest) level.
- In a multilevel classification, every category should be a formal subset of its parent category. If the new lower category can capture domain objects in any other level above, it is likely not a good candidate (or it may be an attribute of the domain object). An example of this is:
  - Level 1: Debt
    - Level 2: Issuer Backed
    - Level 2: Money Market
      - Level 3: Commercial Paper
There is ambiguity here because Commercial Paper could arguably fit under both Level 2 categories. One solution is to rename Issuer Backed to something (like Bonds) that does not overlap with Money Market, or to remove it altogether.
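Some of the structural guidelines above can be checked mechanically. The following Python sketch is one possible approach, not a prescribed method: the `check` function and the `(parent, child)` representation are assumptions made for illustration, and Commercial Paper is deliberately listed under both Level 2 categories to mimic the ambiguity just described.

```python
from collections import defaultdict

MAX_CHILDREN = 12  # upper end of the rule of thumb in the guidelines

# Hypothetical (parent, child) pairs; None marks a top-level category.
EDGES = [
    (None, "Debt"),
    ("Debt", "Issuer Backed"),
    ("Debt", "Money Market"),
    ("Issuer Backed", "Commercial Paper"),
    ("Money Market", "Commercial Paper"),
]

def check(edges, max_children=MAX_CHILDREN):
    """Return a list of rule violations found in (parent, child) pairs."""
    parents, children = defaultdict(set), defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
        children[parent].add(child)

    problems = []
    for child in sorted(parents):
        if len(parents[child]) > 1:
            names = ", ".join(sorted(str(p) for p in parents[child]))
            problems.append(f"{child} fits under more than one parent: {names}")
    for parent, kids in children.items():
        label = parent or "(top level)"
        if len(kids) > max_children:
            problems.append(f"{label} has {len(kids)} categories (limit {max_children})")
        if parent == "Other":
            problems.append("Other has subcategories; keep it at the leaf level")
    return problems

print(check(EDGES))
```

A check like this only catches overlaps that someone has already written down; deciding whether a domain object could arguably belong in two places still takes review by people who know the domain.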
A balanced classification means each branch from the root has an equal number of levels (but not necessarily categories). An example of a balanced classification is illustrated in Figure 1.
A non-balanced classification scheme might look like Figure 2.
There is no inherent problem in having a non-balanced classification, as long as your GUI/application can adapt and support variable levels. If, for some reason, you’re constrained by the GUI or application, just clone the levels downward until you have a balanced classification. To balance the unbalanced classification in Figure 2, add the following:
- Under Category 1.1.1, add Category 1.1.1.1
- Under Category 1.1.2, add Category 1.1.2.1
- Under Category 1.1.3, add Category 1.1.3.1
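Cloning levels downward can also be automated. The sketch below is a minimal Python illustration under stated assumptions: the classification is held as nested dictionaries, the `depth` and `balance` helpers are hypothetical, and the sample tree is invented rather than taken from Figure 2.

```python
# Hypothetical unbalanced classification; an empty dict marks a leaf category.
TREE = {
    "Debt": {
        "Bonds": {"Corporate": {}, "Government": {}},
        "Money Market": {},
    },
    "Equity": {},
}

def depth(children):
    """Count the levels in a forest of {category name: children} dicts."""
    if not children:
        return 0
    return 1 + max(depth(sub) for sub in children.values())

def balance(children, target):
    """Clone leaf categories downward until every branch has `target` levels."""
    balanced = {}
    for name, sub in children.items():
        if sub:
            balanced[name] = balance(sub, target - 1)
        elif target > 1:
            # A leaf that stops early gets a clone of itself one level down.
            balanced[name] = balance({name: {}}, target - 1)
        else:
            balanced[name] = {}
    return balanced

print(balance(TREE, depth(TREE)))
```

In this sketch the cloned categories reuse the parent's name, so "Equity" becomes Equity, then Equity, then Equity across three levels; whether clones should carry the same name or a distinct one is a decision for the classification owner.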
The category “Other” is always problematic. When someone recommends including an “Other,” it may indicate more analysis is required to complete the classification. “Other” can be a crutch. It is too easy to assign “Other” rather than conduct the research to assign the correct category. Reclassifying later (from “Other” to the proper category) may affect reports that were previously run – business users tend to distrust a system or its data when they get differing results in repeated runs.
What if you absolutely need “Other”? Put procedures in place to flag when “Other” is assigned to any domain object so you can reassign it as soon as possible. Often, business users will not include the “Other” category in analysis/reporting – but they may in fact be missing domain objects that they should include. Depending where “Other” is in the classification, it may contain unclassified domain objects from anywhere in the classification.
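One simple way to flag "Other" assignments is a periodic scan of the assigned categories. This Python sketch assumes each assignment stores its full category path; the `Assignment` class, the `needs_review` helper and the sample identifiers are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Assignment:
    object_id: str
    category_path: tuple  # one category name per level, top level first

def needs_review(assignments, placeholder="Other"):
    """Return the assignments whose path contains the placeholder category."""
    return [a for a in assignments if placeholder in a.category_path]

# Invented sample data for illustration only.
ASSIGNMENTS = [
    Assignment("SEC-001", ("Debt", "Bonds")),
    Assignment("SEC-002", ("Debt", "Other")),
]

for flagged in needs_review(ASSIGNMENTS):
    print(flagged.object_id, "needs a proper category")
```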
To model a multilevel classification in the database, consider creating X tables, where X equals the number of levels + 1. For example, Figure 3 shows how the four-level classification in the previous section might be modeled.
Table “Class_A_Categories” contains all the category definitions for Classification A, on all levels.
Table “Class_A” contains the valid combinations of categories on each level.
Table “Domain_Object” contains the objects, each of which takes a single classification from table “Class_A.”
A GUI/application can present the categories for selection by starting with ID_Level_1 (and showing the associated Class_A_Categories.Desc). Once the user has selected a category, it presents the associated ID_Level_2 categories, and so on until the leaf category has been selected.
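The three tables described above might be sketched as follows, using Python's built-in `sqlite3` module. This is only an illustration: the table names, `ID_Level_n` and `Desc` come from the text, but the remaining columns, the three-level depth, the sample rows and the `level_2_choices` helper are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Class_A_Categories (ID INTEGER PRIMARY KEY, "Desc" TEXT NOT NULL);
CREATE TABLE Class_A (
    ID INTEGER PRIMARY KEY,
    ID_Level_1 INTEGER NOT NULL REFERENCES Class_A_Categories(ID),
    ID_Level_2 INTEGER NOT NULL REFERENCES Class_A_Categories(ID),
    ID_Level_3 INTEGER NOT NULL REFERENCES Class_A_Categories(ID)
);
CREATE TABLE Domain_Object (
    ID INTEGER PRIMARY KEY,
    Name TEXT NOT NULL,
    Class_A_ID INTEGER NOT NULL REFERENCES Class_A(ID)
);
-- Invented sample rows: category definitions, then valid combinations.
INSERT INTO Class_A_Categories VALUES
    (1, 'Debt'), (2, 'Bonds'), (3, 'Money Market'),
    (4, 'Corporate'), (5, 'Commercial Paper');
INSERT INTO Class_A VALUES (1, 1, 2, 4), (2, 1, 3, 5);
""")

def level_2_choices(level_1_id):
    """Descriptions of the Level 2 categories valid under the chosen Level 1."""
    rows = conn.execute(
        'SELECT DISTINCT c."Desc" FROM Class_A a '
        "JOIN Class_A_Categories c ON c.ID = a.ID_Level_2 "
        'WHERE a.ID_Level_1 = ? ORDER BY c."Desc"',
        (level_1_id,),
    )
    return [desc for (desc,) in rows]

print(level_2_choices(1))
```

`Desc` is quoted because DESC is a reserved word in SQL. The same query pattern, filtered on the categories already chosen, would drive each subsequent level of the selection.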
Testing the Classification
Before distributing the classification for comment, find preferably three domain objects that fit into each category and display them in a table. This will give reviewers a concrete understanding of what you mean by each category. It will also give you a chance to verify your category selections.
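A quick tally can show which categories still lack examples before the review table goes out. The Python sketch below assumes the samples are kept as simple (object, category) pairs; the genre list, the placeholder song names and the `short_of_samples` helper are hypothetical.

```python
from collections import Counter

CATEGORIES = ["Classical", "Rock", "Jazz"]

# Placeholder sample objects; real song titles would go here.
SAMPLES = [
    ("Sample song 1", "Rock"),
    ("Sample song 2", "Rock"),
    ("Sample song 3", "Rock"),
    ("Sample song 4", "Jazz"),
]

def short_of_samples(categories, samples, minimum=3):
    """Map each category with fewer than `minimum` samples to its count."""
    counts = Counter(category for _, category in samples)
    return {c: counts[c] for c in categories if counts[c] < minimum}

print(short_of_samples(CATEGORIES, SAMPLES))
```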
Seek out edge cases and classify them. Can the domain object arguably fit into two categories? Explain in a short write-up. Fabricate domain objects and step through the decisions at each classification level to arrive at the correct category. This will provide the reviewer with confidence that business rules can be written to ensure accurate and consistent assignment.
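Stepping fabricated domain objects through the decisions can be written down as a small rule function. The sketch below is purely illustrative: the `asset_type` and `maturity_days` attributes and the one-year cutoff are invented assumptions, not rules taken from this article or from any market standard.

```python
def classify(security):
    """Return one category per level, decided top down."""
    # Level 1 decision (assumed attribute): debt versus everything else.
    if security["asset_type"] != "debt":
        return ("Equity",)
    # Level 2 decision (assumed rule): debt maturing within a year
    # is treated as money market.
    if security["maturity_days"] <= 365:
        return ("Debt", "Money Market")
    return ("Debt", "Bonds")

# Fabricated domain objects, including one that sits exactly on the boundary.
FABRICATED = {
    "90-day paper": {"asset_type": "debt", "maturity_days": 90},
    "edge case, exactly one year": {"asset_type": "debt", "maturity_days": 365},
    "10-year note": {"asset_type": "debt", "maturity_days": 3650},
    "common share": {"asset_type": "equity", "maturity_days": None},
}

for name, security in FABRICATED.items():
    print(name, "->", " / ".join(classify(security)))
```

Writing the boundary case into the test data forces an explicit decision about which side it falls on, which is the kind of edge case the short write-up should explain.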
Test-drive the category names with the business users. From their point of view, are the category names clear, meaningful, understandable and unambiguous?
Now that you’ve successfully established and published your classification you will need rules on how to manage it. Basically, this falls under the topic of data governance.
- If you have a data steward (but no data governance council), this is the person who should own, or at least sign off on, any classification changes.
- If you have a data governance council (DGC), the council will own the classification decisions.
- Whether the classification assignment is automated or manual (via GUI), ensure all change requests are documented, including justification and examples. Often a little research will show the change request can be accommodated with the existing classification.
- If your firm is large enough, record the minutes of the DGC meetings/decisions and post them on a website (like ISO does). Keep metrics about when and how many categories were adjusted. This will be informative for new users of the classification and will reveal trends as to how well the classification is serving its purpose.
- Publish upcoming changes to your classification. Future date them so consumers can make appropriate changes to their systems and reports (if required).
- Establish guidelines around timing, and include consideration of requests and decisions. If timelines are too long, business users are going to look for other solutions.
- One of the best ways of implementing a classification change is through software. Not only will you be changing the assignment logic going forward, you may also have to reassign existing domain objects.
- If you have to add categories, perform the same analysis effort and follow the same rules as when you constructed the classification. A sign of a good classification is when several requests for expansion can in fact be accommodated with the existing classification. But before adding, prove to yourself that the domain objects in question cannot be accommodated by existing categories.
- Periodically (once a year or quarter, as appropriate for the rate of change in your industry), test the classification with a few of the newer or newly created domain objects to see if they were correctly classified.
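Future-dated changes, as recommended in the list above, are often handled by giving each category an effective date range. The following Python sketch shows the idea under assumed names: the `CHANGES` log, its dates and the `categories_on` helper are hypothetical.

```python
from datetime import date

# Hypothetical change log: (category, effective_from, effective_to).
# None as the end date means the category is open-ended.
CHANGES = [
    ("Issuer Backed", date(2020, 1, 1), date(2024, 12, 31)),
    ("Bonds", date(2025, 1, 1), None),
    ("Money Market", date(2020, 1, 1), None),
]

def categories_on(changes, as_of):
    """Return the categories in effect on the given date, sorted by name."""
    return sorted(
        name
        for name, start, end in changes
        if start <= as_of and (end is None or as_of <= end)
    )

print(categories_on(CHANGES, date(2024, 6, 30)))
print(categories_on(CHANGES, date(2025, 1, 1)))
```

Because the upcoming category is already recorded with its start date, consumers can query the classification as of a future date and adjust their systems and reports before the change takes effect.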
For an example of data classification in the financial securities industry from the author, click here.
One rarely builds or encounters a perfect classification. Over time, classifications need to be modernized and adjusted to stay relevant. Governing how the classification is updated must be established, published and consistent. The classification must be strictly focused on its target domain. Most of all, communicate often during the classification building process, seek feedback and stay flexible to changing conditions which require adjustment or expansion of the classification.