In last month's column, we looked at a problem associated with the standardized use of reference data for information exchange. In that column, we explored how a business application's use of what was expected to be standardized data varied slightly from its original meaning, leading to a lack of synchronization across the exchange enterprise. In this column, we explore a different aspect of the same problem - the definition and use of data values grouped within semantic hierarchies known as taxonomies.
A natural use of a business intelligence (BI) application is to attempt to understand individual behavior based on an enveloping classification scheme. A classic example from the media industry is characterizing radio or television ratings by gender and age group, such as "most popular with males aged 18 to 35." Here, there is a multidimensional classification scheme - first the population is divided by gender, then each gender group is broken into age ranges. In turn, performance is assessed based on the ratings for specific "products" within what is considered to be a simple taxonomy.
A taxonomy is a hierarchical means for classification organized according to a predefined system. The system should provide a natural dissection of the data elements into hierarchical groupings, with each set of subgroups unambiguously distinct from the others, yet all subgroups covering all the possibilities. The implication is that a reasonable taxonomy will provide clarity when slicing and dicing, but only when care is taken to abide by the basic rules, which we will discuss.
Much of what people encounter in everyday life is related to a hierarchical taxonomy, and many BI applications rely on this for analysis. Different kinds of analyses use different kinds of taxonomies, such as geographical, order-based and product-based.
A geographical taxonomy provides a hierarchy defined by encompassing location. For example, a street address is located on a street, which is in a neighborhood, which is in a town, which is in a county, which is in a state, which is in a country. Individual behavior is categorized based on location, and aggregation is performed along the encompassing boundaries. An example of an order-based taxonomy may be the internal structure of a company - lines of business incorporate divisions, which contain groups made up of individuals. Aggregation may take the form of productivity and performance metrics, either grouped within lines of business or measured for an individual. Product-based hierarchies might be aligned by product class (e.g., automotive supplies), then product category (e.g., air fresheners), then product name (e.g., "Pine-Fresh"). A BI application would measure revenues, margins and profitability across the product hierarchy.
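As a minimal sketch of aggregation along a product hierarchy, the Python below rolls revenue up from product name to category and class. The taxonomy table, the extra product names beyond "Pine-Fresh", the revenue figures and the function name roll_up are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical reference data: product name -> (product class, product category).
# Only "Pine-Fresh" comes from the example in the text; the rest are made up.
TAXONOMY = {
    "Pine-Fresh": ("automotive supplies", "air fresheners"),
    "Ocean-Breeze": ("automotive supplies", "air fresheners"),
    "Wiper-Pro": ("automotive supplies", "wiper blades"),
}


def roll_up(sales):
    """Aggregate (product name, revenue) pairs at each level of the hierarchy."""
    by_category = defaultdict(float)
    by_class = defaultdict(float)
    for name, revenue in sales:
        product_class, category = TAXONOMY[name]
        by_category[(product_class, category)] += revenue
        by_class[product_class] += revenue
    return dict(by_class), dict(by_category)


# Illustrative revenue figures (assumed, not from any real data set).
sales = [
    ("Pine-Fresh", 120.0),
    ("Ocean-Breeze", 80.0),
    ("Wiper-Pro", 50.0),
    ("Pine-Fresh", 30.0),
]
by_class, by_category = roll_up(sales)
```

Because each product name maps to exactly one category and one class, every sale is counted once at each level, and the totals at a higher level equal the sum of the totals beneath it.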
Taxonomies are great for defining categorization, especially in OLAP-style analysis. In fact, taxonomies themselves represent business knowledge as reference data, and all data that is related to that reference data is affected by the quality criteria assigned to the enumeration of codes, the mapped values, the number of levels within the hierarchy and the methods for insertion into the hierarchy.
Despite the apparent simplicity of a value hierarchy, problems can emerge without proper attention to these two basic concepts:
1. At each level in the hierarchy, there should be an unambiguous distinction between the values. This means that there should not be any overlaps in definition (or in the values collected at lower levels of the taxonomy), nor should there be any gaps (i.e., missing values within the level).
2. There must be a coordinated approach to modifying the taxonomy. In other words, when it is clear that there may be gaps in the value set or that there are values that imply the introduction of new levels in the hierarchy, there must be a "political" framework in which the new elements are proposed, debated, modified and approved to maintain synchrony.
Not abiding by these rules will have some obvious negative consequences, chiefly a gradual erosion of the hierarchy's value. The existence of gaps or overlaps in the value sets leads to difficulty in presentation of results. For example, in a pivot table, how are elements aggregated when the same values appear under multiple subgroupings? Similarly, a lack of coordination regarding the injection of new information into the hierarchy allows for semantic dissonance, as individual participants begin to overload values with meanings that are not agreed to by the rest of the constituency.
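The first rule can be checked mechanically when a level is defined by numeric ranges, such as the age groups in the ratings example. The following is a rough Python sketch, not a definitive implementation: the function name check_level, the inclusive integer ranges and the sample age groups (deliberately flawed) are assumptions made for illustration.

```python
def check_level(ranges, low, high):
    """Return (gaps, overlaps) for inclusive integer ranges meant to cover low..high.

    Each range is a (label, start, end) tuple. A gap is a (start, end) span that
    no range covers; an overlap is the (label, start, end) portion of a range
    that an earlier range already covers.
    """
    gaps, overlaps = [], []
    expected = low  # next value that still needs to be covered
    for label, start, end in sorted(ranges, key=lambda r: r[1]):
        if start > expected:
            gaps.append((expected, start - 1))
        elif start < expected:
            overlaps.append((label, start, min(end, expected - 1)))
        expected = max(expected, end + 1)
    if expected <= high:
        gaps.append((expected, high))
    return gaps, overlaps


# Hypothetical, deliberately flawed age groups: age 35 falls into two groups,
# and ages 50 through 54 fall into none.
age_groups = [("18-35", 18, 35), ("35-49", 35, 49), ("55+", 55, 120)]
gaps, overlaps = check_level(age_groups, 18, 120)
# gaps     -> [(50, 54)]
# overlaps -> [("35-49", 35, 35)]
```

A check like this only catches structural problems in the value set. It does nothing for the second rule, which depends on people agreeing on what the values mean.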
At first glance, a coded mapping of values within a two- or three-level hierarchy appears to be relatively simple, but I am sure that every reader has a story that describes the complexity of taxonomy management. If you'd like to share your story, e-mail me (firstname.lastname@example.org), and I will relate your experiences in future columns!