Q: I am modeling a new data warehouse. I will be pulling some hierarchy data from a source system that stores multiple hierarchies in one table. I know that I need to break the data out into separate structures for each hierarchy (product line, organization, etc.) The trees are very uneven. One branch might have five levels, but another might only have two. Do I put each hierarchy into one table or do I break each hierarchy into separate tables (one for each level). If I do the later, then how do I handle the shorter branches?
Clay Rehm's Answer:
I have worked with clients that strove to get the hierarchies to line up at the same levels. This required input from the business community and sometimes resulted in values that were really placeholders than actual values. This way, the users could run queries and reports and the hierarchies all came out to the same number of levels (even if some of the values were somewhat substitute values).
Tom Haughey's Answer:
First and foremost, break the data into separate hierarchies, one for each type of data, such as a hierarchy for product, for organization, etc. It would be needlessly complex to store them all in one table. The queries to one composite hierarchy table and joins using it would be complex and probably have some performance issues. In addition, a composite hierarchy table would be more complex than any one hierarchy table because it has to account for a way of differentiating the individual hierarchies. For an analogy, consider the issue with a single code table encompassing all code data. There are advantages but there are disadvantages.
Second, this sounds like a classic ragged hierarchy. A ragged hierarchy is a hierarchy in which children do not have parents at the same next level. In other words, not all leaf level (or parent level) entries have the same depth. In a logical model, the major way to represent this is via a recursive structure, an example of which I have included below. In a physical model, there are a number of ways to implement a ragged hierarchy. To decide which is the best, it is important to gather some design factors, namely, the maximum and average number of levels in the hierarchy, the number of attributes at any level, the required response times, the number of concurrent users, and the technology (including hardware, DBMS and SQL). I refer you to my article on Hierarchies on www.tdan.com, a few months ago.
The maximum depth of the hierarchy is important. A hierarchy that needs to support (say) 30 levels will need to be much more flexible than a hierarchy that will only ever to 5 levels.
Here is a summary of some alternatives.
Recursion. If your hardware and DBMS are robust (and your SQL supports recursion), then can use pure recursion. It is easily the most flexible solution and can implicitly handle a ragged hierarchy. It is designed to handle uneven depths. That is the meaning of a ragged hierarchy. A recursion has not problem with the raggedness of the hierarchy. A typical bill of materials is very ragged and recursion is designed to support this. Yet, understand the limitations of any SQL extensions to support this, especially if you are using Oracle. In Oracle the result produced by CONNECT BY cannot be joined to other data. The usual objection to using recursion is two-fold: performance and ease of programming. In tests that I ran, this was the worst performing of the alternative presented here, but (importantly) it was not a disaster as it was in the old days. Ease of programming has been resolved with the new SQL extensions.
Figure 1 is a generic example of pure recursion. It assumes there are different hierarchy views (Hierarchy) and that the levels are named (Hierarchy Level):
Flattened dimension. If the number of levels and attributes is few, then consider flattening the hierarchies into a denormalized dimension. This is what is done in a pure star. In this all the levels are in one row. The denormalized dimension must have all the attributes sufficient to handle the maximum number of levels. Reporting hierarchies do not usually have the depth of a manufacturing bill of materials so it is usually not a problem to flatten them this way in terms of the width of the dimension. The issue is what to do about missing levels. If a level is missing, there are two approaches: 1) use a skip flag to tell the query to skip this level; SQL Server does it this way. 2) store the next higher level in the missing level.
Snowflake. If the number of levels is few and fairly predictable, and the number of attributes is many, then consider creating fixed but separate tables for each level of the dimension. In essence, I'm saying snowflake the dimension. This works well for regular or symmetric hierarchies, but will be complex for ragged hierarchies. Clive Finklestein in his first book, Information Engineering, shows a manufacturing example of this but is absurdly complex to do this. So this is not a good solution if the hierarchy is really ragged.
Nested Sets. This solution involves forward and backward pointers for each entry in the hierarchy. Joe Celko advocates this. See my hierarchy article referred to above for his and other references. Regardless of the complexity, if the hierarchies are not populous (that is, have a few number of instances) and the change to the hierarchies is limited, consider nested sets, as described by Joe. If the hierarchies have a large amount of data and change is frequent, the daily maintenance of nested sets can be prohibitive. This is so because the entire hierarchy will have to be re-built each time any change occurs. In some firms, organization and product hierarchies can change daily. I'm not saying the whole hierarchy will change, but any change, regardless of how small, will require the whole hierarchy to be regenerated. This is not true in a recursion; you only have to insert the new association between parent and child and change the next immediate members.
Descendent tables. If the number of instances and levels are comparatively few, consider descendent tables, which are sometimes (inappropriately) called speed tables. Speed tables is too generic a term for these. In a typical recursion, there are rows for a parent and its immediate children. You get all the descendents by navigating through the successive parent-child relationships. In descendent tables, there are rows for each parent and all of its children - immediate, intermediate and ultimate. There are two issues here. 1) If the number of items in the recursion is many, the resulting table can be huge. Because of this, descendent tables would be prohibitive in a large manufacturing bill of materials. For example, an F-15 plane consists of 75,000 parts. To store all descendents for each part would result in a huge table. 2) If there is a lot of maintenance to the relationships in the hierarchy. If the parts were added and updated regularly, the amount of maintenance would also be enormous.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access