William wishes to thank Mike Cross (firstname.lastname@example.org), data warehouse director at Rent-A-Center, for his contribution to this month's column.
This is the fourth in a series of articles on data warehousing concepts and best practices. In this article, we will address a new area that is fundamental to good data warehouse design - hierarchical relationships.
Parent-child or hierarchical relationships are a quintessential element of every organization and every data warehouse. They define the relationship between two entities (whether it be a subpart to a part, a worker to her manager or a store to a market) and underlie almost all BI reporting. For example, Rent-A-Center has more than 3,500 stores that report through markets and regions to 12 divisions. For home office use, our BI group aggregates and reports the data at the division level. Because data warehouses usually store data at a much lower aggregation level than how it is reported, designing, representing and traversing hierarchies is fundamental to the success of the warehouse.
There are multiple ways to architect hierarchical relationships within data warehouses and data marts, all of which have advantages and disadvantages. Before looking at the modeling techniques, there are some basic assumptions about the data and hierarchical relationships. We will use the term "parent and child" to imply a hierarchical relationship, but realize that most entities will be both children and parents, depending upon the level you are at within the hierarchy. The first and foremost assumption is that at any given point in time, a child may have only one parent; second, with the exception of the top parent, a.k.a. head, all children have a parent; and third, data is maintained at the leaf level, that is, a child that is not a parent.
There are two basic modeling techniques: a flattened (horizontal) hierarchy and a relative (vertical) hierarchy. The first technique simply employs a table with one column for each potential level within an organization and one record for each leaf entity. The table is then populated horizontally either top down (head first) or bottom up (leaf first). It is often the case where some entities will not all have the same number of hierarchical levels as other entities (such as an employee table) and the horizontal approach will create a "ragged" alignment.
Vertical hierarchies, because of their abstract nature, are more powerful and can replicate any hierarchy that meets the just-mentioned assumptions. They are, however, much more difficult to maintain, query and use for reporting. A vertical hierarchy simply defines a parent-child relationship between two entities. To determine the full ancestry of a given entity, you must recursively find the parent of a parent until there is no parent. Conversely, to find all descendants of an entity, you must recursively find all children of all children until there are no children. For database management systems that have recursive capabilities, this is relatively easy; for those that do not, it is not so easy.
A more robust form of the vertical hierarchy goes beyond parent-child and links all ancestors while providing a relative distance between the two. For example, in a simple parent-child structure where Mike's parent is John, Chad's parent is John, John's parent is Robert and Robert has no parent, four records would be created. In a robust vertical hierarchy that fully links all ancestors, two additional records would be added showing a relationship between Mike and Robert, and Chad and Robert both with a relative distance of two. This expanded technique, while unnecessary, greatly eases reporting and eliminates the need to recursively traverse the hierarchy.
An additional consideration for all hierarchical relationships is a time factor. You want to be able to show how a relationship looks not only now, but also in the past and possibly in the future. To accomplish this, simply add a start time and an end time to each record. The start time becomes part of the key, but should not be included as part of the reference or foreign key for parents in vertical hierarchies. End times are only populated when relationships cease to exist or are replaced by new relationships. If you want to do an end-of-year report at the division level that includes all that were open any time during the year, you simply need to find the alignment where the end date is between January 1 and December 31 or the start date is before December 31 and the end date is blank.
Both hierarchical techniques have their uses within data warehouses and data marts. For relationships that have well-defined levels that all organizational entities fall within, the horizontal hierarchy is by far the easiest to implement and understand, but may create significant challenges if your assumptions change. For hierarchies that are variable in depth, a vertical approach should be taken. At Rent-A-Center, we maintain five different pure parent-child hierarchies within the data warehouse. This technique was chosen because of its capability to adapt to almost any operational change despite promises that "this will never change" (it has), it is the easiest for us to maintain and because it ultimately consumes the least amount of storage. Within the data marts, these hierarchies are transformed into horizontal and robust vertical hierarchies, depending upon the needs of the reporting tool and user community.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access