Hub-and-Spoke Redux
Data Integration Advisor
Information Management Magazine, May 2005
Religious wars. We have them in IT all the time. Everyone debates architecture. The "high priests" of our trade guard it. The approaches espoused by Ralph Kimball and Bill Inmon are at the center of one of the biggest data management religious battles.
Kimball's approach defines the data warehouse (DW) as "The queryable source of data in the enterprise. The DW is nothing more than the union of all the constituent data marts."1 Star schemas, conformed data marts and no need for those nasty data warehouses!
Inmon states that "The data warehouse is at the heart of information and analytical processing for the corporation..." but also that it "...is not the answer to all problems..."2 Inmon recommends building data marts fed solely from the DW: "There is only one legitimate feed into the data mart; that feed comes from the data warehouse." Inmon disciples use normalized data, data warehouses, dependent data marts and that hub-and-spoke architecture. Who is right and who is wrong?
Advertisement
Only 16 percent of companies have implemented the Kimball approach with conformed data marts, versus 43 percent that use a DW with dependent data marts or a hub-and-spoke.3 If you add in the 18 percent response for central data warehouse-only implementations, the total following Inmon's philosophy is more than 60 percent. This is also consistent with the Inmon architecture, especially if you dig deeper into its makeup.
However, merely looking at the physical characteristics, such as whether the DW or data marts are located on separate severs or databases, does not take into account its data architecture principles. In many ways, the Inmon and Kimball camps agree on quite a bit. They have similar approaches despite these two glaring differences:
- Inmon's persistent data warehouse versus Kimball's transient data staging area.
- Where detailed and historical data should be stored.
Kimball and Inmon Similarities
Let's look at the similarities. Both camps value data and feel it should be managed. Both camps believe the star schema is the best modeling technique for business intelligence (BI) enabled by data marts. The debate over normalized versus star schema is not about whether it should be used in data marts, but whether it should be used in the data warehouse - which Kimball says does not need to be built anyway.
Very importantly, both camps believe that the key to consistent data, regardless of how it is physically instantiated, is conforming dimensions. Your corporate numbers can only be consistent if your dimensions, such as products, customers, suppliers and employees, are conformed.
The problem of inconsistent dimensions has also appeared with ERP system offerings. Companies that have implemented enterprise resource planning (ERP) systems are faced with inconsistently defined dimensions, such as product and customers, across ERP application modules or instances. The ERP vendors, who are also offering their own data warehousing packages, are addressing the inconsistency problems by conforming the dimensions within their systems to enable consistent operational reporting. Conforming dimensions is being referred to in the ERP systems as master reference data management. However, regardless of its name, conforming dimensions is a best practice for both business intelligence and operational reporting.
A more subtle, but maybe surprising similarity is that hub-and-spoke is the best form of data integration. Don't let the difference between a persistent DW versus a transient data staging area distract you from the fact that both camps recommend bringing the data to one construction site (DW versus data staging), which is a hub, and then distributing the data to data marts that are the spokes. Both sides recommend common ETL (extract, transform and load) processes to enable this hub-and-spoke architecture.
Why hub-and-spoke? When you examine ETL processing, that approach creates one-to-many (DW to data marts) versus many-to-many interfaces if you had a point-to-point approach from source systems to data marts. The one-to-many is initially simpler to implement (except that you have to design and plan up front) and generally more cost-effective in the long run. And, it's more likely to ensure consistent (and conformed) dimensions. It is not surprising that companies using an ETL tool rather than hand coding overwhelmingly prefer the one-to-many architecture because it is more cost-effective. Hand coding tends to create pockets of hidden interface code. Because they encourage sharing and increase visibility, ETL tools reduce this problem.
To Build or Not to Build
The big disagreement is whether you need to build a data warehouse at all. There are horror stories of data warehouses costing too much, taking too much time to complete, having too few users, having a lot of data but little information, becoming monolith entities, and the IT group supporting them being unresponsive to business needs because they are so busy with data warehouse care and feeding. Data warehouses are sometimes called "data jails" - or worse.
However, is the problem the data warehouse or how it was built? Too often DW projects become big-bang projects that overwhelm overworked, under-funded IT departments. Not building a data warehouse may be appealing, but Kimball's alternative isn't necessarily a piece of cake. You'd still have to implement a hub-and-spoke, conforming the dimensions and feeding the dependent data marts via a data staging area. That's the hard work. Storing the data in a persistent data warehouse is not that tough after that work is done.
From a pragmatic perspective, why not store the data in a data warehouse? Without a data warehouse, the data marts have to store all the detailed data they will ever need or it gets lost. Why not have dimensions and facts in one place? Why not have historical data available as a corporate asset? If the data staging area is not persistent, then how do you handle changes in organization and product structures (dimensions) that happen often enough in corporations today?
Regardless of what your philosophy is regarding data architecture, the reality is that most large corporations have many data warehouses or independent data marts. According to a survey by TDWI in Winter 2004, the average organization has two data warehouses, six independent data marts, 4.5 operational data stores (ODSs) and 28.5 spreadmarts (data shadow systems).
Page 1 of 2.







