By Joe Oates, Chuck Kelley, Michael Jennings, Clay Rehm, Les Barbusinski and Scott Howard
Published April 2, 2003

Q:

We met with one of the BI vendors to discuss query and reporting and multidimensional reporting. As the discussion evolved, we talked about discarding the data marts (star schema) and just building cubes instead. We have a relational enterprise data warehouse. We planned on creating data marts based on the KPIs identified by the customer. A query and reporting tool would create reports, and users could query the data directly. Can cubes replace marts? When do you need both?

A:

Les Barbusinski’s Answer: First off, I’m not a big fan of cubes. They are inflexible (in most cases, a new report means a new cube), problematic to maintain (i.e., expect your pager to go off almost every night), difficult to tune (i.e., cubes are proprietary black boxes with few tunable parameters) and offer limited scalability (i.e., most cubes hit a performance "wall" above 2-to-5 gigabytes in size). Indeed, most of the shops I’ve seen implement a cube-based (a.k.a. MOLAP) business intelligence tool have subsequently purchased a star-schema based (a.k.a. ROLAP) BI tool.

At any rate, regardless of which BI tool you ultimately choose, you should go ahead and build your data marts. Star schema data structures offer several benefits besides facilitating OLAP functionality. First, dimensional data structures more closely mimic the way a business works and, as such, are more easily understood and navigated than other structures. Second, they offer significant performance advantages over the more normalized data structures found in the data warehouse. Why are these benefits important?

At some point, as your data warehouse environment matures, it will reach a critical mass: a point where the quality and quantity of cleansed, synchronized and integrated information in the data warehouse makes it a vital resource to the entire enterprise. When this happens, the data warehouse will be expected to serve up large quantities of information on demand in a variety of mediums (e.g., Web-based reports, extract files, e-mails, wireless alerts, etc.) using a variety of tools (e.g., flat reporting tools, ad hoc query tools, BI tools, data mining tools, statistical analysis tools, Web services, messaging systems, etc.). This is the classic data delivery module mentioned in Claudia Imhoff's articles on the Corporate Information Factory. If you have a substantial amount of your data and pre-calculated metrics embedded in proprietary cubes, it will be difficult to access and deliver in the desired medium. Having the data in relational structures, on the other hand, makes the data visible, navigable and accessible. Furthermore, queries can be tuned, and a variety of tools can be brought to bear. Flexibility is a wonderful thing.
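To make "visible, navigable and accessible" concrete, here is a minimal star schema sketch using Python's built-in sqlite3 module; every table and column name is invented for illustration, not taken from any particular product or the original discussion:

```python
import sqlite3

# Hypothetical sales star schema: one fact table surrounded by
# denormalized dimension tables (all names invented for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, cal_date TEXT,
                          month TEXT, year INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT,
                          category TEXT);
CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, city TEXT,
                          region TEXT);
CREATE TABLE fact_sales  (date_key INTEGER, product_key INTEGER,
                          store_key INTEGER, units INTEGER, revenue REAL);
""")

# A typical business question reads almost like the English it answers:
# "total revenue by region and month".
for row in conn.execute("""
    SELECT s.region, d.month, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_store s ON f.store_key = s.store_key
    JOIN dim_date  d ON f.date_key  = d.date_key
    GROUP BY s.region, d.month
"""):
    print(row)
```

Because the structure is plain relational tables, any SQL-speaking tool in the shop can run this same query, which is exactly the flexibility argument above.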

Scott Howard’s Answer: Who mandated that data marts must be relational? I often wonder, because many people, just like you, feel that data marts are the exclusive realm of the star schema. My definition of a data mart is a specific data warehouse customized and limited to the requirements of an individual business entity and tuned to its unique query tool usage. Cubes are data marts under this widely accepted definition; therefore, cubes cannot replace marts because they already are marts. Your enterprise data warehouse should source all of your data marts, be they relational schemas, multidimensional cubes or mining bases. Yes, by our original definition, mining bases are data marts too.

I think the next logical question is: when should I implement cubes, and when should I implement relational star or snowflake schemas? I’ll answer my own question with generalizations, so realize that some of your specific business needs may lead to exceptions to my recommendations. Most popular multidimensional analysis products and tools can access both cubes and relational schemas with equal ease and function, so the tool of choice is seldom the deciding factor. It’s the dynamics of the source data, combined with the characteristics of the cube maintenance process, that will help you decide whether a cube or a relational schema is best for a given mart.

Cubes are generally expensive to maintain and refresh. Volumes of data need to be loaded, followed by extensive aggregate calculation and population. Data additions may mean that most aggregates need to be recalculated and repopulated, and restating the base measures means that many or all aggregates need to be redone. This is expensive and time-consuming, and a factor that dissuades one from implementing cubes. Many vendors, including the industry leader Hyperion, now have features like partitioning to help ease this pain, but that too requires careful planning and professional support. Cubes are also autonomous and cannot be easily combined or dynamically joined to other cubes at run time. To net it out, if your underlying data is dynamic or you need to support a dynamic ad hoc query set, cubes may not be your best option.
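To see why a refresh can touch so much, note that a cube typically pre-computes one aggregate grouping for each subset of its dimensions, so restating a base measure can invalidate every one of them. This small Python sketch (dimension names invented) just counts those groupings for a hypothetical four-dimension cube:

```python
from itertools import combinations

# Sketch of why cube refreshes are expensive: one pre-computed
# aggregate grouping per subset of dimensions means a restated
# base measure can force all of them to be redone.
dimensions = ["date", "product", "store", "customer"]  # illustrative

aggregate_groupings = [
    combo
    for r in range(len(dimensions) + 1)
    for combo in combinations(dimensions, r)
]
print(len(aggregate_groupings))  # 2**4 = 16 groupings to recalculate
```

Each added dimension doubles the count, which is why partitioning and incremental-refresh features matter so much in practice.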

Relational schemas also need to be scrutinized. They are more appropriate for the dynamics of measure restatements or unpredictable query demands, but they will not provide the speed-of-thought runtime performance demanded by many users. That reason alone may drive many to tolerate expensive cube maintenance. Relational vendors are adding multidimensional features that are starting to close the performance gap between multidimensional OLAP (the cube) and relational OLAP (the star or snowflake). The obvious features that help are more advanced built-in statistical functions. Recent advancements by all the major relational vendors include materialized query tables or views, where the most frequently accessed aggregates are calculated and stored in persistent database objects and automatically used when any query requests those aggregates. This can lead to very significant performance boosts. Another new technology, introduced by IBM in its DB2 UDB product, is multidimensional clustering (MDC). It is similar to the block indexes provided by cube vendors in support of their sparse cubes but has been adapted to the pure relational model. These indexes require minimal maintenance and provide very significant performance improvements for most relational OLAP queries. Combine MDC with materialized objects and you have a formidable alternative to cubes. I’m not suggesting these replace cubes, as the performance advantages of cubes are still substantial; those advantages are just not as overwhelming as they once were.
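As a rough illustration of the materialized query table idea, the sketch below emulates one in Python with an ordinary SQLite table, reusing the invented schema from the earlier example. SQLite has no materialized views, and real engines refresh these objects and reroute matching queries to them automatically, so treat this only as the shape of the technique:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Same invented star schema as the earlier sketch, trimmed down.
CREATE TABLE dim_date  (date_key INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE fact_sales(date_key INTEGER, store_key INTEGER, revenue REAL);

-- Emulated materialized query table: persist the hot aggregate so
-- reports read a few summary rows instead of scanning the fact table.
-- Here it must be rebuilt by hand after each load; real engines do
-- the refresh and query rerouting automatically.
CREATE TABLE mqt_revenue_by_region_month AS
SELECT s.region, d.month, SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_store s ON f.store_key = s.store_key
JOIN dim_date  d ON f.date_key  = d.date_key
GROUP BY s.region, d.month;
""")

# Reports now hit the small summary table directly.
print(conn.execute("SELECT * FROM mqt_revenue_by_region_month").fetchall())
```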

Mike Jennings’ Answer: From my implementation experience, there are several issues to consider when deploying OLAP cubes to a user population. First, what is the volume of data that needs to be available in the cube? Transformation processing time to create and update the cube starts to increase substantially once the size of the cube goes above five gigabytes. You need to examine whether the data refresh frequency (daily, weekly, monthly, etc.) and the data transformation processing window are sufficient to meet business needs. Depending on your infrastructure, the cube may be built on one server and have to be distributed to multiple servers for user access. Next, what type of security needs to be applied to the cube? Role-level security requires very little overhead to apply to the cube; incorporating row-level security for each user accessing the cube can increase transformation processing time by 200 to 500 percent, making it even harder to meet processing window schedules. Finally, what is the size and composition of the user population receiving access to the cube? Successfully navigating, analyzing and drilling into a cube typically requires additional training and support for end users because the paradigm is new. Some of this complexity can be mitigated through view access to the cube, which provides an experience similar to query and reporting products.
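The sizing and processing-window questions lend themselves to a quick back-of-envelope check. In this Python sketch every figure is an invented placeholder, not a benchmark:

```python
# Back-of-envelope cube sizing and refresh-window check.
# All figures are invented placeholders, not measurements.
dimension_cardinalities = [1095, 10_000, 500]  # e.g. days, products, stores
sparsity = 0.02          # assumed fraction of cells actually populated
bytes_per_cell = 8       # assumed storage per populated cell

cells = 1
for cardinality in dimension_cardinalities:
    cells *= cardinality

size_gb = cells * sparsity * bytes_per_cell / 1e9
print(f"estimated cube size: ~{size_gb:.1f} GB (vs. the ~5 GB pain point above)")

# Row-level security can multiply transform time 3x to 6x (200-500%),
# which may blow the nightly processing window.
base_build_hours = 2.0   # assumed baseline build time
print(f"build with row-level security: ~{base_build_hours * 4:.0f} hours")
```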

Chuck Kelley’s Answer: It all depends on your definition of data mart. I think of data marts as subsets of data populated from the data warehouse. They can be cubes or star schemas; cubes and star schemas are just different ways the data is stored. Which one you choose will depend on the tool your user community uses.

Joe Oates’ Answer: In my opinion, having traditional data marts based on a relational database is the best way to go. This approach will provide the most flexibility in meeting future changes in user requirements. Discarding the data marts will lock you into a particular vendor and the limitations of that vendor’s products. Additionally, in my experience, changing and adding functionality in cubes is generally much more difficult than doing the same in a relational database.

I suggest that you create a matrix of critical issues and rate the cube and data mart approaches against it. After you do this, list the risks associated with each approach.
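One way to carry out this suggestion is a simple weighted scoring matrix. In the Python sketch below, every criterion, weight and score is a placeholder to replace with your own issues and ratings:

```python
# Hypothetical decision matrix: rate each approach (1 = poor, 5 = strong)
# on weighted criteria. All criteria, weights and scores are placeholders.
criteria = {  # name: (weight, cube score, relational mart score)
    "query performance":  (0.30, 5, 3),
    "ad hoc flexibility": (0.25, 2, 5),
    "maintenance cost":   (0.20, 2, 4),
    "tool independence":  (0.15, 1, 5),
    "end-user ease":      (0.10, 4, 3),
}

cube_total = sum(w * cube for w, cube, _ in criteria.values())
mart_total = sum(w * mart for w, _, mart in criteria.values())
print(f"cube: {cube_total:.2f}, relational mart: {mart_total:.2f}")
```

Pair each score with its risk list so the matrix records not just which approach wins, but what could go wrong with it.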

Clay Rehm’s Answer: The cubes you develop with your BI tool will be usable only by that tool. What about other tools? If you have other tools in your organization, the cubes most likely will not be accessible to them. You need flexibility in your design so that any tool can access your database. I encourage you to build your data mart on a relational database, and then build cubes from that data mart using the tools that will access them.
