Today's competitive business environment, combined with unprecedented levels of affordable computing power, has had a significant impact on business systems. This competitive business environment creates a need for complex analyses of increasing volumes of business data. Today's sophisticated business systems provide large amounts of raw data to support such analyses. One of the technologies resulting from this need to turn vast volumes of business data into meaningful management information is data warehousing. Another is on-line analytical processing (OLAP).
The concept of storing data that is optimized for decision support and keeping it separate from the operational data from which it is derived offers many advantages. Data can be managed to support fast, multidimensional queries. Derived metrics can be effectively computed, wherever possible, and the integrity of the data can be assured as part of the cleansing process that it goes through while being loaded into the warehouse.
OLAP can be thought of as a technology for knowledge workers. The term OLAP was first coined by Dr. E.F. Codd, the inventor of the relational model, to describe a genre of software that analyzes business data in a top-down hierarchical fashion. It is significant to note that Dr. Codd cautioned that "relational databases were never intended to provide the very powerful functions for data synthesis, analysis and consolidation that are being defined as multidimensional analysis."1 OLAP tools present data in a multidimensional fashion that non-technical users find intuitive. Users are able to navigate through their business data in a simple way, gaining insights that they simply have not been able to gain using previous approaches, ranging from printed reports to report writers to query tools.
In order to understand the significance of OLAP, it is critical to understand the multidimensional nature of today's business data. Consider the example of the vice president of sales of a retail company. He wants to analyze how products are selling across his retail outlets. He also needs to be able to spot sales trends, identify sales strategies that are working (and those that aren't), learn all he can about how products are selling and what he could do to improve sales and profitability.
The vice president of sales' key concern is sales by product. But for each product, he needs to know unit sales, dollar sales, discounts allowed and, perhaps, certain other key statistics. He also wants to know this information on a daily basis--by region, by salesperson and by channel. What we have identified is a six-dimensional model. The dimensions are accounts (often also called variables or measures), products, time, channel, region and salesperson.
One of the key features of OLAP technology is that the user is able to navigate through the data in any way that makes sense, without knowing in advance what the navigation route might be. Consider again the case of our hypothetical vice president of sales. He first looks at the total sales by region, comparing this month with the same month last year. He notices that month on month sales are flat, despite the fact that he has opened three new stores in the last three months. The first thing that he does is to drill down to look at sales by store, still comparing this month with the same month last year. It quickly becomes apparent that most of the stores have gained a healthy seven percent over last year. He is able to determine this by asking his OLAP tool to show the derived metric of this year's gain as a percentage of last year and ranking the stores based on this new derived metric. But the bottom of the ranking shows that certain stores have actually declined. By drilling down by region, he is able to ascertain that sales from the new stores appear to be cannibalizing sales from existing stores. Equally, his inquiries may have shown that the decline in sales was caused by certain product lines that were selling poorly. In this case, he would have navigated the data differently to reach his conclusion.
One of the design objectives of the multidimensional server is to provide fast, linear access to data regardless of the way the data is being requested. The simplest request is a two-dimensional slice of data from an n-dimensional hypercube. The objective is to retrieve the data equally fast, regardless of the requested dimensions. In practice, such simple slices are rare; typically the requested data is a compound slice in which two or more dimensions are nested as rows or columns. Put another way, the goal is to provide linear response time regardless of where the data is being retrieved from in the hypercube.
The second role of the server is to provide calculated results. By far the most common calculation is aggregation; but more complex calculations, such as ratios and allocations, are also required. In fact, the design goal should be to offer a complete algebraic ability where any cell in the hypercube can be derived from any of the others, using all standard business and statistical functions, including conditional logic.
Most OLAP servers achieve fast response to computed results by computing them in advance. This technique can be very effective but will not work well when the size of the fully calculated model is thousands of times greater than the volume of the input data. This may sound unlikely but, in fact, this can easily happen, particularly where the number of dimensions is large and the hierarchies in each dimension are deep.
This precalculation technique is not effective when the input data is being updated in real time, such as in interactive budgeting or financial reporting applications. This is because analysts want to see the effect of changes immediately. Forcing a batch-mode recalculation of some or all of the hypercube is very disruptive to the process.
Balancing the tradeoffs of supporting complex calculations, providing fast access to data, minimizing data explosion and supporting real-time updates has led to three major architectural approaches to OLAP: ROLAP, MOLAP and RTAP.
Multidimensional Views of Relational Data
Some vendors take the view that all data should be stored in relational databases. They provide a multidimensional view of data that is stored in relational tables. In order to provide this multidimensional view, all of the relational OLAP vendors store the data in a star or snowflake schema. The most common form of this schema stores the data values in a denormalized table known as the fact table. One dimension is selected as the fact dimension, and this dimension forms the columns of the fact table. The other dimensions are stored in additional tables with the hierarchy defined by child/parent columns. The dimension tables are then relationally joined with the fact table to allow multidimensional queries.
The data is retrieved from the relational database into the client tool by SQL queries. Because SQL was designed as an access language to relational databases, it is not necessarily optimal for multidimensional queries. For example, SQL can perform more complex calculations across rows than across columns. This is not a deficiency of SQL as much as a reflection of the fact that the relational model was invented to overcome a number of problems associated with database management. One of the primary problems was maintaining database integrity and ensuring consistent data updates. By storing the data in relational tables, a single piece of data is stored in only one place. This ensures that the database is consistently maintained and that transaction updates can be performed in a fast and efficient manner.
Although fact data is indeed stored in a relational table and can be accessed using the RDBMS, to do so may present some surprises. In order to provide the multidimensional views of the data required by vendors that use a relational database, the data must be organized in the star or snowflake schema described earlier. This means that, in practice, the data must duplicated. There are many other good reasons why the data should be duplicated anyway (such as performance, summarization and organization of data into distinct time periods). Thus, the implication that using an RDBMS eliminates the need to duplicate the data is not valid. The vast majority of ROLAP applications are for simple analysis of large volumes of information. Retail sales analysis is the most common one. The complexity of setup and maintenance has resulted in relatively few applications of ROLAP to financial data warehousing applications, such as financial reporting or budgeting.
Physical Multidimensional Databases
The next two major OLAP architectures, MOLAP and RTAP, provide their own physical multidimensional databases. These architectures assume that multidimensional models--because of their unique characteristics of sparsity and the need for potentially complex derived results--need an architecture all their own. Some of these architectures actually predate the relational model.
Although all software vendors are proud of their architectures and believe they are uniquely efficient (several vendors have actually patented their algorithms), in practice there are some common techniques that many vendors use. These techniques include:
- Mapping out zero values by compression,
- Using indexes of pointers to compressed arrays of values, and
- Sophisticated caching algorithms.
Before discussing these two architectures, it is important to understand how sparsity causes database "explosion" and its implications.
It is not immediately obvious that a fully calculated hypercube is usually dozens of times--and, in some cases, many thousands of times--larger than the raw input data. Some would argue that this is not a problem because disk space is relatively cheap. Despite this, consider what happens when a 200MB source file explodes to 10GB. The database no longer fits on any laptop for mobile computing. Also, consider what happens when a 1GB source file explodes to 50GB. The database no longer fits comfortably on typical desktop servers. In both cases, the time to precalculate the model for every incremental data change will likely be many hours. So, even though disk space is relatively cheap, the full cost of precalculation can be unexpectedly large.
The MOLAP approach is to provide a database specifically designed for multidimensional data and then precalculate all derived values. The theory is that precalculating all derived values will result in very fast retrieval times and that data explosion does not matter since disk space is relatively cheap. Retrieval times are typically one second or less, which is one of the reasons for the widespread acceptance of MOLAP.
Because derived values are precalculated and stored, when incremental changes to the data are made, a user must recalculate the database. This means the users must wait, typically many hours, until the calculation is finished. Some MOLAP engines provide the capability to recalculate only those derived values that were affected by the change. This cuts the calculation time on the order of 50 percent but still does not lend itself to applications that must frequently change the data.
Real-Time Analytical Processing (RTAP)
RTAP takes the approach that derived values should be calculated on demand, not precalculated. This avoids both the long calculation time and the data explosion that occur with the precalculation approach used by most OLAP vendors. In order to calculate on demand quickly enough to provide fast response, data must be stored in memory. This greatly speeds calculation and results in very fast response to the vast majority of requests.
Another refinement would be to calculate numbers when they are requested but to retain the calculations (as long as they are still valid) so as to support future requests. This has two compelling advantages. First, only those aggregations that are needed are ever performed. In a database with a growth factor of 1,000 or more, many of the possible aggregations may never be requested. Second, in a dynamic, interactive update environment (budgeting, for example), calculations are always up to date. There is no waiting for a required precalculation after each incremental data change.
One may question whether a multidimensional application of any size can fit into memory. The answer is yes, for several reasons. First, all multidimensional databases store each number very efficiently, generally 10 to 15 bytes per number. As shown in Figure 1, a server with 500MB of memory can store approximately 45 million input numbers.
|Figure 1: Real-Time Applications|
|Amount of RAM (MB)||Populated Cells|
It is important to note that because RTAP does not precalculate, the RTAP database is typically 10 to 25 percent the size of the data source. This is because the data source typically requires approximately 50 to 100 bytes per record, possibly more. Generally, the data source stores one number per record that will be input into the multidimensional database. Because RTAP stores one number (plus indexes) in approximately 12 bytes, the size ratio between RTAP and the data source is typically between 12/100=12 percent and 12/50=24 percent.
The second reason applications generally fit into memory when using a RTAP architecture is due to the very high sparsity previously mentioned. With sparsity typically 99 percent or greater in models with five or more dimensions, the 45 million actual values that a .5GB server can accommodate represents a model with a theoretical volume of more than four billion cells. Few financial multidimensional models approach these data volumes. A large financial model typically consists of a few million cells. This RTAP avoids the long calculation times and data explosion associated with other approaches. This is especially important for dynamic applications in which immediate access to the results of incremental data changes is critical. There is one other consideration that deserves attention when choosing an OLAP architecture for a given application.
Hypercube vs. Multicube
Just as in a relational database, where the data is typically stored in a series of two-dimensional tables, the data in an OLAP database consists of values with differing dimensionality. For example, consider the case of a bank that wants to analyze profitability by customer and product. Certain revenues may be dimensioned by customer, product, time, geography and scenario. But costs such as back-office salaries might be dimensioned by cost type, cost center and time. Thus, a real world multidimensional database must be capable of storing data that has differing dimensionality in a single logical database that presents a single conceptual view to the user.
It is not acceptable to insist that all data in the model be dimensioned by the same dimensions. Putting data into an element called "no product" because the data item happens not to be dimensioned by product merely confuses the user. Products that cannot support a multicube architecture with the ability to logically relate to cubes forces unreasonable compromises on the developer of sophisticated models. Not only does it make the access to the data less logical, but it can have serious adverse effects on database size and retrieval efficiency.
ROLAP products are complex and, therefore, costly to set up and maintain. They do not provide the performance available from the other OLAP architectures. For these reasons, most ROLAP applications are for analysis of large volumes of data in which the close integration with the RDBMS is an advantage.
MOLAP products use precalculation to achieve impressive retrieval speed. This approach works well for static and mostly static applications, but the resulting long calculation times make them unsuitable for dynamic applications that require access to data soon after incremental updates. In addition, the data explosion caused by the precalculation approach may make MOLAP inappropriate for large applications with more than five dimensions.
RTAP is the optimal architecture for dynamic applications that require frequent input of data or analysis of what-if scenarios. These type of applications include financial reporting and consolidation, budgeting, forecasting and product profitability. RTAP is also the optimal architecture for supporting a mobile or distributed environment. Because RTAP does not precalculate, the resulting small data footprint makes it very suitable for laptops and even distributing applications via e-mail. Finally, RTAP's extreme scalability makes it well suited for applications with more than five dimensions where MOLAPs would explode the data beyond a manageable size.
1 Codd, Dr. E.F. and Codd, S.B. Providing OLAP (On-Line Analytical Processing) to User-Analysts: An IT Mandate, 1994.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access