To go where no data warehouse has gone before ...
The exploration warehouse (EW) was introduced in 1998 as a new component to the Corporate Information Factory (see Bill Inmon's article, "The Exploration Warehouse," DM Review, June 1998). Its purpose is to provide two unique capabilities that are generally very difficult to create a prototyping environment and a safe haven for the exploration of data and ad hoc analytical processing. Although the EW is a valid and necessary part of the Corporate Information Factory (CIF), in practice a difficulty arose due to the limited technology to support these capabilities.
To facilitate these two capabilities, the exploration warehouse must be based on technology that can deal with the unknown in a DSS environment. For prototyping, it must be flexible enough to handle the constant probing by designers and analysts. For exploration and ad hoc analytical processing, it must be able to handle the unstructured processing of the explorer with suitable performance. In either case, the nature of the exploration warehouse is temporary and transitory. It is rare that the exploration warehouse remains a permanent structure, and it is constantly being constructed and reconstructed. The question then is what technologies can be used for this unusual but extremely valuable component of the CIF?
This article describes four technologies that show great promise in building and using the exploration warehouse. They are:
- Encoded vectors
The pros and cons of each technology will be discussed along with examples of vendors having these technologies. It is important to note that many of the vendors discussed use a combination of one or more of these data structures, but we are highlighting the one that makes them unique.
Exploration Warehouse Data Structures
To accommodate the "throw away" nature of the exploration warehouse, a technology must support easy and speedy development. Second, the technology must be able to handle large amounts of data with reasonable response times. This dichotomy has been a problem in the past with traditional relational databases. Tearing down and rebuilding these data structures is no easy feat. Therefore, new and different structures in support of exploration are needed. When assessing the usage of one of the four discussed here, you should keep in mind the following characteristics. The technologies should:
- Be able to load data quickly,
- Be able to accept design changes handily,
- Provide superb response times to complex and massive queries, and
- Be reasonably priced.
The Four Technologies for Exploration Warehouses
Two of the technologies use data compression technologies to accomplish the stated requirements of the exploration warehouse. One is an in-memory database and another uses a new form of indexing strategy.
Token-Based Data Structures
A token is a code or a symbol used in place of an actual data value. Token-based data structures convert the data into a series of these tokens and then store these tokens rather than the actual data itself. For example, redundant data fields such as state, name, city or date fields have tokens created that replace the actual values. A bit-map index is then created for each column of tokens. The volume required for storing the records is significantly decreased. Not only can you store massive amounts of data in a relatively small amount of disk, but also the reduced size enables the complex, ad hoc type of queries to have rapid response times because the data can be stored in main memory. An example of a token-based data structure is Nucleus from Sand Technology Systems (www.sandtechnology.com).
Token-based data structures are good for prototyping and exploring because the users can rapidly understand their data and make a quick determination whether the design of the data types is optimal for their usage. Due to the nature of the data compression, the lack of tuning required and the ensuing fast response times, users can expeditiously see trends or patterns in the large amounts of data a feat critical to true exploration.
There are size limitations to this form of data structure; the upper limits appear to be about 200 to 250GB of raw data. Also, the algorithms to "tokenize" the data must be developed and may slow down the loading of data into the database and in the translation process. Finally, because this form of data structure is not tunable, it is not appropriate for commonly used or routine queries.
In-Memory Data Structures
The second technology is in-memory resident data structures. The product WhiteCross Data Exploration from WhiteCross Systems is an example of this technology (www.whitecross.com). Their technology takes advantage of their proprietary hardware and relational database to create an environment in which the exploration data is stored in memory. The advantages of this approach are the extremely fast response times to queries and a very simple indexing strategy you create one bit-mapped and compressed index for the entire table.
This simplifies and speeds up the load process because you are updating a single index. Their massively parallel processing architecture has been built specifically for data exploration and can perform full table scans, joins, unions, merges and sorts rapidly. A final advantage to this type of exploration warehouse is that there is no limitation to the size of the data you store it is only limited by the available memory.
On the disadvantage side, you must be prepared to deal with the proprietary nature of their hardware and software. Secondly, RAM is not inexpensive! The price tag for large exploration warehouses can be high.
Bit-Mapped Data Structures
Bit-mapping indexes have been around for several years now. A good example of this technology is Dynamic Information Systems Corporation DISC (www.disc.com/home). Their OMNIDEX product uses a combination of bit-map indexing for low cardinality columns and inverted list indexing for high cardinality columns. These two form the basis for their multidimensional index. They also use an aggregation index to calculate aggregations dynamically. This is especially important to understand as it eliminates the need to create physical aggregation tables, thus reducing the overall disk requirements.
There are no extraordinary in-memory requirements with this technology, and it runs on all major relational DBMSs and flat files. These indexes greatly improve the performance of queries because the speed is dependent on the number of records found, not on the size of the database itself.
Usage of this technology causes a significant paradigm shift in the thinking of the technical people supporting it. It requires a return to a simpler time in that the need for snowflakes, aggregation tables, constellations and other complex designs used to improve performance is completely eliminated.
OMNIDEX works well for tables that contain less than 500 million rows of data. Above that size, you will need to rethink your partitioning and other DBMS maintenance issues.
Encoded Vector Data Structures
The encoded vector index (EVI) was introduced for DB2/UDB by IBM early this year and is an advanced form of bit-map indexing. The EVI is similar to the bit-mapped index, except that each distinct occurrence will have an entry in the EVI. Stored with the EVI is a symbol table with the row ID of the first and last occurrence of each value and the number of occurrences. An EVI can be four times smaller than a normal b-tree radix index, which can be significant if disk space is an issue. The EVI is very compact and usually scans faster than the traditional DB2 index scan. EVIs also work well for conditional range queries.
EVIs do not have the same size limitation of bit-map indexes and are more scalable as a structure. Because of an 8:1 advantage in the favor of bit-map indexes due to the size of an EVI, it works for columns with a large number of recurring distinct values or where range-checking queries are significant. EVIs can be built in parallel at about 10 to 30 percent faster than normal radix indexes and can be built using multiple columns. The right combination of b-tree radix and encoded vector indexes is very important to gain an efficient access method to the data via the optimizer.
EVIs do have a size limitation based on DB2/UDB, so partitioning of the data may be required with larger data sets. It is also relatively new technology and may require benchmarking for performance, size and coordination with other indexes.
The exploration warehouse has now become an important part of the overall Corporate Information Factory architecture for both prototyping and unstructured, ad hoc queries. We are fortunate today to have multiple technologies supporting the creation this structure. The four technologies highlighted in this article have proven to be suitable data structures for exploring, having the proper characteristics of loading the data quickly, yielding remarkable response times, accommodating changes easily and being priced within the range of most IT budgets. You may now choose from token-based, in-memory, bit-map or encoded vector technologies for your exploration warehouse. Each has its pluses and minuses so choose the one that best fits your existing technology and business solutions.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access