Shahla Butler would like to thank her colleague, Tom Loukas, for contributing this month's column.
Athorough understanding of the database management system (DBMS) is critical to a successful data warehouse implementation. The role of the DBMS is to provide a robust storage mechanism that can support efficient storage and retrieval of large amounts of data. This column presents some of the major database management considerations for data warehousing and business intelligence.
The first consideration involves selecting the DBMS and the database server platform. There are two categories of DBMS that can be considered for data warehousing and data mart application s. You may consider a standard relational database or choose a database specifically optimized for analytical processing. This second category includes multidimensional databases and column-oriented relational databases. Many installations use a combination of the two, storing the detailed data in a relational database and using a multidimensional database for certain data marts with well-known analysis patterns. Multidimensional databases usually yield much faster query performance at the expense of longer loading times, size limitations and less flexibility in handling new access requirements. Whichever DBMS or combination of DBMSs you choose, evaluate the server platform at the same time you are evaluating the DBMS to see how well they perform together. A DBMS architected for a shared-everything symmetric multiprocessing environment might not perform as well on a clustered or massively parallel server as a DBMS specifically designed for those platforms. The reverse may also be true .
Also ensure that the DBMS you are considering has the necessary properties to support the type of queries generated by the suite of BI tools that will be employed. Some often desirable properties include sophisticated optimization, query parallelism, mechanisms to partition large tables and resource governors. Some RDBMS vendors are now building summary table functions (i.e., facilities to monitor the need for creating such tables) and query rewrite capabilities that detect when a request should be processed against a summary table instead of an atomic table.
When considering disk storage requirements, look at space needs beyond that of the raw data. When you add up the space required for indexes, mirroring, temporary tables, sort space, summary tables and staging areas for loading, a figure of three to four times the space required by the raw data is not uncommon.
Once the platform and DBMS are chosen, ensure that experts in the selected DBMS are available to the project t eam. The combination of very large databases, ad hoc access and tight load windows that are typical of business intelligence applications demand a high degree of skill in the DBMS to ensure a reasonable degree of performance.
The next consideration is to allow time for performance prototyping during the design phase. Even experts in a particular DBMS may run into unexpected performance problems when dealing with large databases. I have seen cases where weeks of prototyping were required to optimize the load process. In one case, many different combinations of indexing, partitioning and load strategies had to be tested to determine the best method to achieve a timely, nondisruptive load process; but the result was that total load time was reduced dramatically and the need for massive index rebuilds on a regular basis was eliminated.
An often overlooked consideration is that the purge, load and backup process should all be considered during the database design. If you wait until af ter the data warehouse is in production to design the purge process, you may find that the options for accomplishing the purge are limited. Many installations have shown that using time-based partitioning on large tables can facilitate purge and backup processes.
Finally, determine whether the business intelligence or knowledge management activities will involve the analysis of unstructured large objects such as text documents or images. The advent of text mining technology and other sophisticated processing on large objects presents new challenges for storage, retrieval and indexing technologies. Should those objects be managed using a relational DBMS? If so, should they be stored within the RDBMS or externally with linkage to the RDBMS for certain operations? If stored in an RDBMS, certain processes such as logging may be optional or prohibited for objects over a certain size. If logging is not used for those objects, what recovery mechanisms are needed? If not stored in an RDBMS , what mechanisms does the chosen storage solution provide in terms of backup, recovery, update integrity, security, indexing, performance monitoring and tuning, etc.? These are among the data management issues being tackled by those who are extending their business intelligence activities to include the automated analysis of unstructured data.
Selecting and designing the physical data storage mechanisms needed for effective business intelligence and knowledge management activities are very involved processes that require a great deal of specialized experience and expertise. This column has presented just a few of many areas that need to be addressed to ensure that the data storage platform is capable of adequately supporting the needs of the BI/KM users and their tool suites.
Tom Loukas is a senior principal with American Management Systems, Inc. and the director of the Database Management Laboratory at the AMS Center for Advanced Technologies (AMSCAT). Loukas can be reached via e-mail at email@example.com.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access