For data warehousing clients with multiterabyte, high-end decision support requirements demanding superior price/performance across large numbers of users, massively parallel processing (MPP) will remain an operational necessity. Although symmetric multiprocessing (SMP) is raising the price/performance bar and pushing the crossover point upward, SMP and MPP will coexist in a gray transition area at the high end, and clients will benefit from the choices and competition. Many more data warehousing clients will be able to make use of standard SMP approaches than previously, but those combining multiterabyte volumes with high-performance requirements (active data warehousing) will still require a special-purpose data warehouse server.
Innovations in memory-to-CPU interconnection, such as crossbar switching, have reduced memory contention in SMP designs and have lowered the coordination costs of the hybrid clustered-SMP approach. However, slope-of-one linear scalability across hundreds of processors still requires an MPP database. The central trade-off between single-image (SMP) and parallel-processing (MPP) data warehousing servers is between ease of administration and scalability. Eventually the coordination costs of refreshing and synchronizing cache drive the scalability of SMP below one: the addition of another processor does not produce a full processor's quota of work and throughput. A similar dynamic can also affect shared-disk clusters where, absent an abstraction layer to map disks to nodes, a global lock or database synchronization mechanism is needed to preserve data integrity. In contrast, while MPP scales linearly to hundreds of nodes, troubleshooting that many processors can be an issue for administrators trained in the SMP world, and more moving parts reduce the mean time between failures (MTBF). Advocates of SMP versus MPP (e.g., HP, Sun, IBM versus Teradata, IBM) argue that the performance cost of data movement through the high-speed switch (a defining characteristic of clustered hardware and MPP databases) can be significant, and data placement remains a critical success factor. This is true, but it is the required trade-off for high-performance results on complex queries against large data volumes. Further trade-offs include:
- MPP offers superior scalability of computing power and throughput; SMP offers the best price/performance, especially below the gray area in Figure 1.
- MPP, regardless of the DBMS, has an issue of leaving MIPS unused throughout the day. (Consider a perfectly distributed MPP database. Now run a query that joins two tables on "date." Redistribution occurs, more data get hashed to certain nodes, and a real-time imbalance results.) SMP overcomes this imbalance if the software is able to dynamically allocate parallel tasks within a single node (DB2 and Oracle intrapartition parallelism).
- MPP has more "moving parts" and hence potentially worse MTBF numbers, a drawback diminished in the case of true fault-tolerant architectures such as Compaq Himalaya (with NonStop SQL), IBM DB2 EEE with HACMP, or Teradata (NCR).
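The redistribution imbalance described in the second trade-off above can be sketched in a few lines. This is a hypothetical illustration, not any vendor's actual hashing scheme: when a join key such as "date" is heavily skewed, all rows with the hot value hash to a single node, leaving the other nodes' MIPS idle.

```python
# Hypothetical sketch of MPP redistribution skew (illustrative data only).
from collections import Counter

NODES = 8

def node_for(key):
    # Illustrative hash placement: every row with a given key value
    # is redistributed to the same node.
    return hash(key) % NODES

# 80% of the rows share one hot date, so they all land on one node.
rows = ["2001-06-30"] * 8000 + [f"2001-06-{d:02d}" for d in range(1, 21)] * 100

load = Counter(node_for(date) for date in rows)
busiest = max(load.values())
print(f"busiest node handles {busiest} of {len(rows)} rows; "
      f"the remaining nodes sit largely idle")
```

Because at least 80 percent of the rows carry the hot key, at least that share of the join work lands on one node regardless of how the other dates hash, which is the real-time imbalance the bullet describes.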
Figure 1: SMP to MPP Crossover for DW Workloads
As a result of these trade-offs, a move toward clustered architectures is occurring with SMPs as the building blocks, resulting in a hybrid configuration. (The need for a high-speed switch and the implied complexity mean that the hybrid approach is a special-purpose solution.) This allows clients to leverage the price/performance, reliability and real-time memory and CPU management of the SMP world with the superior scalability promise of the MPP approach. Clustered approaches with shared disks (for example, Oracle9i Real Application Clusters) will be preferred by clients for high availability, and clustered approaches with pure shared-nothing databases will be preferred for scalability and performance, pure and simple.
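The earlier claim that cache-coordination costs eventually push SMP scalability below a slope of one can be modeled numerically. The sketch below uses a Universal Scalability Law-style formula with illustrative contention and coherency coefficients; these are assumptions for demonstration, not measured vendor figures.

```python
# Minimal sketch of sub-linear SMP scaling (illustrative coefficients,
# not vendor benchmarks).

def smp_throughput(n, contention=0.03, coherency=0.0005):
    """Relative throughput of n processors. The contention term models
    serialized access to shared resources; the coherency term models
    pairwise cache refresh/synchronization costs, which grow with n."""
    return n / (1 + contention * (n - 1) + coherency * n * (n - 1))

for n in (1, 2, 8, 32, 128):
    print(f"{n:>3} CPUs -> relative throughput {smp_throughput(n):7.2f}")
```

Under these assumed coefficients, the second processor already adds less than a full processor's quota of throughput, and the marginal gain keeps shrinking as processors are added, which is the dynamic that motivates MPP for the largest configurations.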
In general, those installations that favor consolidation on a single SMP hardware/database platform will look to leverage the new 64-bit non-clustered servers from Sun, HP and IBM with a standard, non-parallel relational database, such as IBM's DB2 UDB or Oracle8i/9i. Those enterprises with data volumes in excess of 3TB (and growing) and with complex, high-performance active data warehousing requirements will require special-purpose database servers, such as Teradata (on NCR WorldMark hardware) or IBM's Regatta with the SP framework and DB2 EEE. (For clients already on a special-purpose data warehouse, such as Red Brick or Informix XPS, the IBM acquisition should be a source of assurance. Those clients can not only rely on continued IBM support for the products, but can also look forward to future enhancements.)
Special situations do arise that require a dedicated data warehouse database. The bar on those situations continues to rise as SMP processors handle terabyte volume points. However, the situations still exist. These include:
- Volume points above 3TB of raw data.
- A service level agreement (SLA) requiring rapid responses in real time.
- Significant volumes of update activity to the data warehouse concurrent with inquiries.
Enterprises will need to understand the data profile of the data warehousing application in question – especially with reference to the second and third situations above – and trace their requirements to the particulars of the available platforms. In general, multiterabyte volumes, high-performance SLAs and active data warehousing will continue to require special-purpose decision support servers.
Lou Agosta is an independent industry analyst in data warehousing. A former industry analyst at Giga Information Group, Agosta has published extensively on industry trends in data warehousing, data mining and data quality. He can be reached at LAgosta@acm.org.