
Data Warehousing Infrastructure: Linux Clusters for the Enterprise

  • December 01 2004, 1:00am EST

Gone are the days when your business could justify a data warehouse based on ROI alone.

Traditionally, businesses needed a large initial investment to build a data warehouse that satisfied the requirements of a growing enterprise and generated the desired return on investment (ROI). As long as the projected ROI covered the cost, even capital-intensive proprietary hardware systems and large SMP servers were not considered a significant expense. But recently, a tough economy has forced businesses to look for additional value from their technology investments. That's why many IT and business managers today focus on total cost of ownership (TCO), rather than solely on ROI, when defining what makes a data warehouse project successful.

Linux clusters are emerging as the de facto choice for providing enterprise-level scalability, performance and availability for data warehousing solutions at a low cost.

The Pillars of Data Warehousing

The pillars of a successful data warehouse infrastructure are performance, availability, scalability and cost. A crack in any one of those pillars can put your data warehousing project in jeopardy. Performance must be good enough to handle the throughput of large numbers of concurrent, typically very complex, queries on top of a rapidly growing data warehouse - without forcing users to waste hours waiting for queries to execute. Availability must be 24x7 - regardless of hardware or software failures - to avoid costly downtime and to meet the service level objectives that users expect from their transaction processing systems. Scalability is critical to any business that wants to grow, but the uncertainty of the business environment makes it difficult for managers to predict data warehouse requirements. They should implement a flexible infrastructure that can quickly and easily expand on demand.

Businesses want all of this at a fraction of the cost of expensive proprietary infrastructure solutions, and they don't want to have to justify large upfront investments for IT capacity that may not be utilized immediately. Additional costs such as managing infrastructure, accessing skilled resources and training should also be kept low.

Emergence of Enterprise Linux Clusters

Linux clusters can help businesses build a high performance, reliable and scalable enterprise data warehouse infrastructure capable of scaling to handle tens of terabytes of data at a dramatically lower cost than that of proprietary solutions. Businesses of all sizes can benefit from the many advantages of an enterprise data warehouse built on low-cost Linux clusters.

There is hardly any need to explain the price/performance benefits of Linux systems to IT managers, as it is highly likely that they're already benefiting from Web, e-mail and file servers running on Linux in their data centers. Linux customers enjoy the low total-cost advantages of commodity hardware, such as competitive pricing, elimination of proprietary extensions and transferable skills.

While IT managers were initially reluctant to deploy mission-critical applications on Linux, those doubts are now a thing of the past with the availability of enterprise versions of Linux from leading vendors such as Red Hat and SUSE, and the support of Linux from major enterprise software and hardware vendors such as Oracle, HP and IBM. Contributions from these vendors have brought major improvements to the Linux kernel in the areas that matter to data warehouses: I/O throughput, memory utilization, scalability, reliability, manageability and clustering. A data warehouse can take advantage of Linux clusters to deliver the high performance, scalability, availability and low TCO demanded by today's businesses.

Scalability Requirements

The key requirement for scalability is that the data warehouse must be able to support larger volumes of data and more users by adding hardware - without degrading performance. While proprietary servers often support large numbers of CPUs (up to 60 or 72 CPUs per server), a single commodity server supports far fewer (up to 8 CPUs). Therefore, commodity servers running Linux require clustering to scale.

Clustered data warehouses can be broadly categorized into two types: shared-nothing architectures, where each server accesses its own set of disks and database files, and shared-everything architectures, where each server can access all available storage. The basic premise behind both cluster architectures is that scalability is achieved by permitting data warehouses to grow with the addition of new servers to the existing cluster.

Instead of "forklift" upgrades, where the incremental cost of scaling to a larger server can be very high, businesses can scale their data warehouse by incrementally adding low-cost servers to their cluster as demand grows. When increasing the size of an existing shared-nothing cluster, new storage and servers are typically purchased at the same time, and the existing database is reorganized so that it is distributed across the old and new servers. Shared-everything (shared-disk) architectures offer more flexibility for growth, because servers and storage may be added independently, whenever the need for either arises, without requiring database reorganization.
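The reorganization cost of growing a shared-nothing cluster can be illustrated with a small sketch. It assumes rows are assigned to servers by a simple modulo of the row key, which is an illustrative choice only; real shared-nothing databases use their own partitioning schemes and may move fewer rows. The function names are hypothetical.

```python
def owner(row_key: int, node_count: int) -> int:
    """Return the node that owns a row under simple modulo partitioning."""
    return row_key % node_count


def rows_moved(row_keys, old_nodes: int, new_nodes: int) -> int:
    """Count rows whose owning node changes when the cluster is resized."""
    return sum(
        1 for key in row_keys
        if owner(key, old_nodes) != owner(key, new_nodes)
    )


row_keys = range(1_000)  # stand-in for the rows of a fact table
moved = rows_moved(row_keys, old_nodes=4, new_nodes=5)

print(f"shared-nothing: {moved} of {len(row_keys)} rows change owner")
print("shared-everything: no rows move; the new node reads the same storage")
```

Under this naive scheme most rows change owner when a fifth server joins a four-server cluster, which is the redistribution work a shared-everything cluster avoids.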

Enter the Grid

Interesting new capabilities are being introduced to Linux clusters in the area of grid computing. A grid is essentially an enterprise-wide cluster supporting multiple applications. A key characteristic of a grid is provisioning, which allows businesses to add or remove CPUs as needed - even while the application is running. Database vendors are starting to support such capabilities. Databases can recognize new CPUs instantly and dynamically allocate workload to new nodes based on business priorities.

For example, an enterprise grid could automatically and dynamically add CPUs to its data warehouse during a heavy overnight load process or during periods of high query activity and remove those CPUs from the data warehouse when they are no longer needed. Because a newly assigned node must be able to work on any part of the data immediately, without first redistributing the database, this kind of provisioning requires a shared-everything architecture. One of the promises of Linux running on commodity servers is that it makes grid architectures a reality. The widespread availability of inexpensive industry-standard Linux servers enables companies to run entire enterprises on large grids. With some of the largest-scale Linux clusters in commercial grid computing, data warehouse applications are leading the adoption of grid computing.
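The overnight-load scenario can be sketched as a shared pool of nodes that applications borrow from and return to. This is a toy model under stated assumptions, not any vendor's provisioning interface: the `Grid` class, its methods and the node counts are all hypothetical.

```python
class Grid:
    """A hypothetical pool of identical nodes shared by several applications."""

    def __init__(self, total_nodes: int) -> None:
        self.free = total_nodes
        self.assigned = {}  # application name -> number of nodes it holds

    def provision(self, app: str, nodes: int) -> int:
        """Give up to `nodes` free nodes to `app`; return how many were granted."""
        granted = min(nodes, self.free)
        self.free -= granted
        self.assigned[app] = self.assigned.get(app, 0) + granted
        return granted

    def release(self, app: str, nodes: int) -> int:
        """Return up to `nodes` of the app's nodes to the free pool."""
        returned = min(nodes, self.assigned.get(app, 0))
        self.assigned[app] = self.assigned.get(app, 0) - returned
        self.free += returned
        return returned


grid = Grid(total_nodes=16)
grid.provision("warehouse", 4)   # normal daytime footprint
grid.provision("warehouse", 6)   # extra nodes for the overnight load
peak = grid.assigned["warehouse"]
grid.release("warehouse", 6)     # load finished; hand the nodes back

print(peak, grid.assigned["warehouse"], grid.free)
```

The sketch only tracks node counts. In a real grid the database would also have to start using the new nodes at once, which is why each node needs access to all of the storage.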


Performance

Performance of any system is driven primarily by its CPU power and I/O bandwidth. Commodity hardware systems, based on Intel architecture CPUs, deliver industry-leading price/performance. It is no longer necessary for businesses to spend millions on large proprietary data warehouse servers when superior complex query performance can be achieved on a much smaller budget using the flexibility of a Linux cluster. The TPC-H benchmark, a standard benchmark measuring complex database query performance, demonstrates this. A recent TPC-H benchmark on Linux clusters achieved an impressive performance of 22,387.9 QphH@3000GB with an unmatched price/performance of $93/QphH@3000GB (see note 1). Linux clusters have met the performance requirements of even the most demanding businesses.
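The two benchmark figures can be related with a little arithmetic. TPC-H reports price/performance as the total system price divided by the QphH result, so multiplying the quoted numbers gives a rough implied system price. The result is only approximate, because the $93 figure is a rounded value; the variable names are illustrative.

```python
qphh = 22_387.9        # composite queries per hour at the 3000 GB scale factor
price_per_qphh = 93.0  # US dollars per QphH, as quoted (rounded)

# price/performance = total system price / QphH, so invert it
implied_system_price = qphh * price_per_qphh

print(f"implied total system price: about ${implied_system_price:,.0f}")
```

This works out to roughly $2.08 million for the whole benchmarked configuration.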

High Availability

Data warehousing has evolved from systems primarily used for strategic planning into mission-critical applications used in daily operations. A failure in the data warehouse is no longer just an inconvenience to internal analysts. It has the potential to cause a costly inability to serve customers and manage internal operations. A data warehouse based on Linux can eliminate concerns about downtime and ensure 24x7 availability.

Clustering, the key enabler for scalability, is also a crucial component for availability. A single commodity Linux server may have a shorter mean-time-to-failure than a high-priced proprietary server; however, a cluster of Linux servers providing failover capabilities delivers 24x7 enterprise availability. Shared-everything architectures are ideal for high availability because all nodes equally share access to all disks. When a server in a cluster becomes unavailable, the remaining servers continue to function uninterrupted, automatically picking up the workload of the failed server. Also, the linear scalability model of Linux clusters eliminates guesswork about how much capacity is gained or lost when a node is added or removed. For example, if one node fails in an 8-node cluster, then the data warehouse system performs at seven-eighths of the capacity of the full cluster.
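The 8-node example generalizes to any cluster size. The short sketch below assumes the ideal linear scaling described above, with identical nodes and no failover overhead; the function name is hypothetical.

```python
from fractions import Fraction


def remaining_capacity(total_nodes: int, failed_nodes: int) -> Fraction:
    """Fraction of full-cluster capacity left after some nodes fail.

    Assumes identical nodes and perfectly linear scaling.
    """
    if total_nodes <= 0 or not 0 <= failed_nodes <= total_nodes:
        raise ValueError("need 0 <= failed_nodes <= total_nodes and total_nodes > 0")
    return Fraction(total_nodes - failed_nodes, total_nodes)


capacity = remaining_capacity(total_nodes=8, failed_nodes=1)
print(capacity, f"({float(capacity):.1%})")  # prints "7/8 (87.5%)"
```

The same calculation shows why larger clusters degrade more gently: losing one node out of 16 leaves 15/16 of the capacity.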

TCO Advantage

The overriding reason for building a data warehouse on a Linux cluster is a significantly lower TCO. Cost is the driving consideration for every data warehouse implementation on Linux today. The savings come chiefly from using commodity, Intel-based servers, and further savings are realized when these servers are matched with low-cost storage solutions. An early adopter of an 8-node Linux cluster cited almost a 14x cost savings compared to their proprietary shared-nothing data warehouse solution. The benefits of low-cost storage are even more dramatic, with that same customer noting a 19x cost savings on the storage alone.

Additionally, replacing a failed server is cheap, fast and easy. The "pay as you grow" model of a scale-out approach enables businesses to avoid nearly instant obsolescence inherent in high-end systems and take advantage of Moore's law. Buying a server two years down the line offers hardware that is more powerful at lower cost. This model allows businesses to align their IT capital spending with their business growth. Business managers can do well by investing savings in their core business.

Linux Clusters are Right for Data Warehousing

A data warehousing infrastructure built on a Linux cluster provides enterprise-level scalability, performance and availability at an unbeatable low cost. With Linux clusters, organizations can efficiently scale out operations and make their data warehouses available 24x7. Businesses can avoid investing in excess computing power at premium prices and benefit from the lower cost of open-standards-based computing. Finally, customers can obtain all the benefits of a Linux cluster data warehouse without incurring excessive management overhead. The TCO of a Linux cluster-based data warehousing infrastructure is significantly lower than that of the same solution implemented on a large SMP box or a proprietary shared-nothing cluster. With all of these benefits, it is not surprising that enterprise data warehouses built on low-cost Linux clusters have emerged as an attractive alternative to proprietary data warehouse solutions.

1. Eight-node HP ProLiant DL740 cluster, each node with four Intel Xeon MP 3.0 GHz processors, running Red Hat Enterprise Linux AS 3 and Oracle Database 10g with Real Application Clusters; available March 2, 2004.
