DEC 1, 2004 1:00am ET

Data Warehousing Infrastructure: Linux Clusters for the Enterprise

Gone are the days when your business could justify a data warehouse based on ROI alone.

Traditionally, businesses needed a large initial investment to build a data warehouse that satisfied the requirements of a growing enterprise and generated the desired return on investment (ROI). As long as the projected ROI covered the cost, not even the capital-intensive proprietary hardware systems or large SMP servers were considered significant. But recently, a tough economy has forced businesses to look for additional value from their technology investments. That's why today many IT and business managers are focusing on total cost of ownership (TCO) rather than solely on ROI when it comes to defining what makes a data warehouse project successful.

Linux clusters are emerging as the de facto choice for providing enterprise-level scalability, performance and availability for data warehousing solutions at a low cost.

The Pillars of Data Warehousing

The pillars of a successful data warehouse infrastructure are performance, availability, scalability and cost. A crack in any one of those pillars can put your data warehousing project in jeopardy. Performance must be good enough to handle large numbers of concurrent, typically very complex, queries on top of a rapidly growing data warehouse - without forcing users to waste hours waiting for queries to execute. Availability must be 24x7 - regardless of hardware or software failures - to avoid costly downtime and to meet the service level objectives that users expect from their transaction processing systems. Scalability is critical to any business that wants to grow, but the uncertainty of the business environment makes it difficult for managers to predict data warehouse requirements. Managers should therefore implement a flexible infrastructure that can quickly and easily expand on demand.

Businesses want all of this at a fraction of the cost of expensive proprietary infrastructure solutions, and they don't want to have to justify large upfront investments for IT capacity that may not be utilized immediately. Additional costs such as managing infrastructure, accessing skilled resources and training should also be kept low.

Emergence of Enterprise Linux Clusters

Linux clusters can help businesses build a high-performance, reliable and scalable enterprise data warehouse infrastructure that handles tens of terabytes of data at a dramatically lower cost than proprietary solutions. Businesses of all sizes can benefit from the many advantages of an enterprise data warehouse built on low-cost Linux clusters.

There is hardly any need to explain the price/performance benefits of Linux systems to IT managers, as it is highly likely that they're already benefiting from Web, e-mail and file servers running on Linux in their data centers. Linux customers enjoy the low total-cost advantages of commodity hardware, such as competitive pricing, elimination of proprietary extensions and transferable skills.

While IT managers were initially reluctant to deploy mission-critical applications on Linux, those doubts are now a thing of the past. Enterprise versions of Linux are available from leading vendors such as Red Hat and SUSE, and major enterprise software and hardware vendors such as Oracle, HP and IBM support Linux. Contributions from these vendors have brought major improvements to the Linux kernel in the areas data warehouses depend on: I/O throughput, memory utilization, scalability, reliability, manageability and clustering. A data warehouse can take advantage of Linux clusters to deliver the high performance, scalability, availability and low TCO demanded by today's businesses.

Scalability Requirements

The key requirement for scalability is that the data warehouse must be able to support larger volumes of data and more users by adding hardware - without degrading performance. While proprietary servers often support large numbers of CPUs (up to 60 or 72 CPUs per server), a single commodity server supports far fewer (up to 8 CPUs). Therefore, commodity servers running Linux require clustering to scale.

Clustered data warehouses can be broadly categorized into two types: shared-nothing architectures, where each server accesses its own set of disks and database files, and shared-everything architectures, where each server can access all available storage. The basic premise behind both cluster architectures is that scalability is achieved by permitting data warehouses to grow with the addition of new servers to the existing cluster.

Instead of "forklift" upgrades, where the incremental cost of scaling to a larger server can be very high, businesses can scale their data warehouse by incrementally adding low-cost servers to their cluster as demand grows. When increasing the size of an existing shared-nothing cluster, new storage and servers are typically purchased at the same time, and the existing database is reorganized to be distributed across the old and new servers. Shared-everything (shared-disk) architectures offer more flexibility for growth, because servers and storage may be added when the need for either arises without requiring database reorganization.
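
The reorganization cost of growing a shared-nothing cluster can be illustrated with a minimal sketch. It assumes a simple modulo rule for assigning rows to servers; that rule, the function names owner and rows_moved, and the row and server counts are all hypothetical choices for illustration, and real databases use their own partitioning schemes, some of which move far fewer rows.

```python
def owner(row_key: int, num_servers: int) -> int:
    """Hypothetical placement rule: index of the server that stores a row."""
    return row_key % num_servers


def rows_moved(row_keys, old_servers: int, new_servers: int) -> int:
    """Count rows whose owning server changes when the cluster is resized."""
    return sum(
        1 for key in row_keys
        if owner(key, old_servers) != owner(key, new_servers)
    )


# Assumed example: 12,000 rows, cluster grows from 4 servers to 5.
row_keys = range(12_000)
moved = rows_moved(row_keys, old_servers=4, new_servers=5)

print(f"shared-nothing: {moved} of {len(row_keys)} rows change servers")
print("shared-everything: no rows move; the new server reads the same storage")
```

Under this assumed placement rule, most rows end up on a different server after the fifth server is added, which is the redistribution work a shared-everything cluster avoids.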

Enter the Grid

Interesting new capabilities are being introduced to Linux clusters in the area of grid computing. A grid is essentially an enterprise-wide cluster supporting multiple applications. A key characteristic of a grid is provisioning, which allows businesses to add or remove CPUs as needed - even while the application is running. Database vendors are starting to support such capabilities. Databases can recognize new CPUs instantly and dynamically allocate workload to new nodes based on business priorities.

For example, an enterprise grid could automatically and dynamically add CPUs to its data warehouse during a heavy overnight load process or during periods of high query activity and remove those CPUs from the data warehouse when they are no longer needed. Provisioning of this kind requires a shared-everything architecture, because a newly added server can take on work immediately only if it can access all available storage without a database reorganization. One of the promises made by Linux running on commodity servers is that it makes grid architectures a reality. The widespread availability of inexpensive industry-standard Linux servers enables companies to run entire enterprises on large grids. With some of the largest-scale Linux clusters in commercial grid computing, data warehouse applications are leading the adoption of grid computing.
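
The overnight-load scenario can be sketched as a shared pool of servers that applications borrow from and return to. This is only an illustration of the provisioning idea, not any vendor's grid software: the Grid class, its provision method, the application names and the node counts are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Grid:
    """Hypothetical pool of identical servers shared by several applications."""
    free_nodes: int
    assigned: dict = field(default_factory=dict)

    def provision(self, app: str, wanted: int) -> int:
        """Grow or shrink an application's allocation.

        Returns the number of nodes the application holds afterwards,
        which may be less than requested if the pool is exhausted.
        """
        current = self.assigned.get(app, 0)
        # A negative delta releases nodes; a positive one is capped by the pool.
        delta = min(wanted - current, self.free_nodes)
        self.assigned[app] = current + delta
        self.free_nodes -= delta
        return self.assigned[app]


grid = Grid(free_nodes=8)
grid.provision("warehouse", 2)                 # daytime query workload
grid.provision("web", 4)                       # daytime web traffic

grid.provision("web", 1)                       # web traffic drops overnight
overnight = grid.provision("warehouse", 6)     # heavy overnight load process
after_load = grid.provision("warehouse", 2)    # nodes returned when load ends

print(overnight, after_load, grid.free_nodes)
```

The sketch ignores business priorities and running workloads; it only shows that the same nodes can move between applications as demand shifts.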

Performance
