Data Inundation, Impatient Users: Welcome to the Data Warehouse Manager's Dilemma

A 2004 Forrester report stated it rather succinctly: "In time, simple data warehousing systems can be made ultraresponsive, but when both complexity is increasing and the time horizon is shrinking, costs escalate rapidly. In data warehousing systems, cost is an inverse function of data volumes, complexity and frequency of update." [1]

Add to this a rapidly expanding "Google-conditioned" user community expecting instant answers to anything, and you have described the data warehouse manager's dilemma: how do you cost-effectively maintain service level agreements in the face of growing data, growing demand and growing complexity?

Business intelligence (BI) is one of the business success stories of this decade. Once organizations get a taste of data-driven management, there is no stopping them. The data warehouses that underpin BI have grown from gigabytes to terabytes and, by some estimates, are expanding as much as 100 percent a year, while demand for standard reports is being superseded by more ad hoc requests and more exploratory analysis. What are the options for coping with this success?

SQL Server and Scalability

Parallel to the growth in data warehousing (DW) has been the emergence of SQL Server as a BI platform of choice. In terms of number of instances, SQL Server is approaching 50 percent of the DW market, and DW/BI accounts for approximately one-third of SQL Server sales. Though a late arrival on the relational database scene, Microsoft astutely recognized that the analytic processing market would grow to rival transaction processing and took steps to specialize SQL Server for the task. SQL Server 2005 certainly sets the feature bar for other databases, with Analysis Services, Reporting Services and Integration Services providing functionality that rivals the best from the specialist OLAP, BI and ETL vendors.

The reality is that SQL Server does not have the scalability credentials to match its functionality leadership. Short of the disruption and cost of migrating to a competing database, what are the options for organizations looking to cost-effectively scale their SQL Server warehouse as organizational demand grows?

Fully Exploit Analysis Services

Analysis Services is designed to deliver performance to the BI desktop - and it does that extremely well within its specified design parameters. By utilizing Analysis Services where appropriate, you can provide better performance than direct database access for many kinds of routine queries and standard reports. However, you must also accept the limitations of the OLAP-centric approach of Analysis Services. For example, cubes can become cumbersome to inventory and manage, require time and resources to build and, generally, provide access to a preconceived snapshot of the data rather than supporting pure ad hoc exploration.
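For routine reporting, exploiting Analysis Services can be as simple as directing queries at a cube rather than the base tables. As a minimal sketch - assuming a linked server named SSAS_DW defined with the MSOLAP provider against an Analysis Services database containing a [Sales] cube (all names hypothetical) - a relational application can retrieve precomputed aggregates via OPENQUERY:

    -- Pull pre-aggregated results from an Analysis Services cube in T-SQL.
    -- SSAS_DW is an assumed linked server (MSOLAP provider); the cube,
    -- measure and dimension names are illustrative.
    SELECT *
    FROM OPENQUERY(SSAS_DW,
        'SELECT [Measures].[Sales Amount] ON COLUMNS,
                [Date].[Calendar Year].MEMBERS ON ROWS
         FROM [Sales]');

Because the cube answers from precomputed aggregations, this is exactly the class of query where Analysis Services shines; the limitations surface when users step outside what the cube was designed to contain.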

Focus on Warehouse Indexing

The costs of indexing and other performance-tuning techniques are high. They are space-intensive and deliver diminishing returns: additional indexes rarely yield the same percentage performance boost as the initial ones. They are restrictive in that they boost performance only for certain designated fields and offer no help to the rest. And they are time-intensive: they take time to design, build, update and maintain, and they add immensely to the load and update times of the data warehouse, impeding high availability.
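A hypothetical example makes the trade-off concrete. The index below (table and column names are illustrative) accelerates only queries that filter on the indexed columns, consumes additional space and must be maintained - or dropped and rebuilt - on every warehouse load:

    -- Illustrative covering index on a hypothetical fact table.
    -- Helps queries filtering on CustomerKey and OrderDateKey;
    -- does nothing for queries on any other column, and adds
    -- space and maintenance cost to every load cycle.
    CREATE NONCLUSTERED INDEX IX_FactSales_Customer_Date
        ON dbo.FactSales (CustomerKey, OrderDateKey)
        INCLUDE (SalesAmount);

Multiply this by the dozens of query patterns a growing user community generates, and the space and load-window costs compound quickly.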

Upgrade to a Larger Server

More iron is always a possibility and often the chosen path. Yet the traditional symmetric multiprocessing (SMP) server deployed for most data warehouses is typically not CPU-bound by the application, so additional processors deliver limited benefit. As you scale, you also enter the territory of enterprise edition pricing for SQL Server, and those high-performance 32-way boxes are approaching the six-figure mark - beyond the budget range of a typical SQL Server shop.

Move from SMP to MPP

The majority of data warehouses run on an SMP architecture (often called a shared everything architecture) where, regardless of the number of processors, the memory bus and disk channels are shared by all processors, limiting throughput. An alternative is the massively parallel processing (MPP) architecture (also called a shared nothing architecture), which specifically pairs processors, memory and disk to maximize throughput. The art in MPP design lies in partitioning the task across nodes and orchestrating them to work together. This has proven too difficult for most general-purpose computing, but it is quite feasible for a definable task domain like data warehousing.
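Row placement is what makes shared nothing work: each row of a fact table lives on exactly one node, typically assigned by hashing a distribution key, so every node scans only its own slice in parallel. The T-SQL below is a conceptual sketch of that mapping only - the four-node count and table names are hypothetical, and a real MPP engine performs this placement transparently at load time, routing query fragments to each node's local slice:

    -- Conceptual sketch: hash rows of a fact table across 4 MPP nodes.
    -- Each row maps to exactly one node, so at query time all four
    -- nodes can scan their own portion of the data simultaneously.
    SELECT OrderKey,
           CustomerKey,
           ABS(CHECKSUM(CustomerKey)) % 4 AS AssignedNode
    FROM dbo.FactSales;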

The advantage of MPP for a data warehouse is that it optimizes the machine resources to perform well even against unoptimized data, whereas SMP optimizes the data (through indices, cubes, etc.) to compensate for a suboptimal server and database configuration. An MPP data warehouse performs well without the need for indexes, cubes and materialized views, and without restricting users to certain classes of queries or preselected fields. It is essentially "load and go": warehouses are up and running faster, updated faster and maintained with fewer resources devoted to tuning, and they scale in a linear fashion so that benefit is received from 100 percent of any CPUs added.
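The linearity claim is easy to sanity-check with round numbers (illustrative figures, not benchmarks): if one node scans its disks at roughly 100 MB per second, a 720 GB fact table takes about two hours to scan; spread evenly across 24 nodes, each node owns a 30 GB slice and the same scan completes in roughly five minutes. Doubling the node count halves each node's slice, which is why added hardware translates so directly into performance.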

Not All MPP is Equal

MPP data warehouse solutions have been around since the 1990s, but until recently, they were based on proprietary hardware systems and priced at a level that only the Wal-Marts of this world could afford. Recently, a new class has emerged that takes advantage of the high-performance off-the-rack hardware now available. There are two general categories - the appliance sold as a hardware/software bundle and software-only solutions, which are further divided into vendor-developed software and open source.

The appliance vendors provide a configured system based on their own proprietary assembly of components. The appliance is attractive in that it offers one-stop shopping, but it raises questions about corporate hardware standards and about the vendor's ability to keep provisioning underlying hardware that takes full advantage of rapidly improving server price/performance. Appliances tend to come in a limited number of rack configurations, so you are likely to have to buy future capacity in advance rather than adding blades incrementally as your business demands the additional performance. Software-only products, on the other hand, invite integration and configuration issues of their own, and the open source variants bring the open source support dynamic into play.

There are also two basic deployment models - vendors who offer to replace your data warehouse and those who add on to or extend the underlying warehouse. The data warehouse replacement negates corporate database-of-record standards, opens questions about betting your data on a smaller vendor and requires that applications be ported - in SQL Server's case, at the expense of losing specialized BI functionality and having to reprogram applications to work against a proprietary open database connectivity (ODBC) driver.

Choosing MPP for SQL Server Environments

Exploring an MPP companion approach makes great sense for SQL Server users who need to build in performance headroom. MPP offers linear performance scalability at a lower operations cost, eliminates the tedious routine of indexing for performance and doesn't restrict which questions users can ask of which data. The key is that it is a companion approach rather than a replacement strategy: you do not lose the functionality of Analysis Services for those analytics where it makes the most sense, and you retain all the SQL Server systems management, backup and related investments you have made to date. A rip-and-replace appliance can be as problematic in terms of application integration as moving to Oracle on an SMP server, and it makes little sense considering how much SQL Server brings to the table.

With the cost of rack servers continuing to fall and their performance continuing to rise, MPP solutions - provided they allow you to buy hardware capacity as you need it and not in advance - represent a high-performance, nondisruptive, cost-effective, scalable answer for SQL Server users to the data warehouse manager's dilemma.

Reference:

  1. Lou Agosta and Keith Gile. "Factors Driving Data Warehousing Response Time." Forrester, 2004.
