Last month, I looked at the first step of capacity planning for a scalable data warehouse environment: estimating the business requirements that your warehouse must satisfy. Though such requirements can be stated in many ways, they can usually be classified as affecting one or more of the following "business drivers": raw data size, average query complexity, desired query response time and the number of concurrent queries. This month, I'll look at the second step of the process, which uses these business drivers to determine an actual technical configuration for your platform. I'll then conclude with a discussion of iterating on that configuration to address the issues that arise from the complex interactions between its components.

STEP 2: DETERMINING THE TECHNICAL CONFIGURATION
Having completed Step 1 (Estimating the Business Requirements), we have an understanding of the requirements that need to be met by any potential configuration. The next step is to translate these requirements into a physical, balanced system configuration. This involves balancing five major system components:

  • Total disk capacity
  • Disk subsystem throughput
  • Disk controller throughput
  • CPU processing power
  • Memory capacity

Total Disk Capacity

Total disk capacity is often estimated from the ratio between total database size and raw data size. The total database size is always larger than the size of the raw data it contains, because it includes the raw data plus things like indexes, system overhead and the database's internal temporary working areas. However, the database size is not just some simple function of raw data size. It is also driven by the number of expected concurrent queries, because each query requires its own working space. For example, if three concurrently executing queries have sub-queries that each return half of the raw database, then the total database size must be at least 2.5 times the raw size (1.0 for the raw data itself, plus 3 x 0.5 for the working space). As a baseline rule of thumb, for most warehouse applications with only a small amount of denormalization, the ratio of database size to raw data size tends to fall between 3.0 and 4.0. However, if there is a substantial amount of denormalization, then this ratio could be much higher, with the database easily being 10 or more times the raw data size.
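The sizing logic above can be sketched as a small calculation. This is a rough illustration, not a vendor formula; the function name and the example figures (100GB of raw data, a per-query workspace fraction) are my own assumptions.

```python
def estimate_total_db_size(raw_gb, expansion_ratio, concurrent_queries,
                           avg_workspace_fraction):
    """Rough total database size in GB: raw data times an expansion
    ratio (indexes, system overhead), plus temporary working space
    for each concurrently executing query."""
    base = raw_gb * expansion_ratio
    workspace = concurrent_queries * avg_workspace_fraction * raw_gb
    return base + workspace

# The example from the text: three concurrent queries whose sub-queries
# each materialize half the raw data. With no index/overhead expansion
# (ratio 1.0), 100GB of raw data needs at least 250GB -- i.e., 2.5x raw.
print(estimate_total_db_size(100, 1.0, 3, 0.5))  # 250.0
```

With a more typical expansion ratio of 3.0 to 4.0, the same workload would land well above that floor.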

Disk Subsystem Throughput

The previous exercise tells you the rough amount of total storage space you need. But it doesn't tell you how many disks you need. For example, if you need 400GB of disk storage, should you buy 400 1GB disks, 200 2GB disks, or 100 4GB disks? The more disks you have, the higher the I/O throughput you will be able to generate. Determining the required throughput means returning to the business requirements defined earlier. First, divide the amount of data each query must scan by the desired response time for that query; this gives you the I/O throughput that query requires. Next, factor in the number of queries executing concurrently to estimate the total I/O throughput needed. Finally, divide that total by the average scan rate of the particular disk drives you are considering to arrive at the actual number of drives you will need.
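These three steps reduce to a short calculation. The drive scan rate and workload figures below are illustrative assumptions, not measurements; substitute the specifications of the drives you are actually considering.

```python
import math

def disks_needed(scan_gb_per_query, response_time_s,
                 concurrent_queries, disk_scan_mbps):
    """Number of drives required to sustain the aggregate scan rate:
    per-query throughput times concurrency, divided by what one
    drive can deliver."""
    per_query_mbps = scan_gb_per_query * 1024 / response_time_s
    total_mbps = per_query_mbps * concurrent_queries
    return math.ceil(total_mbps / disk_scan_mbps)

# 10 concurrent queries, each scanning 20GB with a 300-second response
# target, on drives that sustain roughly 5MB/s of sequential scan:
print(disks_needed(20, 300, 10, 5))  # 137
```

Note that throughput, not capacity, drives the count here: 137 drives is far more than the raw storage alone would require.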

Disk Controller Throughput

To keep the system balanced, the next step is to ensure that you have enough disk controllers to handle the aggregate amount of data coming from the disk I/O subsystem. As a first cut at this, divide the aggregate I/O bandwidth (calculated above) by the average throughput capabilities of the particular disk controllers you are considering. As a rule of thumb, it is usually wise to increase this number by about 20 percent to give you headroom to handle any momentary peaks in I/O throughput demands.
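As a sketch of this first cut, with the 20 percent headroom rule applied (the controller rating below is a placeholder, not a real product figure):

```python
import math

def controllers_needed(total_io_mbps, controller_mbps, headroom=0.20):
    """First-cut controller count: aggregate disk bandwidth, padded
    by ~20% to absorb momentary peaks, divided by the throughput
    one controller can handle."""
    return math.ceil(total_io_mbps * (1 + headroom) / controller_mbps)

# ~680MB/s of aggregate disk bandwidth through 40MB/s controllers:
print(controllers_needed(680, 40))  # 21
```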

CPU Processing Power

The next step in building your balanced system is determining the number of CPUs that you will need. As a first step, you need to determine how much data an individual processor can process in a given amount of time. Then, because we already know the throughput capabilities of the disk controllers, we can determine how many processors will be needed to handle that throughput.

Though published CPU performance metrics have their limitations, the SPECint95 rating can in fact be useful. Assuming there are no other bottlenecks on your hardware platform, the ability of the CPU subsystem to process data will be nearly linearly related to the total SPECint95 rating. So, the best approach to determining how many CPUs are needed for your configuration is to run a small test of your system (with one or two processors) to determine the relationship between SPECint95 ratings and data processing throughput for your specific query mix. (For example, you might determine that 1 SPECint95 implies an ability to process 2MB per second.) Then, configure enough CPUs to give you the total SPECint95 rating that you need for the desired aggregate throughput.
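Putting that procedure into numbers: suppose your small-scale test found that 1 SPECint95 corresponds to about 2MB per second for your query mix (the per-CPU rating below is a hypothetical figure).

```python
import math

def cpus_needed(target_mbps, mbps_per_specint, specint_per_cpu):
    """CPUs required, assuming data-processing throughput scales
    near-linearly with total SPECint95. mbps_per_specint comes from
    your own one- or two-processor test, not a published benchmark."""
    total_specint = target_mbps / mbps_per_specint
    return math.ceil(total_specint / specint_per_cpu)

# To sustain 680MB/s at 2MB/s per SPECint95, you need 340 SPECint95
# in total; with CPUs rated at 10 SPECint95 each, that's 34 CPUs.
print(cpus_needed(680, 2.0, 10))  # 34
```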

Memory Capacity

Finally, we turn to memory. This component is a little different from the others, because memory doesn't really become a bottleneck. As long as you have the minimum amount of memory to run your database, any memory beyond that amount has the effect of reducing the workload on the I/O subsystem. The more memory there is to cache data, the fewer disk I/Os the application will have to generate, and the lower the actual throughput requirements will be for the disks and the disk controllers. As a baseline, for most relational databases, count on having at least 500MB of memory to make effective use of caching in an SMP. In MPP environments, you should have at least 250MB per node if the nodes are uniprocessor and 500MB per node if they're multiprocessor (i.e., SMP) nodes.
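The baseline rules of thumb above can be encoded directly (figures in GB; these are the article's minimums, not tuned recommendations):

```python
def min_memory_gb(architecture, nodes=1, smp_nodes=False):
    """Baseline memory: 500MB for an SMP system; for MPP, 250MB per
    uniprocessor node or 500MB per SMP node."""
    if architecture == "SMP":
        return 0.5
    per_node = 0.5 if smp_nodes else 0.25
    return nodes * per_node

print(min_memory_gb("SMP"))            # 0.5
print(min_memory_gb("MPP", nodes=16))  # 4.0
```

Anything above these floors goes toward caching, which in turn relaxes the disk and controller throughput requirements calculated earlier.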

Following these guidelines in Step 2 almost seems too easy, doesn't it? Well, as always, beware of things that seem too easy. In truth, these are generic guidelines and will, therefore, lead only to configurations that are balanced for "generic" data warehouse environments. They may not be well suited to your actual workload.

Therefore, the above guidelines should only be used as a starting point for your configuration. After that, you will need to iterate on your configuration for a number of reasons. First, you must tailor and tune your system to the demands of your particular warehouse. For example, do your warehouse users access the same data frequently? If so, then a database's ability to cache parts of the database in memory will be very effective and will reduce the overall disk subsystem throughput requirements, but only if your memory cache is sufficiently large. If your users access the data in random patterns, even a large memory cache won't have much effect on the system. Or, how much aggregated data do you have? The more you have, the more disk space it requires, but the less CPU processing power your queries will need. Also, depending on the actual access patterns to certain tables, you may want to purchase additional disks and store a single, frequently accessed table on that set of disks alone, to ensure that the bandwidth available from those disks is entirely dedicated to that one table.

A second reason why you will need to iterate on your configuration is that there are complex interactions between all of the system components. That is, changing the capabilities of one component can alter the demands placed on other components. In some cases, it may just move the bottleneck to another component; in others, it may decrease the throughput requirements of another component. For example, the ratio of database size to raw data size affects the average query response time: if insufficient working space is available, SQL statements will either fail to execute or be blocked from starting, which will have a detrimental effect on the total database throughput. For similar reasons, the amount of available disk space will also affect the number of concurrent queries that can run. Figure 1 shows some of the potential interactions between business requirements and the various components in the technical configuration.

Also, the simple fact that your data warehouse environment is constantly changing and growing is yet another reason why you will need to iterate on your system configuration. Because we are using scalable technologies, we have an environment where the capabilities of all the major system components can be scaled up accordingly, as dictated by business requirements. To successfully manage a scalable data warehouse, you must treat system configuration not as a one-time exercise, but rather as a continuous process that ensures that your system is not only balanced, but that it is capable of meeting the changing needs of your organization.
