A critical part of any application development process is capacity planning. The goal is to build a system that meets your business requirements, both now and in the future. But for the last 25 years, most application development and capacity planning techniques have rested on the implicit assumption that the size of the application is relatively stable. Some amount of growth in an application is nothing new (and, therefore, neither is growth of the underlying hardware platform), but never before has it been reasonable to design a system that can grow in scope beyond its initial implementation by a factor of ten or more. I don't know about you, but for me, an application that quickly grows by an order of magnitude no longer fits into the "relatively stable size" category. Building a scalable warehouse means creating an environment where this type of rapid growth and rapid change are the norm. Since many of the original premises about slow, stable application growth no longer hold true, we have to modify our approach to capacity planning. There are two main areas of difference.

First, traditional non-scalable platforms had several components whose capacities were fixed. (For example, uniprocessors had a fixed CPU capacity.) Many modern scalable platforms, however, can scale nearly all of their subcomponents (CPU, memory, I/O bandwidth) by an order of magnitude. This adds more degrees of freedom to the capacity planning problem, which means we have to specify more parameters than we used to. The ability to change the capacity of nearly every component also requires a more in-depth understanding of the complex interactions among all of the system components.

Second, since the needs of your enterprise change so quickly, the system capacity requirements will also change quickly. This requires you to think of capacity planning not as a single event in the application development cycle, but rather as a continuous process that runs throughout the entire life of the warehouse. In other words, capacity planning becomes highly iterative in the scalable world, and each iteration has multiple steps.

This month, I'll discuss the first step of capacity planning: identifying the business drivers that will determine the capacity requirements of your warehouse. Next month, I'll discuss the second step, which involves turning those business requirements into a technical configuration, and I'll also discuss the issue of iterating your capacity planning model.

As previously stated, a basic tenet of capacity planning and system sizing is that the selected configuration provides a technical solution that meets the current and future needs of your business. Therefore, the first step in sizing a scalable system (or any system, for that matter) is determining what the business requirements are for the system. The requirements can be stated in many ways. Some may be higher-level requirements, such as the need to extract data from 10 different operational systems, the need to refresh all data every week, the need to support 350 concurrent users or the need to support 80 percent growth in data volumes next year. Others may be lower-level requirements, such as the need to extract, clean, load, aggregate, index and back up the database in the eight-hour batch window or the need to scan 50MB of data per second.

Each requirement can usually be classified as affecting one or more of four main "business drivers" that will ultimately be used to determine the technical configuration of your system. They are: raw data size, average query complexity, desired query response time and number of concurrent queries.
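
Because these drivers will be revisited on every iteration of the sizing process, it helps to record them in a simple, structured form so that successive estimates can be compared. The following Python sketch is only illustrative; the field names, units and example values are my own assumptions, not part of any standard sizing methodology:

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class BusinessDrivers:
        """One snapshot of the four capacity-planning business drivers."""
        as_of: date                     # date this estimate was made
        raw_data_gb: float              # estimated raw data size, in gigabytes
        query_mix: dict                 # share of queries by complexity category
        response_time_min: dict         # desired response time per category, minutes
        concurrent_queries: float       # average number of concurrent queries

    # Illustrative numbers only -- replace with your own estimates.
    initial_estimate = BusinessDrivers(
        as_of=date.today(),
        raw_data_gb=200.0,
        query_mix={"high": 0.2, "medium": 0.5, "low": 0.3},
        response_time_min={"high": 300, "medium": 90, "low": 30},
        concurrent_queries=4.0,
    )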

Raw data size can usually be estimated by evaluating the data feeds that are expected to create the warehouse, typically by using the formula "X records * Y bytes per record." The record count and sizes are usually estimates based upon the representation of the data in the operational/legacy system. It is important to take into account the differences in representation between legacy systems and relational databases. It is usually necessary to create a mapping of legacy data types to RDBMS data types and sizes, using each RDBMS's documentation to find the byte counts for each data type. Small errors introduced at this stage tend to be multiplicative and can have a large effect on system sizing.
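
As a rough illustration of the "X records * Y bytes per record" calculation, the sketch below maps a few hypothetical legacy columns to assumed RDBMS byte counts. The type sizes shown are placeholders; the actual byte counts vary by database product and must be taken from your RDBMS vendor's documentation:

    # Placeholder byte counts -- substitute the sizes documented for your RDBMS.
    RDBMS_TYPE_BYTES = {
        "INTEGER": 4,
        "DATE": 7,
        "CHAR(10)": 10,
        "VARCHAR(50)": 52,    # assumed: 50 bytes of data plus 2 bytes of overhead
        "DECIMAL(9,2)": 5,
    }

    # One record of a hypothetical customer feed, expressed as RDBMS column types.
    customer_record = ["INTEGER", "VARCHAR(50)", "CHAR(10)", "DATE", "DECIMAL(9,2)"]

    bytes_per_record = sum(RDBMS_TYPE_BYTES[t] for t in customer_record)   # Y
    record_count = 30_000_000                                              # X (estimated)

    raw_data_bytes = record_count * bytes_per_record
    print(f"Raw data size: {raw_data_bytes / 1024**3:.1f} GB")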

Average query complexity is a breakdown of the queries into high, medium and low complexity categories. Typically, these categories are defined by the number of JOINs performed, the presence of nested SELECTs and any aggregation or sorting operations, such as SORT or GROUP BY. For example, a query that JOINs four tables, has an embedded sub-query and then performs a GROUP BY on the results would typically be considered HIGH complexity. The categorization is somewhat arbitrary, though any experienced DBA will be able to formulate categories that they are comfortable with.
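
To make the high/medium/low categorization repeatable, the features mentioned above can be turned into a simple scoring rule. The thresholds in this sketch are arbitrary (as the categorization itself is) and are meant only to show the shape of such a rule; an experienced DBA would substitute cutoffs appropriate to the local workload:

    def classify_query(num_joins, has_subquery, has_group_by_or_sort):
        """Bucket a query into a complexity category (thresholds are illustrative)."""
        score = num_joins
        if has_subquery:
            score += 2
        if has_group_by_or_sort:
            score += 1
        if score >= 6:
            return "HIGH"
        if score >= 3:
            return "MEDIUM"
        return "LOW"

    # The example from the text: a four-table JOIN (three join operations),
    # an embedded sub-query and a GROUP BY on the results.
    print(classify_query(num_joins=3, has_subquery=True, has_group_by_or_sort=True))  # HIGH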

Desired query response time is an estimate of how long an average query should take to execute. If a breakdown of queries by category has been done, then each category will have its own desired response time (e.g., high complexity queries in four to six hours, medium in one to two hours and low in 20 to 40 minutes). These targets are difficult to estimate and are sometimes arrived at by considering the workload of the business units using the warehouse, estimating how much they will use the warehouse in a given day or week, and then working backward to calculate the time frame in which the queries must complete. When estimating this, it is important to know when the actual users will be logged in. For example, a warehouse with users in only one time zone will usually have an eight- to ten-hour peak demand cycle. If the same number of users were located across the U.S. in four time zones, the same amount of work would have 11 to 13 hours to complete.
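
The "working backward" exercise can be reduced to simple arithmetic: take the work a business unit must complete, divide it across the hours when users are actually active, and see what per-query response time that implies. The numbers below are invented purely to illustrate the calculation:

    # Illustrative only: deriving a required response time from a daily workload.
    queries_per_day = 40          # assumed workload for one business unit
    peak_window_hours = 9         # single time zone: roughly an 8- to 10-hour window
    concurrent_streams = 3        # queries the system is expected to run at once

    # Each stream must finish its share of the day's queries within the window.
    queries_per_stream = queries_per_day / concurrent_streams
    required_minutes = (peak_window_hours * 60) / queries_per_stream
    print(f"Average query must finish in about {required_minutes:.0f} minutes")

    # Spreading the same users across four U.S. time zones stretches the window
    # to roughly 12 hours and relaxes the requirement accordingly.
    print(f"With a 12-hour window: about {(12 * 60) / queries_per_stream:.0f} minutes")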

Number of concurrent queries is the average number of queries that will be executing at the same time. This number is usually estimated by typifying the user community and their data access tools and determining how much work they need to complete in a given day. As an example, if a warehouse is being used for marketing mailings, five marketing analysts each produce one mailing per week, and each mailing requires analysis by querying for 50 different customer segments (each one a separate SQL query), then we can expect about 250 queries per week, or roughly six queries per hour over a 40-hour work week. If we have estimated our desired query response time at 30 minutes per query, then we can estimate that at any one time there will be an average of three concurrent queries. If we also assume that four marketing managers will be performing ad hoc reporting on the same warehouse (usually highly complex, long-running queries), we might increase this number by one concurrent query. Thus, an estimate of the mixed workload would be four concurrent queries.
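
Worked out explicitly, the marketing example above is just an application of "arrival rate times response time equals concurrency." The figures come from the example itself, with the 40-hour work week stated as an explicit assumption:

    # Reproducing the concurrency estimate from the marketing-mailing example.
    analysts = 5
    mailings_per_analyst_per_week = 1
    segments_per_mailing = 50            # one SQL query per customer segment
    work_hours_per_week = 40             # assumed standard work week

    queries_per_week = analysts * mailings_per_analyst_per_week * segments_per_mailing
    queries_per_hour = queries_per_week / work_hours_per_week       # about 6

    avg_response_hours = 0.5             # 30-minute desired response time
    analyst_concurrency = queries_per_hour * avg_response_hours     # about 3

    manager_concurrency = 1              # four managers running occasional ad hoc reports
    total = round(analyst_concurrency) + manager_concurrency
    print(f"Estimated mixed workload: {total} concurrent queries")   # 4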

Because scalable data warehouses are so dynamic, you must continually assess your capacity needs. By keeping a constant vigil over the four main business drivers, you can track how your capacity requirements are changing over time. Next month, we'll look at converting the information gathered about these requirements into specific technical configurations.
