Last month, I looked at the first step of capacity planning for a scalable data warehouse environment, which was to estimate the business requirements that your warehouse must satisfy. Though such requirements can be stated in many ways, they can usually be classified as affecting one or more of the following "business drivers:" raw data size, average query complexity, desired query response time and the number of concurrent queries. This month, I'll look at the second step of the process, which uses these business drivers to determine an actual technical configuration for your platform, and then I'll conclude with a discussion on iterating this configuration to address the issues that arise due to complex interactions between each of the components in your configuration. STEP 2: DETERMINING THE TECHNICAL CONFIGURATION. Having completed Step 1 (Estimating the Business Requirements), we have an understanding of the requirements that need to be met by any potential configuration. The next step is to translate these requirements into a physical, balanced system configuration. This involves balancing five major system components:

Total disk capacity is often defined as the ratio between the raw data size and the total database size. The total database size is always larger than the size of the raw data it contains, because it includes the raw data plus things like indexes, system overhead and the database's internal temporary working areas. However, the database size is not just some simple function of raw data size. It is also driven by the number of expected concurrent queries, because each query requires its own working space. For example, if three concurrently executing queries have sub-queries that each return half of the raw database, then the total database size must be at least 2.5 times the raw size. As a baseline rule of thumb, for most warehouse applications with only a small amount of denormalization, the average ratio of database size to raw data size tends towards 3.0 to 4.0. However, if there is a substantial amount of denormalization, then this ratio could be much higher, with the database easily being 10 or more times the raw data size.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access