I’m in charge of defining the administration tasks of a data warehouse application, and I was asked to define a method to determine the growth of the data warehouse (capacity planning). How can I achieve this in the most accurate way?


Sid Adelman’s Answer: It’s very difficult to plan for the growth of the data warehouse.

  1. You don’t know how many active users you will have. There will be some estimates, but some of those users will almost never launch a query and others will have to have their hands pried off the keyboard at day’s end.
  2. You won’t know the types of queries they will run. Will the query access five rows or will it join three 20-million row tables? If there is no charge back, the rule of thumb is to double your CPU estimates because those unbridled users will not be concerned about resources used.
  3. You won’t know how much data will be in the data warehouse. If it’s successful, it’s sure to grow but in which directions? Will the users want more historical data? How many years back? Will they want quarterly, monthly or daily data? Will you be bringing in other data that has, heretofore, not been considered?
  4. You won’t know what the DBAs will do to tune the warehouse. Expect more indexes and more summary tables.

So what’s a capacity planner to do? You will estimate the size of the database from the raw data that will be loaded. Some organizations have used a 5X factor to estimate disk requirements. If you have 100 gig of raw data, buy 500 gig of disk space to support the indexes, the summary tables and the workspace. The hardware vendors should be able to offer some assistance, but be sure to understand all their assumptions as they configure a system that is sure to support you (and the marketing rep) for years to come.
As you develop an estimate of the number of users, try to establish a profile of the users – the power users and those who will only use canned queries and reports – and learn when they will be launching their queries (e.g., only at month end, primarily from 8 a.m. to 10 a.m. Monday mornings, insomniacs working from home, running "the query that ate Cleveland" at 2:00 a.m., etc.). Assume a 50 percent increase in the number of users you will have if the data warehouse is a success – if the data warehouse is a failure, almost no one outside of IT will be using it, and you won’t have a job anyway – and design an architecture that will allow you to grow without having to change any of the fundamental pieces of the system, specifically the server, the operating system and the RDBMS. Don’t forget the network. You may be sending quite a few bits up and down the pipes (get your network person involved).

Measure right from the moment the system is in operation. Discover what assumptions were wrong – let your boss know that some estimates will be wrong – and learn how to estimate more accurately for the next phase. Capture what you have learned and make this information available to others who will be estimating capacity.

Chuck Kelley’s Answer: One way is to capture how much data is going into the data warehouse and how it is being used (including by whom). You can keep statistics from your ETL tool (or your hand- written programs!) in terms of loading, you can keep statistics from your DBMS (hopefully) on usage. There are also products on the market that can aid in the task. Pine Cone Systems (www.pine-cone.com) is one such company.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access