What activities need to be done in capacity planning for a data warehouse consolidation project?

  • Sid Adelman, Tom Haughey
  • February 07 2006, 1:00am EST

Q: I've been asked to do capacity planning for a data warehouse consolidation project. I am new to this area - can you tell me what activities need to be done and how to proceed?

Sid Adelman's Answer:

It's amazing how many projects do not consider capacity planning as they jump into an architecture that might not perform. The following is excerpted from the chapter on Performance in my book Data Strategy.

The first step in capacity planning is to define just how much performance is needed, so ask the client for their performance requirements. How the questions are asked, and how the subsequent negotiations are handled, is critical to truly understanding those requirements. If there is no difference in price between the very fast and the moderate response-time option, clients will invariably choose the faster of the two.

The client generally has some implicit performance expectations. These expectations are usually derived from previous experience or, more often, from their current experience with their own PCs running PC applications. PC response time is usually very good and almost impossible to match with any system that requires interacting with another computer across a network, regardless of whether that interaction is over a LAN or a remote link. User expectations must be explored; they cannot be overlooked.

Excellent performance is not free. It costs more to deliver sub-second response time than it costs to provide three-second response time. When clients are told that the sub-second option will cost an additional $2 million, they might opt for the three-second option. Clients can be given more than one option that allows them to weigh price against performance.

In this way, the client is provided the information needed to make an intelligent decision. This process can only be used if IT is willing to follow these negotiations with service level agreements (SLAs) between IT and the clients. IT can only provide price/performance options if it fully understands what it is capable of delivering and knows the costs of those options.

No system of any size should be implemented without first having a rough idea of how well it will perform. It is very expensive to make changes late in the development life cycle. Many of the problems that would eventually be experienced can be predicted early in the cycle by design reviews, performance modeling, benchmarking and other user experiences.

To effectively model, certain information is vital such as:

  1. The number and complexity of transactions anticipated,
  2. The number and complexities of queries anticipated, and
  3. The number of users.
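Given those three inputs, a first-pass analytical model can be as simple as a single-server queueing estimate. The sketch below is illustrative only: the arrival rate and service time are made-up figures, and real performance-modeling tools use far richer models than M/M/1.

```python
# Hypothetical back-of-the-envelope estimate using a single-server
# M/M/1 queueing model. Inputs mirror the three items above:
# anticipated arrival rate (transactions/queries per second, driven
# by the number of users) and the average service demand per request.

def estimated_response_time(arrivals_per_sec: float,
                            avg_service_sec: float) -> float:
    """Predict mean response time R = S / (1 - utilization).

    Valid only while utilization stays below 1.0; beyond that the
    system is saturated and the queue grows without bound.
    """
    utilization = arrivals_per_sec * avg_service_sec
    if utilization >= 1.0:
        raise ValueError(f"System saturated (utilization={utilization:.2f})")
    return avg_service_sec / (1.0 - utilization)

# Example: 2 queries/s arriving, each needing 0.3 s of server time
# on average -> 60% utilization, mean response time of 0.75 s.
r = estimated_response_time(arrivals_per_sec=2.0, avg_service_sec=0.3)
print(f"Predicted mean response time: {r:.2f} s")
```

Note how quickly the estimate degrades as utilization climbs: at 90 percent utilization the same 0.3-second query would average 3 seconds. That nonlinearity is why modeling before implementation matters.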

Performance can be anticipated a number of ways. The techniques to be used will be determined by the importance of the system, the performance requirements, the size of the system, and the organization's interests and skills in performance. The following techniques can be used:

  1. Extrapolating from the experience of other companies, if an application package is being implemented.
  2. Reviewing the design of the system, the database and the application.
  3. Using an analytical (performance modeling) tool to predict performance.
  4. Simulating the system with a simulation facility.
  5. Benchmarking the system - this involves writing code, executing the code in an environment similar to the target production environment and measuring the results.

Performance Modeling - Performance modeling facilitates evaluating multiple courses of action without the cost and disruption of actually implementing the options. Modeling is generally performed during the initial design but can also be effectively used when evaluating multiple tuning alternatives, such as the impact of different types of disks. Modeling is not a free activity and will require the involvement of a skilled performance person. Modeling can provide important information without causing the organization to expend the money and people resources to experiment with options that may not significantly improve performance.

IT is constantly changing, but one thing that does not change is the misguided assumption that the new change will not measurably affect the performance of the existing system. This change could be a new release of the DBMS, a new release of the application software, some hardware change, or some additional workload on the hardware. What happens, too often, is that the change causes poor performance in the existing operational systems. The effect was not anticipated, and IT goes into a traditional dance to fix the problem. It would have been far better and much less disruptive if the impact of the change had been anticipated. While modeling will not catch all the problems, it should anticipate the majority of them and thus shield the users from the effects of the changes. Modeling can help the reputation of IT and help it meet the performance SLAs by minimizing the number of performance problems caused by system changes.

Benchmarks are expensive, take time and effort from personnel important to the organization, and could delay the start of a data warehouse project. A benchmark could even result in lost momentum on a project. The effort and activity required by a benchmark could be profitably spent on other endeavors so don't start one unless you must. However, you must if:

  • You expect your system to grow above a level that could cause performance problems. That level could include growth in data volume, number of concurrent users, complexity and size of the workload, or situations where the ETL process could have problems such as short windows for load or update.
  • You established mandatory criteria for product selection that include the ability to handle your proposed volumes with specified service level agreements. Mandatory means that if the product does not satisfy those criteria, it will not be chosen regardless of function or any other criteria it satisfies.
  • Your growth could add to the cost of the system to the point where the project loses its cost justification.
  • You have no guarantees of how you will grow and expand (a merger may be in the offing) and you want to be sure the system you choose will scale.
  • Management is leaning toward a solution whose capabilities you question. They don't believe or understand your concerns and the only way to convince them of the problems is by insisting on a benchmark.
  • Management questions the proposed configuration's ability to support the expected workload.
  • Management believes a modest (and insufficient) configuration will do the job.
  • You want to determine which vendor's product(s) will best support your requirements.
  • You already have purchased and are running with a vendor's product but a major new application is on the drawing board and you want to determine the impact of this new application.
  • The more cautious people in the organization want to over configure to be sure you will be getting adequate performance. This over configuration is expensive and you don't want to buy more hardware than you need.
  • You may have concerns that, to keep their price competitive, the vendors have lowballed the proposed hardware with the expectation that you will buy more hardware when it becomes apparent that additional computer resources are needed.

Service Level Agreements - Your users expect some level of service, and most scope agreements include SLAs. There is always a tradeoff: demanding SLAs will cost more and will take more IT effort. Whoever is responsible for establishing the SLAs must be aware of the costs, the time and the effort needed to satisfy them.

Response Time - Because data warehouse queries can be extremely variable in the resources they use (a query may access ten rows, or ten million rows, or join two tables with ten million rows each), response-time SLAs should be established for specific, identified and known queries. If an upper limit is given for response time, it should apply to a percentage of queries - e.g., 95 percent of queries should come back in 10 seconds or less; there will always be one or two that take minutes, not seconds. Remember The Query that Ate Cleveland.
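The 95-percent rule can be checked mechanically against measured query times. This sketch uses the simple nearest-rank percentile definition; the sample timings are made up, with one deliberate "Query that Ate Cleveland" outlier.

```python
import math

def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100.0 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

# 20 sample query response times in seconds: 19 well-behaved queries
# and one monster that ran for 2.5 minutes.
times = [0.5, 0.8, 1.1, 1.3, 1.7, 2.0, 2.4, 2.9, 3.3, 3.8,
         4.2, 4.9, 5.5, 6.1, 6.8, 7.4, 8.0, 8.6, 9.0, 150.0]

p95 = percentile_nearest_rank(times, 95)
sla_met = p95 <= 10.0
print(f"p95 = {p95:.1f} s, SLA (95% within 10 s) met: {sla_met}")
```

Here the 95th percentile is 9.0 seconds, so the SLA is met even though one query took minutes - which is exactly why the SLA is stated at the 95th percentile rather than the maximum.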

Availability - There are three dimensions to availability in a data warehouse, but only two are relevant to the benchmark.

  1. Scheduled hours - for example, the data warehouse may be scheduled for access eighteen hours/day, six days/week. This dimension is relevant if a load impacts the scheduled hours.
  2. Availability during scheduled hours and days - for example, during scheduled hours, the SLA for availability is 99 percent. This cannot be determined in the benchmark but can be inferred by checking references - be sure to ask the references what availability they have been experiencing.
  3. Time to recover - you will want to test what happens when there are failures (not if). The benchmark should include demonstrating recovery processes for hardware failures (disk and CPU), operating system problems, database corruption, query failure, etc. In the case of a system failure, how long does it take to recover and restore user access to the system? The benchmark should provide information on performance degradation during recovery, including response time and number of concurrent users. Tell the vendor what you want to fail. Give them a few scenarios; they can always cause a hardware failure. If they can't, bring along an axe; they will get the idea.

Load Time - If the proposed architecture precludes access during load, the time to load is very important in determining scheduled availability and should be a critical success factor for the benchmark results. If the queries can be run during the load, the impact of the load on concurrent usage and query response time should be measured.
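A quick feasibility check is to compare the estimated load time against the scheduled batch window. Every figure in this sketch - row count, load rate, index-build time and window length - is a hypothetical placeholder to be replaced with numbers measured in your own benchmark.

```python
# Rough check of whether a nightly load fits its window, assuming the
# architecture precludes query access during load.

def load_hours(rows: int, rows_per_sec: float,
               index_build_hours: float) -> float:
    """Hours needed to load `rows` rows plus rebuild indexes."""
    return rows / rows_per_sec / 3600.0 + index_build_hours

# Hypothetical benchmark figures: 50M rows at 25,000 rows/s,
# plus 1.5 hours of index builds, against a 6-hour nightly window.
needed = load_hours(rows=50_000_000, rows_per_sec=25_000,
                    index_build_hours=1.5)
window = 6.0
fits = needed <= window
print(f"Load needs {needed:.2f} h of a {window:.0f} h window: "
      f"{'OK' if fits else 'WINDOW BLOWN'}")
```

Running this projection at several multiples of today's volume shows how much data growth the window can absorb before availability SLAs are threatened.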

Be sure you know beforehand what you are testing. You will want to monitor, measure and document the following:

ETL process -

  • Time and effort to transform the data.
  • Time and effort to create indexes.
  • Time and effort to load the full set of data.
  • Database tuning.
  • Problems experienced.

Queries -

  • Response times.
  • Number of concurrent users.
  • Workload mix.
  • Machine resources used.
  • Problems experienced.

System availability during the benchmark -

  • What was the recovery process?
  • Did queries have to be resubmitted, and if so, how?
  • What recovery techniques were used?
  • Query performance during recovery?
  • Does a load recovery cause any aberrations in the data warehouse database?

These are the rules that should govern your approach to anticipating and planning for good performance:
Rule 1: The most critical component to achieving good performance is properly setting user expectations.

Rule 2: Good performance will only be achieved by diligent attention and concern for good performance.

Rule 3: No system of any significant size should be implemented without first understanding how it will perform.

Rule 4: Performance will never be optimal on the first implementation. There is always an opportunity to improve.

Rule 5: Performance should always be identified as a critical success factor in any important system.

Tom Haughey's Answer:

I'll answer this by identifying major areas that significantly affect capacity requirements.

Volumetrics. For each original business intelligence (BI) application or system, get data volumes, sizes, number of users and level of concurrency, both current and projected.

History. Consider not only base data (facts and dimensions) but changes to base data over time and the types of changes. What level of history you need is a key business decision. If you keep too much history, you have a space and performance problem, which can be managed. If you do not keep history, or do not keep the right kind of history, you have a very costly redesign and possibly replatform problem.

Growth. Work to determine expected growth in data volume, users, sources, etc. New sources, such as adding new countries or companies to your data warehouse, can significantly increase capacity requirements.

Concurrency. Concurrency deals with the number of users that will simultaneously be running queries. It is important to understand the level of concurrency and its expected growth rate. Consider this scenario: you design a DW for 200 users. Then it grows to 500 - still fine, a little slower, but still good. Then it grows to 1,000 - getting shaky. Then it grows to 3,000 - you're dead. You cannot let this happen. Plan for maximum usage over time. The initial installation does not have to cover maximum usage; that is where scalability comes in. Pick a solution that is scalable, and you will be able to add resources to accommodate concurrency growth.
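The scenario above can be turned into a simple planning projection: given an assumed annual growth rate and the concurrency ceiling of a candidate configuration, estimate how many years of headroom you are buying. The growth rate and capacity figures below are illustrative assumptions, not measurements.

```python
# Sketch: projecting concurrent-user growth to see when a candidate
# configuration runs out of headroom.

def years_until_saturation(current_users: int, max_users: int,
                           annual_growth: float) -> int:
    """Whole years until concurrent users first exceed capacity."""
    users, years = float(current_users), 0
    while users <= max_users:
        users *= (1.0 + annual_growth)
        years += 1
    return years

# Hypothetical inputs: 200 concurrent users today, a configuration
# that tops out around 1,000, and 60% growth per year.
y = years_until_saturation(current_users=200, max_users=1000,
                           annual_growth=0.60)
print(f"Configuration saturates in year {y}")
```

If the answer comes back as only a year or two, either the configuration or the scalability story of the chosen platform needs another look.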

Database. In determining data storage requirements, consider that DW databases typically consist of three roughly equal parts: one-third base data, one-third aggregates and one-third indexes. Also, what is the expected availability of the warehouse? If 24x7 is required, the database may be larger if you are expected to support "trickle load," which is loading new data while people are querying existing data. Consider two database factors: data size and complexity. The consolidated data model is likely to be larger than the original decentralized models, and it should be expected to contain data that is wider and deeper than they did.
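The one-third rule of thumb above gives a first-cut storage estimate: total warehouse storage is roughly three times the base data. The working-space factor in this sketch is my own added assumption (scratch space for sorts, loads and reorganizations), not part of the rule.

```python
# First-cut storage estimate from the one-third rule of thumb:
# base data, aggregates and indexes each about one-third of the total.

def warehouse_size_gb(base_data_gb: float,
                      work_space_factor: float = 0.2) -> float:
    """Estimate total storage: base + aggregates + indexes + scratch.

    work_space_factor is an assumed allowance for sort/load/reorg
    working space on top of the three equal parts.
    """
    base = base_data_gb
    aggregates = base_data_gb   # ~ one-third of total
    indexes = base_data_gb      # ~ one-third of total
    return (base + aggregates + indexes) * (1.0 + work_space_factor)

# Example: 500 GB of consolidated base data.
total = warehouse_size_gb(500.0)
print(f"Plan for roughly {total:,.0f} GB of storage")
```

The point is less the exact multiplier than the reminder that raw data volume badly understates what the consolidated warehouse will actually occupy.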

Queries. Understand the type and complexity of queries. Complex queries will require a powerful engine (which affects CPU capacity and power), and may require pre-calculated data to improve their performance (which affects aggregate estimates). Remember to consider how much RAM you will need. RAM is about 200x faster than disk storage, but it is about 80x more expensive than disk storage per megabyte.
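Taking the article's round numbers at face value, the RAM-versus-disk tradeoff can be reduced to a one-line price/performance calculation for data hot enough to stay cached.

```python
# Price/performance check using the round numbers quoted above:
# RAM ~200x faster than disk, ~80x more expensive per megabyte.
ram_speedup = 200.0
ram_cost_multiple = 80.0

price_performance_gain = ram_speedup / ram_cost_multiple
print(f"RAM delivers {price_performance_gain:.1f}x more speed per "
      f"dollar than disk")
```

A ratio above 1.0 says that for the working set that actually fits in memory, RAM is the better buy; the hard capacity-planning question is estimating how large that hot working set is.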

ETL (Extract, Transform and Load). ETL has its own requirements, especially for metadata and for staging data. ETL tools can have large storage and CPU requirements. These are part of the overall environment.

BI Tools. The same is true of BI tools: they bring their own storage and CPU requirements that are part of the overall environment.

Metadata and Modeling. Data modeling, mapping and BI tools all require metadata, which will add to capacity requirements in different areas.

I'm sure there are some other areas, but this will give you a reasonable start.
