A warehouse that does not meet the performance requirements of its end users will not be successful. But meeting those requirements involves more than many people realize, because your warehouse will scale up over time, continually placing greater demands on the system. Not only must the warehouse perform well when it is first built; those performance levels must be maintained as the warehouse grows. In reality, it is not difficult to double the amount of data in your warehouse: you can simply add more disk drives and fill them with data. What is difficult is ensuring that response times stay roughly constant even as you double the amount of data or the number of users. Successfully maintaining performance levels in the face of rapid growth is the essence of scalability, and to achieve it we need to build what we call the Performance Assurance Test Environment. This is essentially a series of benchmarks used continually to both test and predict the performance of your warehouse.

Defining the Performance Assurance Metrics

The first step in creating the Performance Assurance Test Environment is defining the performance metrics that are most important to the success of your warehouse. The exact list of metrics will differ for each warehouse, but some common metrics are shown in Figure 1.

Extract Times: How many sources are we extracting from? How much data comes from each? How much time do we have to complete the extract?
Data Scrubbing and Transforming Times: What kinds of data cleansing must be performed on the data coming from the various sources, and how complex is this cleansing? How much time do we have to complete it?
Load Times: How much data must be loaded, and at what intervals? How fast can data be loaded into our warehouse? How much time do we have for our load batch window?
Query Response Times: What categories of queries do we have (such as high, medium and low priority)? What are the required response times for each (such as seconds, minutes or hours)?
Backup Times: How much data will we back up, and how often will we back up our warehouse? How fast can our database perform backups? How much time do we have to complete the backup?
Recovery Times: How much downtime is allowable? How fast must we be able to recover from a software failure or a disk drive failure?
Growth Rates (Data): How much data do we expect to have initially and in 6, 12, 18 months, etc.?
Growth Rates (Users): How many users will we have initially and in 6, 12, 18 months, etc.? What types of users will we have, and what types of workloads will they generate?
Growth Rates (Number of Queries): How many queries of each type do we have initially and in 6, 12, 18 months, etc.? How many will be running simultaneously? How will workload patterns change over time?
Figure 1: Common Performance Assurance Metrics
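The growth-rate questions at the bottom of Figure 1 reduce to simple projection arithmetic once you commit to an estimate. As a minimal sketch (the starting size, growth rate, and 6-month checkpoint interval below are hypothetical examples, not figures from this column):

```python
# Project warehouse growth at the 6-month checkpoints Figure 1 suggests.
# The initial size and growth rate here are hypothetical examples.

def project(initial, growth_per_period, periods):
    """Compound growth: value at the start and after each period."""
    values = [initial]
    for _ in range(periods):
        values.append(values[-1] * (1 + growth_per_period))
    return values

# e.g., 500 GB of data growing 40% every 6 months, projected over 18 months
data_gb = project(500, 0.40, 3)
for months, size in zip((0, 6, 12, 18), data_gb):
    print(f"month {months:2d}: {size:,.0f} GB")
```

The same projection works for user counts or query volumes; the point is to write the estimates down so they can be tested against, not to get them exactly right.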

By defining these metrics, you can then build tests to determine whether your warehouse meets these requirements. (Clearly, if you haven't defined your requirements, it's impossible to determine whether you've satisfied them.) You can run these tests before delivering the next incremental phase of the warehouse to your users, ensuring that performance requirements will be met and that there will be no surprises when that phase goes into production. Note that the last three items in Figure 1 are very important, as they capture the scalability aspect of your warehouse. Remember, it's important to understand the workload your warehouse must support initially and also to think about how demands will grow over time. These growth rate estimates enable you to proactively test your warehouse against future demands and find any scalability bottleneck in your system before your users do.
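One simple way to make these metrics testable is to record each one with an explicit target and compare measured values against it after every benchmark run. The sketch below shows the idea; all metric names, targets, and measured values are hypothetical examples, not prescriptions from this column:

```python
# Minimal performance assurance check: compare measured results against
# the target defined for each metric. All names and numbers here are
# hypothetical examples.

targets = {
    "extract_minutes": 60,          # extract must finish within an hour
    "load_minutes": 120,            # nightly load batch window
    "high_priority_query_secs": 5,  # response-time target for top queries
    "backup_minutes": 240,
}

measured = {
    "extract_minutes": 48,
    "load_minutes": 135,            # exceeds the batch window
    "high_priority_query_secs": 3.2,
    "backup_minutes": 190,
}

def assurance_report(targets, measured):
    """Return (metric, target, actual, passed) for every defined metric."""
    return [(name, limit, measured[name], measured[name] <= limit)
            for name, limit in targets.items()]

for name, limit, actual, ok in assurance_report(targets, measured):
    status = "PASS" if ok else "FAIL"
    print(f"{status}  {name}: {actual} (target <= {limit})")
```

Running a report like this before each incremental delivery turns the metrics from a wish list into a gate: any FAIL is investigated before the phase goes into production.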

Before I discuss actually building the tests to measure your warehouse's performance (I'll discuss this next month), we must address one other issue: we know which metrics we have to define, but how do we determine the actual values for them? What makes this question even more difficult is that a warehouse is usually a new application rather than a rewrite of an existing one, so there is very little data you'll be able to collect to help you determine user loads, query workloads and types, or general usage patterns. And users are notorious for incorrectly predicting what data they'll need and what types of queries they'll execute.

How then do we assign values to these metrics? My less-than-wonderful answer is: Make an educated guess. Since your warehouse is an organic system, workload patterns will be changing, and people will discover new ways of using it that they never initially envisioned. This implies two things. First, you must build and refine your warehouse incrementally. Since usage patterns will grow and change, you will need to constantly enhance your warehouse. Just as a growing plant needs continual care and feeding, so too will your warehouse. Do not think of this in a negative light, such as "You mean to tell me that I'll never be finished building my warehouse?" Think of this as a positive aspect: by continually iterating through incremental development cycles for your warehouse, you ensure that your warehouse remains capable of meeting the changing needs of your organization. (The other alternative is to let your warehouse become static, which guarantees that it will become your next legacy system.)

More importantly, the second implication is that end-user feedback after each development iteration is critical. You will have made some educated guesses about what levels of performance would be needed, and you now need to validate them with your users. You do this by talking to the end users and by monitoring your warehouse usage patterns. These two forms of feedback are important throughout the entire life of your warehouse, as they are used to continually refine the Performance Assurance metrics and to continually ensure that your warehouse meets performance expectations. Next month, I'll discuss building a Performance Assurance Test Environment to test the metrics we've defined here, as well as to test a few other areas critical for scalability.
