Last month, I showed how partition elimination allows us to satisfy a query by only having to scan a portion of a table rather than having to perform a full table scan. If you recall, the example we have been using involves a data warehouse that stores information pertaining to foreign automobile sales in the United States. For simplicity, we have assumed our warehouse only has information on Mercedes, Porsche, BMW and Volvo and that it has roughly an equivalent amount of information on each type of car, all of which is stored in a single table called "Car_Sales." Also, we assumed that our hardware platform has four CPUs and four disks. In last month's discussion of partitioning, we partitioned the data by car type, and data relating to each car type was assigned to its own disk. (That is, all Mercedes sales data was on disk 1, all Porsche sales data was on disk 2, etc.) We then examined a query to look at only the BMW sales data and showed how partition elimination allows us to look at only one of the partitions (the BMW partition that resides on disk 3) instead of all four. We were, therefore, able to eliminate the need to scan the other 75 percent of the Car_Sales table.

But, did this use of partition elimination really reduce the response time of the query any more than the response time improvement we observed by just using parallel scans on all the data? Assume for a minute that a full-table scan of the Car_Sales table using a single scan thread (that is, no parallelism) would result in a response time of T1. The scan thread would scan disk 1, then disk 2, etc. If we then add in parallelism and create four simultaneously executing scan threads, each thread would only have to scan a single disk (which contains one-fourth of the data), so the response time would be T1/4. Now by using partition elimination, we can remove the need to create and execute the scan threads for disks 1, 2 and 4. However, the scan thread for disk 3 still must scan one-fourth of the total data, so the response time would still be T1/4, the same as it would be if we performed a parallel scan of all the data.

Clearly, using partition elimination did free up 75 percent of the resources on the machine for other queries which could be processed at the same time, and this is a huge savings in processing power. But, it didn't improve our response time. What if we wanted to reduce the response time of our query and still use only 25 percent of the system resources? To accomplish this, we need to be able to access each individual partition in parallel. That is, we need to combine partitioning with striping, and stripe each partition over each of the disks. This is shown in Figure 1.

If we use this physical layout and execute our query for BMWs again, the database will once again eliminate partitions 1, 2 and 4, but it will now be able to use four scan threads to scan partition 3. Thus, because we only have to scan one-fourth of the data (due to partitioning) and because we can scan the remaining partition using four parallel scan threads (due to striping), our overall response time is 16 times faster than if we had no parallelism and four times faster than if we did not have partition elimination.

The Problem of Data Skew

In the description of our example where we used range partitioning to partition the table by car type, we made one important assumption: we assumed that there was roughly the same amount of data for each type of car, so each partition would, therefore, hold roughly the same amount of data. However, very often this is not the case. When some partitions contain much more data than others, the undesirable result is called data skew. To illustrate the effect of data skew on a data warehouse, let's create a new database which stores information on turkey sales at a supermarket. As our platform, we will use a 12-CPU system with 12 disks. Since we frequently want to look at turkey sales by month, we choose to use the month in which the turkey was sold as the partition key.

In general, partitioning based on the most frequently used key is the correct strategy. However, in this case, it will be disastrous because we didn't take into account the nature of the data itself. A short time after the system is put into production, you would notice that node 11 (containing the November data and, therefore, containing the Thanksgiving turkey sales) is frequently far busier than the other nodes. In fact, all the other nodes hardly get used.

Why would this happen? Because the partitioning strategy we have chosen has caused data skew. Since more turkeys are sold during November than any other month in the year, we have a lot more data on node 11. This means that, on average, node 11 will perform a lot more processing. If your warehouse has data skew, it will not be scalable. Adding more nodes or disk drives will not improve performance, because all of the work will still be performed just by node 11.

There are three options for solving this problem. First, we could combine partitioning and striping, and stripe the November partition across all 12 nodes. (In fact, we could stripe all 12 partitions across all 12 nodes. However, because the other 11 partitions contain very little data, this would likely be overkill.) The second option is to use hash partitioning instead of range partitioning, since hash partitioning is far less likely to cause data skew. However, we will lose the ability to perform partition elimination. The third option is to choose a different partition key for our range partitioning--one that will still be useful for partition elimination, but will not suffer from data skew.

Avoiding Skew by Choosing a Good Partitioning Key

With range partitioning, it's hard to know what will be a good key, so you will have to sample your database to get statistics about data range distributions. And, even then, you may find a partition key that will lead to an even distribution, but it may not be a useful key for partition elimination because it may be a column that is rarely used in the WHERE clause of your queries. So, there is no easy answer. And, to add even more complexity, the choice of the best partitioning key will be an ongoing process, since your data warehouse data will change its characteristics over time.

In addition, I also want to point out that although hash partitioning is less likely to cause skew, you still have to carefully choose which column you want to act as your hash key. It must be chosen so that its values are as unique as possible, because only with a wide variety of input values will you get an even distribution of output values (assuming we have a decent hashing function). For example, a serial number or a social security number is a good choice for a hash key. Often, a product ID is also a good hash key if you have a large number of product IDs. However, if we were to use the month in which an item was sold as our key, we would have all the same data skew issues discussed earlier. Having only 12 possible input values means you would have only 12 possible output values, and this is not nearly sufficient for effective hash partitioning.

Scalable I/O Techniques Enable Organic Growth

When designing any large system, you have many choices regarding how to structure the physical I/O. By using the basic techniques of striping and partitioning, you can gain tremendous improvements in both the response times of your queries and the overall throughput of the system. By understanding the strengths and the limitations of these techniques and by being aware of the common pitfalls, you can build a system that will be able to organically grow as quickly as your organization's needs grow--and still maintain the required performance levels.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access