How do you tune a data warehouse? In a word, you don't, or at least you don't tune it very finely.
This statement contradicts the existence of entire departments of systems programmers who spend their working lives tuning the online transaction processing (OLTP) environment. Along comes a data warehouse and repudiates their entire profession. How can that be?
Consider the job of the systems programmer/tuner in the OLTP environment. The systems programmer first comes to understand the OLTP workload. Once the workload is understood, he adjusts the system for optimal execution of that workload. He looks at such things as the Monday-to-Friday transaction rate, the peak periods of the month and the length of the overnight batch window. He uses tools such as extrapolations, projections and capacity planning estimates. To his credit, the systems programmer often gets it right: when the job of tuning is done properly, he manages to squeeze all the performance out of the computer.
The Data Warehouse DSS Environment
Enter the data warehouse decision support system (DSS) environment. In the data warehouse DSS environment there is a big gotcha: the workload is perpetually changing. The only time the workload settles down to a state of stability is when the data warehouse isn't being used. It is as simple as that.
What happens when a systems programmer tries to apply his witchcraft to the data warehouse environment? Sure enough, he can optimize throughput for the workload he sees. However, just as soon as he has applied the changes, the workload changes. The systems programmer makes a second set of changes, just in time for the workload to change again.
Does this mean capacity planning should not be done for the data warehouse? On the contrary, it is important to keep up with capacity planning. Does it mean that obvious bottlenecks of monstrous proportion should be ignored? Of course not. If part of the infrastructure is causing gross distortions, it needs to be addressed. But do large studies, calibrated predictions and detailed buffer pool analyses need to be done? No; such analyses are a waste of time in the data warehouse DSS environment.
In the same vein, when an OLTP systems programmer finds some inefficiency in the processing done by the system, he has a set of tools to remedy the problem, such as:
- Increasing buffer size,
- Reprioritizing transactions,
- Rescheduling transactions,
- Parallelizing a work flow,
- Adding indexes, and
- Adding hardware or software.
Ironically, all of these classical OLTP fixes (with the exception of adding hardware or software) are pretty much useless in the data warehouse environment. When it comes to improving performance, the data warehouse systems programmer has an entirely different set of tricks. A data warehouse systems programmer may:
Remove dormant data. Dormant data slows everything for everybody. Getting rid of dormant data is the number-one technique for improving performance for large data warehouse environments.
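As a rough sketch of the idea, assuming the warehouse tracks when each row was last touched (SQLite stands in for a real warehouse DBMS here, and the table and `last_accessed` column names are invented for illustration), dormant rows can be moved to near-line storage and purged from the active tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Active warehouse table plus an archive table for dormant rows.
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL, last_accessed TEXT)")
conn.execute("CREATE TABLE sales_archive (id INTEGER, amount REAL, last_accessed TEXT)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    (1, 100.0, "2024-01-05"),  # recently touched
    (2, 250.0, "2021-06-01"),  # dormant for years
])

cutoff = "2023-01-01"  # rows not accessed since this date count as dormant
# Copy dormant rows to the archive, then remove them from the active table,
# shrinking the data every query must wade through.
conn.execute(
    "INSERT INTO sales_archive SELECT * FROM sales WHERE last_accessed < ?",
    (cutoff,))
conn.execute("DELETE FROM sales WHERE last_accessed < ?", (cutoff,))
conn.commit()
```

In practice the dormancy indicator comes from query monitoring rather than a handy column, but the effect is the same: smaller active tables mean less I/O for everybody.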
Create data marts. Data marts are created when many people access the detailed data in the data warehouse in the same manner. By moving the data to a data mart, the detailed data that comes from the data warehouse can be reshaped, summarized, indexed and restructured. Fact tables and dimension tables can be created. By moving that data from the data warehouse to the data mart, the workload against the data warehouse decreases dramatically.
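A minimal sketch of the reshaping step, again with SQLite standing in and invented table and column names: detailed warehouse rows are restructured into one dimension table and one summarized fact table for the mart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Detailed warehouse data: one row per sale (names are illustrative).
conn.execute("CREATE TABLE dw_sales (sale_date TEXT, region TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO dw_sales VALUES (?, ?, ?, ?)", [
    ("2024-01-05", "East", "widget", 10.0),
    ("2024-01-05", "East", "gadget", 20.0),
    ("2024-01-06", "West", "widget", 15.0),
])

# Dimension table: one keyed row per region.
conn.execute("CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, region TEXT UNIQUE)")
conn.execute("INSERT INTO dim_region (region) SELECT DISTINCT region FROM dw_sales")

# Fact table: detail summarized by day and region, keyed to the dimension.
conn.execute("""
    CREATE TABLE fact_daily_sales AS
    SELECT s.sale_date, r.region_id, SUM(s.amount) AS total_amount
    FROM dw_sales s JOIN dim_region r ON s.region = r.region
    GROUP BY s.sale_date, r.region_id
""")
```

Once the mart holds the fact and dimension tables, the repetitive queries hit the mart instead of the warehouse's detailed data.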
Move data mining and exploration to a data mining/exploration data warehouse. When there is a significant amount of statistical processing going against a data warehouse, it makes sense to move the data that is being analyzed to another facility. There the data can be reshaped to better meet the needs of the individual doing the analysis. By moving the statistical processing to another environment, a heavy load is lifted from the data warehouse workload.
Create an occasional summarization or aggregation. Because a data warehouse serves an eclectic community, it is dangerous to do summarization or aggregation in the data warehouse: summarization done to the benefit of one organization comes at the expense of everyone else. Occasionally, though, everybody (or nearly everybody) in the organization looks at the same data summarized or aggregated the same way. Usually that level of summarization is low, not high. In these rare instances, it makes sense to summarize or aggregate the data that is commonly used that way.
Remove data that is at too low a level of granularity. Most data is not at a low level of granularity. However, sometimes data needs to go into a data warehouse that is at too low a level of granularity. Consider clickstream data from the Web environment, for example. It almost never makes sense to enter clickstream data directly into the data warehouse. Instead, the data passes through a granularity manager to adjust the level of granularity for use in the data warehouse.
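The article does not spell out what a granularity manager does internally; one plausible sketch, with invented field names, condenses raw click events into one session-level record per user session before anything is loaded into the warehouse:

```python
from collections import defaultdict
from datetime import datetime

def granularity_manager(click_events):
    """Condense raw clickstream events into one record per (user, session),
    raising the level of granularity before the warehouse load.
    Field names here are illustrative, not a standard."""
    sessions = defaultdict(list)
    for event in click_events:
        sessions[(event["user"], event["session"])].append(event)

    summarized = []
    for (user, session), events in sessions.items():
        times = sorted(datetime.fromisoformat(e["ts"]) for e in events)
        summarized.append({
            "user": user,
            "session": session,
            "page_views": len(events),
            "duration_sec": (times[-1] - times[0]).total_seconds(),
        })
    return summarized

raw = [
    {"user": "u1", "session": "s1", "ts": "2024-01-05T10:00:00"},
    {"user": "u1", "session": "s1", "ts": "2024-01-05T10:05:00"},
    {"user": "u2", "session": "s2", "ts": "2024-01-05T11:00:00"},
]
records = granularity_manager(raw)  # three raw clicks become two session records
```

The warehouse then stores sessions, not clicks, and the flood of raw events never reaches it.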
These are but a few of the techniques used by a systems programmer to improve performance in the data warehouse environment. Note that these techniques bear no resemblance to the classical techniques used by an OLTP systems programmer.
The next time someone says, "Let's tune the data warehouse," please raise an appropriate question or two.