Every morning in Africa, a gazelle wakes up and knows it must outrun the fastest lion to survive. Every morning a lion wakes up and knows it must outrun the slowest gazelle or it will starve to death. It does not matter if you are a gazelle or a lion. When the sun comes up you hit the ground running.

In the data warehouse world, companies know that they must outrun the competition if they are to survive. Customers need speed and agility and make million dollar purchasing decisions based on a vendor's TPC-D benchmark results. Some vendors have found loopholes in the TPC-D tests and present a "truth" that makes them look like gazelles when, in fact, they are actually "lyin." This article is designed to inform data warehousing customers on what TPC-D benchmarks measure and what they don't.

The Transaction Processing Performance Council (TPC) is a democratic organization founded in the early 1980s. It consists of individuals from a wide range of vendors. They have cooperatively designed a series of tests to measure power, throughput and price/performance for decision support and OLTP computer systems in order to disseminate objective, verifiable data to the industry.

TPC-D benchmarks were designed for systems that examine large volumes of data and execute complex queries. These complexities include sub-queries, table scans, multiple table joins and complex aggregations. The Council sets data volume points of 1 gigabyte (GB), 10GB, 30GB, 100GB, 300GB, 1 terabyte (TB), 3TB and 10TB. Each volume point is considered a different benchmark. Each audited TPC-D result is published as a three-page executive summary (containing 3 primary and 2 secondary metrics) and a full-disclosure report.

The primary metrics are:

TPC-D power: This metric is a query-per-hour rate for a single user. The user runs 17 varying queries and two updates in a serial fashion, alone on the system.

TPC-D throughput: This metric represents a query-per-hour rate for multiple users. Concurrent execution of 17 queries and two update functions are run to measure the total workload supported by the system. Amazingly, this metric can be bypassed, yet its metric is still reported. It is important to check the query time throughput for a single user and the average time for multiple users. Compare the differences to a factor of the number of streams. A zero (0) stream factor means the throughput test was run with a single user. Make sure the stream factor is two or three to see concurrent throughput of two or three users.

TPC-D price/performance: This metric identifies the five-year cost of the system in relation to the composite query-per-hour rating of that system.

The two secondary metrics are:

Load time: This metric represents the time it took to load the test data from a client or mainframe. There are ways to make load times appear faster and easier. I cannot stress enough the importance of loading millions of records quickly and easily. Ask customer references about load experiences.

Disk storage ratio: This metric describes the ratio of the volume point of raw data and the actual total disk space used to run the benchmark. The higher the volume point of raw data, the more likely there may be excessive indexing designed specifically for these tests.

TPC-Ds do not measure:

Ad hoc queries: These are spontaneous queries or requests that are found in real world data warehousing sites. Most Decision Support Systems (DSSs) do not know what questions will be asked tomorrow. Queries mature as users mature.

Multi-user abilities: A system can be tuned to run well with one user but may not perform well with multiple users. Look at the query time throughput for a single user and the average time for multi-users. Compare the differences to a factor of the number of streams and make sure the stream factor is not zero.

Complexity of database setup: Because TPC-D benchmarks use predefined queries, some vendors tune the database and create synthetic indexing to increase the results. Look in the full disclosure report to see the complexity of the database setup in the Data Definition Language (DDL). Some vendors have used close to 400 pages of commands to set up eight tables. In the real world, this could take weeks or months. Other vendors require less than two pages to set up the same tables.

Why some vendors aren't published: If a data warehouse vendor has not published TPC-Ds at a certain volume point, they usually can't!

Proof of production environment success: Finding successful production site references with similar designs is extremely important.

Comparisons to different volume point results: A TPC-D benchmark at 300GB should be reasonably comparable with a benchmark at 100GB to show linear scalability. Most data warehouses double in size each year.

TPC-D numbers provide a measuring stick for the industry. You can see TPC results on the World Wide Web at http://www.tpc.org. When purchasing a Decision Support System check the TPC-D benchmark metrics, but also take a close look at the detail. Look through the camouflage, because it's a jungle out there.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access