
Is there a standard that can be used as a guideline for sizing and capacity?

By Joe Oates, Chuck Kelley, Michael Jennings and Les Barbusinski
Published July 2, 2003, 1:00am EDT

Q:

Is there a standard that can be used as a guideline for sizing and capacity? I've been reading many sources of information regarding database and cube sizing, but I am not clear on exactly how this is calculated and measured with regard to server and database resources. For example, please review the following:

  • Dim_Time: 144 rows, 132-byte record length = 0.019 MB
  • Dim_Org: 38 rows, 190-byte record length = 0.007 MB
  • Dim_Services: 50 rows, 100-byte record length = 0.005 MB
  • Fact_Table: 500,000 rows, 200-byte record length = 100 MB

Is the database taking up 100.031 MB of space? How is the cube space calculated? What impact do concurrent users have, and how is that measured?
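For reference, those figures are simply rows multiplied by record length, with 1 MB taken as 1,000,000 bytes. A quick Python sketch of that arithmetic:

    # Raw data size: rows x record length (bytes), summed across tables.
    tables = {
        "Dim_Time":     (144,     132),
        "Dim_Org":      (38,      190),
        "Dim_Services": (50,      100),
        "Fact_Table":   (500_000, 200),
    }

    total_bytes = sum(rows * row_len for rows, row_len in tables.values())
    print(total_bytes / 1_000_000)   # ~100.031 MB of raw data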

A:

Les Barbusinski's Answer: In your example, the actual size of your database is probably closer to 200 MB. The reason is that you have to factor in a) block size, b) free space, c) extent size and d) indexes.

Block size is the physical size of the data block that contains the rows. RDBMSs do not store records individually; they group them into blocks (i.e., "n" rows per block) that are then written to, or retrieved from, the hard disk as a unit. Block size is fixed at the time a database is created and is usually set at 16K, 32K or 64K. Free space is the percentage of each block set aside by the DBA for future insert/update activity. If, for example, free space is set at 20 percent, each block will initially be loaded to 80 percent of its capacity. Extents are the physical units of disk space that are allocated by the operating system (i.e., UNIX or Windows) to specific tables and/or indexes. Extents can be "primary" or "secondary." Primary extents are usually sized to contain all of the rows currently in a table or index, with some allowance for short-term growth (say, six months). Secondary extents are incremental blocks of disk space that are allocated to a table or index as the primary extent is exceeded. Indexes are similar to tables in that each requires its own space, based on its row length, free space and extent settings.

Now let's apply all of this to your example using the following assumptions: 1) block size equals 32K, 2) 20 percent free space and 3) 10 percent short-term growth. Let's further assume that each dimension table has a single 10-byte numeric primary key and that the fact table has four indexes: a primary key index consisting of the three foreign keys from the dimensions, plus three individual FK indexes. Given these assumptions, your calculations should look something like this:

Data Space Requirements

  • DIM_TIME: 144 rows x 1.10 (i.e., 10% growth) = 158 rows
    32,768 bytes per block/132 bytes per row = 248 rows per block
    248 rows per block * 80% (i.e., 20% free space) = 198 rows per block
    158 rows/198 rows per block = 1 block (rounded "up")
    1 block x 32,768 bytes per block = 32,768 bytes (primary extent)
  • DIM_ORG: 1 block x 32,768 bytes per block = 32,768 bytes (primary extent)
  • DIM_SERVICES: 1 block x 32,768 bytes per block = 32,768 bytes (primary extent)
  • FACT_TABLE: 4,199 blocks x 32,768 bytes per block = 137,592,832 bytes (primary extent)
  • Total Data: ( (32,768 * 3) + 137,592,832 )/1,024,000 = 134.46 MB

Index Space Requirements

  • DIM_TIME: 144 rows x 1.10 (i.e., 10% growth) = 158 rows
    32,768 bytes per block / 10 bytes per row = 3,276 rows per block
    3,276 rows per block * 80% (i.e., 20% free space) = 2,621 rows per block
    158 rows/2,621 rows per block = 1 block (rounded "up")
    1 block x 32,768 bytes per block = 32,768 bytes (PK index)
  • DIM_ORG: 1 block x 32,768 bytes per block = 32,768 bytes (PK index)
  • DIM_SERVICES: 1 block x 32,768 bytes per block = 32,768 bytes (PK index)
  • FACT_TABLE: 631 blocks x 32,768 bytes per block = 20,676,608 bytes (PK index)
    210 blocks x 32,768 bytes per block = 6,881,280 bytes (FK index #1)
    210 blocks x 32,768 bytes per block = 6,881,280 bytes (FK index #2)
    210 blocks x 32,768 bytes per block = 6,881,280 bytes (FK index #3)
  • Total Indexes: ( (32,768 * 3) + (6,881,280 * 3) + 20,676,608 ) / 1,024,000 = 40.45 MB

So the total amount of space you must allocate for your database is 174.91 MB (i.e., 134.46 + 40.45). Granted, this is an oversimplified example, but you get the idea (hope I got the math right). Your DBA can provide you with specific formulas for estimating space requirements in your shop.
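If you would rather script this estimate than work it by hand, the same calculation can be expressed as a short routine. Here is a minimal Python sketch under the assumptions used above (32K blocks, 20 percent free space, 10 percent growth, and 1 MB taken as 1,024,000 bytes, as in the totals); your DBA's formulas will be more precise:

    import math

    BLOCK_BYTES = 32_768      # 32K block size
    FREE_SPACE  = 0.20        # 20% of each block left free
    GROWTH      = 0.10        # 10% short-term growth allowance
    MB          = 1_024_000   # divisor used in the totals above

    def segment_blocks(rows, row_bytes):
        """Blocks needed for one table or index segment."""
        rows = math.ceil(rows * (1 + GROWTH))
        rows_per_block = math.floor((BLOCK_BYTES / row_bytes) * (1 - FREE_SPACE))
        return math.ceil(rows / rows_per_block)

    def segment_bytes(rows, row_bytes):
        return segment_blocks(rows, row_bytes) * BLOCK_BYTES

    # Data space: three dimension tables plus the fact table
    data = (segment_bytes(144, 132) + segment_bytes(38, 190)
            + segment_bytes(50, 100) + segment_bytes(500_000, 200))

    # Index space: a 10-byte PK index per dimension, a 30-byte PK index on
    # the fact table (three 10-byte foreign keys) and three 10-byte FK indexes
    indexes = (segment_bytes(144, 10) + segment_bytes(38, 10) + segment_bytes(50, 10)
               + segment_bytes(500_000, 30)
               + 3 * segment_bytes(500_000, 10))

    print(data / MB, indexes / MB, (data + indexes) / MB)   # ~134.46, ~40.45, ~174.91 MB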

As for estimating cube sizes, that varies by vendor and by platform. Cubes are proprietary and internal architectures differ, so there is no standard formula. Check with your BI vendor.

The number of concurrent users on a database has no effect on storage requirements. It does, however, impact server capacity, as measured by total CPU horsepower (i.e., CPU count x MHz) and RAM. Estimating this impact on server capacity is an art form, not a science. Performance varies by server platform, operating system and RDBMS. The easiest way to estimate your capacity requirements is to benchmark the performance of similar servers in your shop. Observe the server's performance while asking yourself the following questions:

  • How much RAM does each concurrent user session grab and hold?
  • How much RAM is used (min, max, mean) with 10 users logged into the server and one active RDBMS thread? With 50 users? With 100 users? How much RAM is used with 10 users and three active threads? And so on. Plot RAM usage against concurrent users and threads. What does this growth curve look like?
  • What is the CPU utilization (min, max, mean) with 10 users logged into the server and one active RDBMS thread? With 50 users? With 100 users? What is the CPU utilization with 10 users and three active threads? And so on. Plot CPU utilization against concurrent users and threads. What does this growth curve look like?

Once you've ascertained how your chosen server is likely to react to various loads, it's time to estimate your DW application's "load":

  • How many concurrent users will your DW have to sustain (at low, medium and peak levels of activity)?
  • What is the complexity of the average query that will be run by your users?
  • What is the number and complexity of your ETL processes (especially real-time ETL processes that will run during prime time)?

Ultimately, when you apply your estimated "loads" against your server's performance curves, you should be able to make a fairly accurate SWAG at your server capacity requirements.
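Benchmark data like this lends itself to a simple projection: fit a curve to the observed usage and read off the value at your expected peak concurrency. A minimal Python sketch follows, in which every measurement is a hypothetical placeholder to be replaced with numbers observed on your own servers:

    # Project benchmark observations to an estimated peak load.
    # All measurements here are hypothetical placeholders.
    from statistics import linear_regression

    users   = [10, 50, 100]        # concurrent users benchmarked
    ram_gb  = [1.2, 4.8, 9.5]      # observed RAM usage in GB (hypothetical)
    cpu_pct = [8.0, 35.0, 70.0]    # observed CPU utilization (hypothetical)

    # Fit a straight-line growth curve to each set of observations;
    # substitute a different curve if your plots are clearly non-linear.
    ram_slope, ram_icept = linear_regression(users, ram_gb)
    cpu_slope, cpu_icept = linear_regression(users, cpu_pct)

    peak_users = 150               # estimated peak concurrency for the DW
    print(ram_slope * peak_users + ram_icept)   # projected RAM (GB) at peak
    print(cpu_slope * peak_users + cpu_icept)   # projected CPU % at peak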
Hope this helps.

Michael Jennings' Answer: The table below (see attached spreadsheet also) provides a simple method for calculating the size of a dimensional model and forecasting annual growth. I have inserted the dimension table information you provided in your question into the spreadsheet (the number of rows for the fact table was erroneous). Using the row size, estimated initial rows for dimension tables, predicted annual growth and index size (if used), the spreadsheet will provide sizing estimates for the maximum possible fact table size, the dimension tables and totals for the database. As with all tools, your own specific knowledge and experience should be applied to the final database size estimates.

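A rough Python sketch of the same approach: take row size, initial rows, an index allowance and a predicted annual growth rate, and project the table size out over a planning horizon. The 25 percent annual growth and one-to-one index allowance below are illustrative assumptions, not figures from the spreadsheet:

    # Project a table's size forward from row size, initial rows, an index
    # allowance and an assumed annual growth rate (both assumptions here
    # are illustrative, not taken from the original spreadsheet).
    def projected_mb(rows, row_bytes, years, annual_growth=0.25, index_factor=1.0):
        rows_at_year = rows * (1 + annual_growth) ** years
        data_bytes = rows_at_year * row_bytes
        return data_bytes * (1 + index_factor) / 1_000_000

    # Fact table from the question, projected three years out
    print(projected_mb(500_000, 200, years=3))   # ~390.6 MB including indexes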

Chuck Kelley's Answer: Sorry, but I don't know of any guideline for sizing and capacity; the calculations will depend on the products you use. On the database side of the equation (assuming you are doing this in a relational database), your calculations are almost correct. The raw data is 100.031 MB, but the database will take up more space. There is overhead for each page in the database, each file of the database, system tables and indexes, database logs/journals and other like items. Generally this is not much; maybe 10+ MB will do. Next you need to consider temporary space. Depending on the number of users, I think 100 percent extra space is reasonable (for smaller databases), though when you get into the hundreds of gigabytes, the accountants will start screaming. In your case, it probably won't be much of an issue. Next you need to add space for indexes (and surrogate keys, if you use them and they were not in your original raw data estimate). I use the "seat of the pants" rule of doubling the amount of raw data; in your case, that will be another 100 MB. Again, this will depend on the database vendor and on whether you choose indexing strategies other than B-tree indexes (the most common index type in databases).
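Put together, that rule of thumb (raw data, roughly 10 MB of overhead, 100 percent temporary space and a doubling of the raw data for indexes) works out as follows; a minimal sketch using the figures from the question:

    # Rule-of-thumb estimate from the answer above: raw data + ~10 MB of
    # overhead + 100% temporary space + "double the raw data" for indexes.
    raw_mb      = 100.031          # raw data size from the question
    overhead_mb = 10.0             # pages, system tables, logs/journals, etc.
    temp_mb     = raw_mb           # 100 percent extra temporary space
    index_mb    = raw_mb           # doubling the raw data for indexes

    print(raw_mb + overhead_mb + temp_mb + index_mb)   # ~310 MB working estimate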

Another thing you will need to consider is whether you will have aggregate (or summary) tables. For the size of your data, they will probably not be needed, but if you decide to build them, you will have to do the same calculation for those tables as you did for the raw data tables.

Then you will want to consider the staging area. The staging area is where the data comes into your system; sometimes it is in the database, sometimes it is flat files. Either way, there will be some space that you will need in order to run the ETL products, and you will want to make sure you have enough of it. Depending on the types of transformations and source feeds you get, this could be quite significant.

Now as for cube sizing, that will be highly dependent on the product that you use and the density of your cube. Your best bet is to take some sample data and have the vendors give you a "seat of the pants" calculation.

Joe Oates' Answer: I don't know what database you will be using, so I can only give some general observations. I do not see where you have allowed for indexes, sort space, staging area space, etc. You must consider these, and perhaps others, when you do database sizing.
