OCT 14, 2004 1:00am ET

Related Links

Which CDC method is the best to achieve staging database with changed data?
March 7, 2008
Apart from bloated dimension, what are the negatives of using all known attributes in your SCD?
March 7, 2008
When is it better to have normalized data to create data marts and when is it better to have dimensional data?
March 7, 2008

Web Seminars

Data Replication for Real-time (Big) Data Warehousing
Available On Demand
Improving your Overall Analytical Environment by Migrating to a New Data Warehouse Platform
Available On Demand
The Dynamic Duo of Data Warehousing and Real-Time Streams
Available On Demand

When planning for an enterprise data warehouse what types of investment should be planned for in the way of technology infrastructure?

Print
Reprints
Email

Q:  

When planning for an enterprise data warehouse, what types of investment should be planned for in the way of technology infrastructure - mainly computer hardware and software platforms (servers, databases, storage, OS, etc...)? Let's say this is a 1 terabyte EDW to start, please estimate requirements and rough costs.

A:  

Sid Adelman's Answer: Your cost will be based on many more factors other than just the size of the data warehouse. Your cost will be based on:

  1. The functions you are delivering
  2. The number and characteristics of the source files, levels of quality, and requirement for quality.
  3. The size of the database
  4. The number of users, both concurrent and total
  5. What software you buy and support (ETL, BI, data cleansing, DBMS, etc.)
  6. Your need for consultants and contractors - this will be a function of your skills levels and the schedule for implementation
  7. Your service level agreements for performance and availability.
  8. The architecture of your EDW, use of ODS, staging areas, data marts, metadata
  9. How well you plan to support your users
  10. What training you intend to provide for your users
  11. The training required for your IT staff

It's important to include all these costs as you establish a budget for the EDW. If the other costs are not included, they will not be allocated and you will have a substandard EDW or, if the moneys are allocated, you will be criticized for going over budget.

Les Barbusinski's Answer: Whoa! That's a tall order, but here goes. Exact infrastructure requirements and costs are dependent on a lot of factors including:

  • The purpose of the data warehouse (e.g., BI vs. data integration vs. compliance, etc.) as well as the functionality that is required (e.g., 24x7 availability, real-time vs. batch data acquisition, etc.)
  • Refresh frequencies and latency requirements
  • Projected transaction volumes and concurrency requirements
  • Service level agreements
  • How much of the required HW/SW is already in house
  • Shop standards
  • The degree of competition between potential vendors

Hence, I can't give you exact system configurations or cost estimates. However, I can give you a list of infrastructure components that you'll need to assess in order to lay the groundwork for your terabyte data warehouse. This includes:

Hardware

  • Application Servers: You'll need a collection of different application servers to host the layers of application services that your data warehouse will need including ETL, Business Intelligence, Reporting, Messaging, Web Portal and Web Services, Business Rules Engines, etc. Each application will have an optimal configuration that you'll need to address. You'll need to size the server(s) for each application layer (i.e. number of servers, number of processors, RAM, etc.) in order to handle the expected transaction volumes. You'll also have to address the need (if any) for clustering, active/passive failover, 24x7 availability, etc. Also, don't forget to plan for different environments (e.g. development, test, and production). Some of these servers will be UNIX, while others will be Windows ... depending on the application.
  • Database Servers: You'll need a collection of database servers to handle all of the database layers inherent in a data warehouse (e.g. historical data store, operational data stores, data marts, extract areas, etc.). First, you'll need to decide on how best to distribute your database nodes (e.g. all centralized, de-centralized data marts, de-centralized ODS instances, etc.), then size each node (i.e. CPUs and RAM per node, servers per node, etc.). As with the app servers, you'll have to address such issues as failover, clustering, 24x7 availability, etc. ... as well as make provisions for handling multiple environments. Keep in mind that your choice of RDBMS will - in large part - dictate your database server requirements.
  • Disk (SAN) Space: You've already stated that you need to store roughly one terabyte of usable data. Now you need to plan the space requirements for such things as indexes, logs, sort-work areas, staging areas, distributed data marts, extract areas, etc. You also have to make adequate provisions for replication, mirroring, and database backup (i.e. most backups are done to disk then spooled to tape asynchronously). Finally, you'll have to adjust all these estimates to handle the expected growth rate (taking into account whatever archiving scheme you have).
  • Tape Drives: Database and file backups from terabyte-plus sized data warehouses are usually spooled to streaming RAID-5 tape arrays. You'll need to assess the frequency and volume of these backups, choose an appropriate model of tape array, plan for an adequate supply of tapes, and insure adequate bandwidth between the servers being backed up and the tape arrays.
  • Archiving Devices: The growth of a data warehouse must be controlled in order to assure adequate performance. Aged data that is no longer used, must be periodically culled and archived from the database. Normally, this function is performed with tape drives. However, you may want to consider archiving to a CD-ROM "jukebox" if there is a chance that the data will have to be (a) restored, or (b) retrieved at some point in the future.
  • Networks: Obviously, you need to assess your company's LANs and WANs to insure that you have adequate bandwidth for (a) pulling data from source systems into the data warehouse, (b) replicating tables or refreshing materialized views to distributed nodes, and (c) handling query and reporting traffic from end users. You also need to insure that your network is sufficiently redundant to insure that the loss of a single trunk line will not disable your whole system. Also, if any portion of your data warehouse will be exposed to the Internet, you need to assess your need for security devices such as firewalls.
Filed under:

Advertisement

Comments (0)

Be the first to comment on this post using the section below.

Add Your Comments:
You must be registered to post a comment.
Not Registered?
You must be registered to post a comment. Click here to register.
Already registered? Log in here
Please note you must now log in with your email address and password.
Twitter
Facebook
LinkedIn
Login  |  My Account  |  White Papers  |  Web Seminars  |  Events |  Newsletters |  eBooks
FOLLOW US
Please note you must now log in with your email address and password.