Sid Adelman's Answer: Your cost will be based on many more factors other than just the size of the data warehouse. Your cost will be based on:
- The functions you are delivering
- The number and characteristics of the source files, levels of quality, and requirement for quality.
- The size of the database
- The number of users, both concurrent and total
- What software you buy and support (ETL, BI, data cleansing, DBMS, etc.)
- Your need for consultants and contractors - this will be a function of your skills levels and the schedule for implementation
- Your service level agreements for performance and availability.
- The architecture of your EDW, use of ODS, staging areas, data marts, metadata
- How well you plan to support your users
- What training you intend to provide for your users
- The training required for your IT staff
It's important to include all these costs as you establish a budget for the EDW. If the other costs are not included, they will not be allocated and you will have a substandard EDW or, if the moneys are allocated, you will be criticized for going over budget.
Les Barbusinski's Answer: Whoa! That's a tall order, but here goes. Exact infrastructure requirements and costs are dependent on a lot of factors including:
- The purpose of the data warehouse (e.g., BI vs. data integration vs. compliance, etc.) as well as the functionality that is required (e.g., 24x7 availability, real-time vs. batch data acquisition, etc.)
- Refresh frequencies and latency requirements
- Projected transaction volumes and concurrency requirements
- Service level agreements
- How much of the required HW/SW is already in house
- Shop standards
- The degree of competition between potential vendors
Hence, I can't give you exact system configurations or cost estimates. However, I can give you a list of infrastructure components that you'll need to assess in order to lay the groundwork for your terabyte data warehouse. This includes:
- Application Servers: You'll need a collection of different application servers to host the layers of application services that your data warehouse will need including ETL, Business Intelligence, Reporting, Messaging, Web Portal and Web Services, Business Rules Engines, etc. Each application will have an optimal configuration that you'll need to address. You'll need to size the server(s) for each application layer (i.e. number of servers, number of processors, RAM, etc.) in order to handle the expected transaction volumes. You'll also have to address the need (if any) for clustering, active/passive failover, 24x7 availability, etc. Also, don't forget to plan for different environments (e.g. development, test, and production). Some of these servers will be UNIX, while others will be Windows ... depending on the application.
- Database Servers: You'll need a collection of database servers to handle all of the database layers inherent in a data warehouse (e.g. historical data store, operational data stores, data marts, extract areas, etc.). First, you'll need to decide on how best to distribute your database nodes (e.g. all centralized, de-centralized data marts, de-centralized ODS instances, etc.), then size each node (i.e. CPUs and RAM per node, servers per node, etc.). As with the app servers, you'll have to address such issues as failover, clustering, 24x7 availability, etc. ... as well as make provisions for handling multiple environments. Keep in mind that your choice of RDBMS will - in large part - dictate your database server requirements.
- Disk (SAN) Space: You've already stated that you need to store roughly one terabyte of usable data. Now you need to plan the space requirements for such things as indexes, logs, sort-work areas, staging areas, distributed data marts, extract areas, etc. You also have to make adequate provisions for replication, mirroring, and database backup (i.e. most backups are done to disk then spooled to tape asynchronously). Finally, you'll have to adjust all these estimates to handle the expected growth rate (taking into account whatever archiving scheme you have).
- Tape Drives: Database and file backups from terabyte-plus sized data warehouses are usually spooled to streaming RAID-5 tape arrays. You'll need to assess the frequency and volume of these backups, choose an appropriate model of tape array, plan for an adequate supply of tapes, and insure adequate bandwidth between the servers being backed up and the tape arrays.
- Archiving Devices: The growth of a data warehouse must be controlled in order to assure adequate performance. Aged data that is no longer used, must be periodically culled and archived from the database. Normally, this function is performed with tape drives. However, you may want to consider archiving to a CD-ROM "jukebox" if there is a chance that the data will have to be (a) restored, or (b) retrieved at some point in the future.
- Networks: Obviously, you need to assess your company's LANs and WANs to insure that you have adequate bandwidth for (a) pulling data from source systems into the data warehouse, (b) replicating tables or refreshing materialized views to distributed nodes, and (c) handling query and reporting traffic from end users. You also need to insure that your network is sufficiently redundant to insure that the loss of a single trunk line will not disable your whole system. Also, if any portion of your data warehouse will be exposed to the Internet, you need to assess your need for security devices such as firewalls.
From an infrastructure perspective, all you really need to know about the software is a) the number of licenses that will be needed (remember, three environments ... development, test and production), b) which server(s) they'll be hosted on, and c) their cost (which is usually related to the size of the server they'll run on). Some of the software that you'll need to evaluate, price, and procure for your data warehouse include the following:
- Database (e.g., Oracle, DB2, SQL Server, Teradata, et al)
- ETL/Data Integration (e.g., Ascential, Informatica, et al)
- Business Intelligence (e.g., Microstrategy, Essbase, et al)
- Operational Reporting (e.g., Actuate, Crystal, et al)
- Web Portal (e.g., Plumtree, Sharepoint, et al)
- Messaging (e.g., WebSphere MQ, BEA WebLogic, et al)
- Data Mining (e.g., Clementine, IBM Data Miner, SAS, et al)
- Job Scheduling (e.g., AutoSys, Tivoli, et al)
- Backup (e.g., Veritas, Legato, et al)
Pheeeeew! Hope that helps.