What's new, however, are the technologies now available for building an EDW. More and more enterprise organizations are building an EDW stack where several open source solutions are deployed in parallel with traditional products, or even stacks made entirely of open source technologies, ensuring the best price/performance ratios and openness.
How can open source facilitate data warehousing, particularly at the enterprise level? First, let's look at the technologies involved. A data warehouse comprises several interconnecting layers:
- The database itself, which collects and manages the data for the data warehouse.
- Data integration processes that extract, transform and load data from the application layer to the EDW.
- Business intelligence tools to access information and to report and analyze data.
- A number of peripheral features and tools, including a metadata layer, management technologies, etc.
By examining the major EDW domains we've just highlighted, we can identify some areas of concern and see how open source can alleviate these issues.
Database
A major concern surrounding EDW is the ever-expanding volume of data stored in the data warehouse. Today, companies have to deal with fast-growing, increasingly complex operational systems from which data needs to be processed and analyzed in real time. As the volume of data grows, so do the storage space and the number of processors required to store and process this data.
As data volumes increase, the cost of database licenses increases dramatically. This is where open source plays an interesting role. Data warehouses are commoditized now, and open source can dramatically bring down the cost of storing data in the data warehouse. In addition, some open source database technologies are plug-compatible with traditional, proprietary databases. Other open source databases have proven their high scalability in complex data management. If you add an open source storage engine to certain relational database management systems, you'll get scalability and analytics performance similar to those of a proprietary database that would cost you a lot more money.
ETL
ETL processes are the most critical - and value-added - components of an EDW infrastructure. While generally invisible to users of the BI platform, ETL processes retrieve the data from all operational systems and preprocess it for the analysis and reporting tools. The accuracy and timeliness of the entire BI platform rely on the ETL processes to extract data from the operational systems (databases, applications, Web services, etc.), process it (transformations, aggregations, lookups) and load it into the EDW. ETL processes not only access all the data sources needed by the BI and reporting applications, but actually prepare the data so that it can be efficiently processed in the data warehouse, increasing performance and dramatically decreasing query response times.
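The extract, process and load steps described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a made-up list of orders as the operational source, a made-up customer-to-region lookup, and an in-memory SQLite database standing in for the EDW; the names `ORDERS`, `CUSTOMER_REGION`, `sales_by_region` and `run_etl` are hypothetical and do not come from any particular ETL product.

```python
import sqlite3
from collections import defaultdict

# Hypothetical operational data. In practice these rows would come from
# databases, applications, Web services or files.
ORDERS = [
    {"order_id": 1, "customer_id": "C1", "amount": "120.50"},
    {"order_id": 2, "customer_id": "C2", "amount": "80.00"},
    {"order_id": 3, "customer_id": "C1", "amount": "39.50"},
]

# Hypothetical lookup table mapping customers to sales regions.
CUSTOMER_REGION = {"C1": "EMEA", "C2": "APAC"}


def extract():
    """Extract: read the raw rows from the (assumed) operational source."""
    return list(ORDERS)


def transform(rows):
    """Transform: look up each customer's region and aggregate amounts per region."""
    totals = defaultdict(float)
    for row in rows:
        region = CUSTOMER_REGION.get(row["customer_id"], "UNKNOWN")
        totals[region] += float(row["amount"])
    return sorted(totals.items())


def load(conn, aggregated):
    """Load: write the prepared rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_by_region "
        "(region TEXT PRIMARY KEY, total REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO sales_by_region VALUES (?, ?)", aggregated
    )
    conn.commit()


def run_etl(conn):
    """Run the three steps in order and return what ended up in the warehouse."""
    load(conn, transform(extract()))
    return conn.execute(
        "SELECT region, total FROM sales_by_region ORDER BY region"
    ).fetchall()


if __name__ == "__main__":
    warehouse = sqlite3.connect(":memory:")  # stand-in for the EDW
    print(run_etl(warehouse))
```

Because the lookup and aggregation happen before loading, the reporting layer in this sketch can read a small, pre-aggregated table instead of scanning raw orders, which is the kind of preparation that shortens query response times.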
One of the concerns associated with ETL is its ability to access all the operational data of the enterprise, whether it is contained in proprietary or open source databases, packaged applications (such as MS Dynamics, SAP, SugarCRM), applications provided under the SaaS model (SalesForce.com, for example), or even Web services or files. When the price of an ETL tool is based on the number of source and target systems, it's inevitable that tier two or tier three applications will be ignored because of the licensing cost of the individual connectors.
Open source shines in the connectivity arena because there isn't a specific cost for connectors, and there are a large number of connectors available. And, in this regard, it is impossible to overstate the value of the community - users can work collectively and create their own connectors if specific needs aren't currently supported by an out-of-the-box ETL product.
A second area in which open source ETL makes a difference is that it doesn't charge per CPU of the runtime engine, which means that as data volumes increase, there are no additional charges even as the need for processing power increases.
BI - Querying, Reporting and Analysis
BI is holding its own and often ranks either first or second in importance among business initiatives. Enterprises need to make the right decisions - to hang on to the right customers, build the right products and target the right geographies. In good times and bad, corporations want to grow sales and cut expenses. BI is the roadmap that promotes good decision-making in these arenas.
Open source BI offers a very compelling argument. Although not free, it's a fraction of the cost you'd pay for a proprietary solution. And IT departments don't have to justify significant up-front fees. You can try before you buy - download the software, build a prototype and test it before spending a cent. And the existing skills of the IT implementation team easily transfer to an open source offering.
The cost benefit has another ramification: budget allocation ratios are distinctly different. While traditional BI implementations often allocate 80 percent of the budget to licensing and maintenance and 20 percent to services, an open source solution reverses this - typically 20 percent of the budget is allotted to licensing and maintenance and 80 percent to services. In addition, the open source BI solution can be significantly more customized to meet individual customer needs, making it much more flexible and allowing a faster return on a better-optimized investment. A driving factor for customer adoption has been a rebellion against vendor lock-in and unwarranted price increases.









