Data warehousing isn’t a new idea; conceptually, it’s been around for 30 years. From its inception as a method of storing relevant data from relevant systems in a specific application used for querying, reporting and analysis, data warehousing has evolved into a technology that consolidates all of an organization’s information into one repository. An enterprise data warehouse (EDW) consists of multiple subject areas (finance, marketing, sales, etc.) and represents areas of interest for groups and individuals who examine the data across several subject areas.

What's new, however, are the technologies now available for building an EDW. More and more enterprise organizations are building an EDW stack where several open source solutions are deployed in parallel with traditional products, or even stacks made entirely of open source technologies, ensuring the best price/performance ratios and openness.

How can open source facilitate data warehousing, particularly at the enterprise level? First, let’s look at the technologies involved. A data warehouse comprises several interconnecting layers:

  • The database itself, which collects and manages the data for the data warehouse.
  • Data integration processes that extract, transform and load data from the application layer to the EDW.
  • Business intelligence tools to access information and to report and analyze data.
  • A number of peripheral features and tools, including a metadata layer, management technologies, etc.

By examining the major EDW domains we’ve just highlighted, we can identify some areas of concern and see how open source can alleviate these issues.

Database

A major concern surrounding EDW is the ever-expanding volume of data stored in the data warehouse. Today, companies have to deal with fast-growing, increasingly complex operational systems from which data needs to be processed and analyzed in real time. As the volume of data grows, so do the storage space and the number of processors required to store and process this data.

As data volumes increase, the cost of database licenses increases dramatically. This is where open source plays an interesting role. Data warehouses are commoditized now, and open source can dramatically bring down the cost of storing data in the data warehouse. In addition, some open source database technologies are plug-compatible with traditional, proprietary databases. Other open source databases have proven highly scalable in complex data management. If you add an open source storage engine to certain relational database management systems, you’ll get scalability and performance for analytics similar to those of a proprietary database that would cost you a lot more money.

ETL

ETL processes are the most critical - and value added - components of an EDW infrastructure. While generally invisible to users of the BI platform, ETL processes retrieve the data from all operational systems and preprocess it for the analysis and reporting tools. The accuracy and timeliness of the entire BI platform rely on the ETL processes to extract data from the operational system (databases, applications, Web services, etc.), process it (transformation, aggregations, lookups) and load it into the EDW. ETL processes not only access all the data sources needed by the BI and reporting applications, but actually prepare the data so that it can be efficiently processed in the data warehouse, increasing performance and dramatically decreasing response times of queries.
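To make the extract, transform and load steps described above concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a depiction of any particular ETL product: the sample CSV data, the `REGION_NAMES` lookup table, the `sales_by_region` table and all function names are hypothetical, and an in-memory SQLite database stands in for the EDW.

```python
import csv
import io
import sqlite3

# Hypothetical operational export. In practice the source could be a
# database, an application, a Web service or a file.
SOURCE_CSV = """order_id,region_code,amount
1,EU,100.50
2,US,200.00
3,EU,49.50
"""

# Hypothetical lookup table used during the transformation step.
REGION_NAMES = {"EU": "Europe", "US": "United States"}


def extract(text):
    """Extract: read raw rows from the operational source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: resolve region names (lookup) and sum amounts (aggregation)."""
    totals = {}
    for row in rows:
        region = REGION_NAMES.get(row["region_code"], "Unknown")
        totals[region] = totals.get(region, 0.0) + float(row["amount"])
    return sorted(totals.items())


def load(conn, records):
    """Load: write the prepared records into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_by_region "
        "(region TEXT PRIMARY KEY, total REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO sales_by_region VALUES (?, ?)", records
    )
    conn.commit()


def run_etl(conn, text=SOURCE_CSV):
    """Run the three steps in order and return the loaded rows."""
    load(conn, transform(extract(text)))
    return conn.execute(
        "SELECT region, total FROM sales_by_region ORDER BY region"
    ).fetchall()


if __name__ == "__main__":
    warehouse = sqlite3.connect(":memory:")  # stand-in for the EDW
    for region, total in run_etl(warehouse):
        print(region, total)
```

Because the aggregation happens before loading, a reporting query against `sales_by_region` reads a handful of pre-summarized rows instead of scanning every order, which is the kind of preparation that shortens query response times.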

One of the concerns associated with ETL is its ability to access all the operational data of the enterprise whether it is contained in proprietary or open source databases, packaged applications (such as MS Dynamics, SAP, SugarCRM), or applications provided under the SaaS model (SalesForce.com for example), or even through Web services or files. When the price of an ETL tool is based on the number of source and target systems, it’s inevitable that tier two or tier three applications will be ignored because of the licensing cost of the individual connectors.

Open source shines in the connectivity arena because there isn’t a specific cost for connectors, and there are a large number of connectors available. And, in this regard, it is impossible to overstate the value of the community - users can work collectively and create their own connectors if specific needs aren’t currently supported by an out-of-the-box ETL product.

A second domain in which open source ETL makes a difference is runtime pricing: it doesn’t charge per CPU of the runtime engine, which means there are no additional charges as data volumes increase and the need for processing power increases with them.

BI - Querying, Reporting and Analysis

BI is holding its own and often ranks either first or second in importance of business initiatives. Enterprises need to make the right decisions - to hang on to the right customers, build the right products and target the right geographies. In good times and bad, corporations want to grow sales and cut expenses. BI is the roadmap that promotes good decision-making in these arenas.

Open source BI offers a very compelling argument. Although not free, it’s a fraction of the cost you’d pay for a proprietary solution. And IT departments don’t have to justify significant up-front fees. You can try before you buy - download the software, build a prototype and test it before spending a cent. And the existing skills of the IT implementation team easily transfer to an open source offering.

The cost benefit has another ramification. Budget allocation ratios are distinctly different. While traditional BI implementations often allocate 80 percent of the budget for licensing and maintenance and 20 percent for services, an open source solution reverses this - typically 20 percent of the budget is allotted to licensing and maintenance and 80 percent to services. In addition, the open source BI solution is significantly more customized to meet individual customer needs, making it much more flexible and allowing faster return on more optimized investments. A driving factor for customer adoption has been a rebellion against vendor lock-in or unwarranted price increases.

Because virtually every BI implementation requires some kind of interface with legacy systems, open source is a natural fit. Plug-ins are a typical solution, and having access to the source code offers a major advantage. Proprietary models tend to isolate developers from end users.

Open Source Solutions Today

In just a few short years, open source has evolved from something “geeky” into an enterprise-ready solution. However, it’s important to look past the idea that today’s open source solutions are sufficiently feature-rich to meet user requirements and examine the need to have a vendor who can support and extend these solutions. If you’re building mission-critical systems, you need to select a partner carefully. The key is sustainability. Open source is definitely mature enough and robust enough to handle EDW. The defining acceptance criterion lies in determining which companies (and products) will be here for the long haul - and that will be those with well-developed business strategies surrounding service, support and value-added software products. Successful EDW requires a network of software, hardware, consultants and developers to support customers and manage the effort over the time it takes to get a reliable ROI from an EDW project.

The community is also a big draw for many customers, and a significant number of open source users would rather call on the community for help addressing issues than get support from a dedicated service. This lets them reduce the cost of support and decrease their data integration budget, and the return they get from the community is comparable in quality to traditional support from a proprietary vendor. Because the development cycle of open source applications is usually quite short, users know that the chances of getting a feature request developed and made available in the next release of an open source application are significantly greater than for a similar request in the proprietary domain.

Open source is ready for EDW. Because open source is designed to be modular, an enterprise can start with one piece - say ETL or reporting - and can add on as needed. For comparable power and features an open source solution in this arena can cost 10 to 20 times less than a proprietary product. Whether large or small, companies today are being asked to do more with less. With open source, you can have an EDW without compromise.