Business Intelligence in the Clouds
InfoManagement Direct, March 5, 2009
The notion of tackling the compute- and network-intensive demands of data warehousing using the seemingly lightweight services of cloud computing may sound counterintuitive. But todays cutting-edge data warehousing technologies actually resemble cloud computing infrastructure and services. Consider, for example, how both employ shared-nothing grids that can easily scale out to add capacity, how their similar service-oriented architectures provide plug-and-play data sources, that both possess rich Web 2.0 analytics and visualization capabilities. The cloud is, in fact, a natural platform for the future of data warehousing. Cloud computing based on pay-as-you-go hardware infrastructure like Amazon or Google offers business intelligence users new data warehouse options that bring unlimited scalability without traditional data center overhead and budget constraints.
Wikipedia says that cloud computing is a general concept that incorporates software as a service (SaaS), Web 2.0 and other recent, well-known technology trends, in which the common theme is reliance on the Internet for satisfying the computing needs of the users. Cloud computing arose as companies like Amazon and Google recognized that the excess capacity of their immense infrastructure could be sold to others for a profit. Cloud computings pay-as-you-go business model is analogous to public utilities where consumers pay only for the energy or water they use. The cost of the underlying infrastructure is incorporated into the units of consumption such as storage and CPU time.
Advertisement
Another flavor of cloud computing combines the computing resources with applications, such as Salesforce.com, Google Apps and Microsofts cloud-based SharePoint and Exchange offerings.
Some kinds of database architectures are better suited for cloud computing than others. Since data warehousing typically involves large data volumes, scalability is key. The cloud provides a virtually unlimited pool of computing power, storage and memory. However, these resources are delivered in discrete modules. Each node consists of standard units of processing power, storage space and memory. While the amounts may vary (by service provider, price point, etc.) and they may increase over time, the clouds pool of resources is a large grid of interchangeable, industry-standard computing resources. Achieving true scalability requires a database architecture that can fully maximize this pool of resources. A shared nothing, massively parallel database architecture is particularly designed to take advantage of multiple units of computing resources. Heres why:
Data Warehousing in the Cloud
Customers have lots of cloud-based data warehouse options. From pure SaaS or DaaS (data-as-a-service) offerings which provide a full software stack, to PaaS (platform-as-a-service) and IaaS (infrastructure-as-a-service) solutions on which you can build your own data warehouse, the cloud has very quickly shown to be fertile ground for managing increasingly large volumes of data.
Several SaaS or DaaS providers offer data analysis services. Using these services, you can employ powerful, full-stack data warehousing technologies with little effort and pay just for what you use. Companies such as 1010data, LogiXML and LucidEra offer various focused solutions for everything from a really big spreadsheet in the cloud to frameworks with built-in extract, transform and load and dashboards all the way through fully developed analysis tools customized for different verticals.
These solutions require no large outlay to get started. You can sign up using a Web browser, in some cases with a free trial, and start uploading data right away. These full-stack solutions include ETL tools for migrating data and automatically building out visualizations to slice and dice your data.
If you need full control over your data warehousing, or the volumes are larger than SaaS providers can handle, you have the option of rolling your own. If building a data warehouse sounds daunting, building one in the cloud would seemingly only complicate matters. But, in fact, the cloud is simple by comparison.
PaaS providers, such as Google Apps, are still not ready to support large data sets and complex analysis. Googles storage and query model supports basic data management and retrieval but cant handle the complex analysis available in modern database management systems. But expect this capability to be offered in the future.
IaaS providers like Amazon, GoGrid and Rackspace offer on-demand scalable machine images from which you can build your own data warehouse in the cloud. Available by the hour, day or month, you rent machine time on private virtualized instances typically running Linux or Windows. Images may come preconfigured with Web servers, application servers and database servers just as you would set up internally for your database based applications.
Third-party providers, such as Elastra and RightScale, have created sophisticated dashboards and management tools around these services, offering world class tools for deploying and managing your virtualized data center. These services cut the process of hardware provisioning, software installation and configuration of database applications from weeks to a few mouse clicks. Indeed, one of the biggest advantages of native cloud solutions is near instant availability.
Cloud infrastructure providers offer you an opportunity to craft a best-of-breed solution for your data warehouse at a fraction of the cost of doing it in house, with availability, scalability and pricing at levels only achieved by the largest data centers in the world.
Best of Breed
With full-stack solutions already prepackaged and available for order, what can a best-of-breed solution offer? For small or medium-sized businesses looking at less than 100GB of data, prepackaged solutions may be sufficient. If you have already preaggregated your data and are looking for a high level view of company health or larger scoped analysis, then a SaaS solution gives you quick turnaround time and high value.
If, on the other hand, your data volumes are starting to grow into the half-terabyte or even terabyte-plus range and your users are demanding more sophisticated capabilities like building their own dashboards and creating OLAP cubes with drill-down capabilities, then a best-of-breed stack can give you a much more powerful solution for little additional cost and effort. By combining a cloud-based analytic database with a flexible BI package, you can affordably manage multiterabyte data warehouses without the hefty price tag and build out whats typically associated with complex data management.
Page 1 of 2.






