The notion of tackling the compute- and network-intensive demands of data warehousing with the seemingly lightweight services of cloud computing may sound counterintuitive. But today’s cutting-edge data warehousing technologies actually resemble cloud computing infrastructure and services. Consider, for example, how both employ shared-nothing grids that can easily scale out to add capacity, how their similar service-oriented architectures provide plug-and-play data sources, and how both possess rich Web 2.0 analytics and visualization capabilities. The cloud is, in fact, a natural platform for the future of data warehousing. Cloud computing built on pay-as-you-go hardware infrastructure, such as Amazon’s or Google’s, offers business intelligence users new data warehouse options that bring virtually unlimited scalability without traditional data center overhead and budget constraints.

Wikipedia says that cloud computing is “a general concept that incorporates software as a service (SaaS), Web 2.0 and other recent, well-known technology trends, in which the common theme is reliance on the Internet for satisfying the computing needs of the users.” Cloud computing arose as companies like Amazon and Google recognized that the excess capacity of their immense infrastructure could be sold to others for a profit. Cloud computing’s pay-as-you-go business model is analogous to that of public utilities, where consumers pay only for the energy or water they use. The cost of the underlying infrastructure is incorporated into the units of consumption, such as storage and CPU time.
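To make the utility analogy concrete, a bill under this model is simply consumption multiplied by unit rates. Here is a minimal sketch in Python; the rates are hypothetical, not any provider’s actual pricing:

```python
# Hypothetical pay-as-you-go rates; real pricing varies by provider.
STORAGE_RATE_PER_GB_MONTH = 0.15   # dollars per GB stored per month
CPU_RATE_PER_HOUR = 0.10           # dollars per instance-hour

def monthly_cost(storage_gb, instance_hours):
    """You pay only for the units you actually consume."""
    return (storage_gb * STORAGE_RATE_PER_GB_MONTH
            + instance_hours * CPU_RATE_PER_HOUR)

# A 500 GB warehouse queried eight hours a day for a month:
print(monthly_cost(500, 8 * 30))   # 500*0.15 + 240*0.10 = 99.0
```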

Another flavor of cloud computing combines the computing resources with applications, such as Salesforce.com, Google Apps and Microsoft’s cloud-based SharePoint and Exchange offerings.
Some database architectures are better suited to cloud computing than others. Since data warehousing typically involves large data volumes, scalability is key. The cloud provides a virtually unlimited pool of computing power, storage and memory, but these resources are delivered in discrete modules: each node consists of “standard” units of processing power, storage space and memory. While the amounts may vary (by service provider, price point, etc.) and may increase over time, the cloud’s pool of resources is a large grid of interchangeable, industry-standard computing resources. Achieving true scalability requires a database architecture that can make full use of this pool. A shared-nothing, massively parallel database architecture is designed precisely for this: each node owns its own slice of the data along with its own CPU, memory and disk, so adding nodes adds capacity without introducing a shared bottleneck.
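The sketch below illustrates the idea with toy data: rows are hash-partitioned across nodes, each node aggregates its own slice, and only the small partial results are merged. The node count and sample rows are invented for the example:

```python
# Toy model of shared-nothing parallelism: partition by hash, then let
# each node aggregate locally and merge the partial results.
from collections import defaultdict

def partition(rows, num_nodes, key=lambda row: row[0]):
    """Assign each row to a node by hashing its partition key."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[hash(key(row)) % num_nodes].append(row)
    return nodes

def parallel_sum(nodes, value=lambda row: row[1]):
    """Each node sums its own slice; only small partials cross nodes."""
    partials = [sum(value(r) for r in rows) for rows in nodes.values()]
    return sum(partials)

sales = [("east", 100), ("west", 250), ("east", 75), ("north", 40)]
print(parallel_sum(partition(sales, num_nodes=4)))   # 465
```

Because each node works only on its own slice, doubling the nodes roughly halves the scan each node performs, which is what makes a grid of interchangeable machines such a good fit.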

Data Warehousing in the Cloud

Customers have lots of cloud-based data warehouse options. From pure SaaS or DaaS (data-as-a-service) offerings, which provide a full software stack, to PaaS (platform-as-a-service) and IaaS (infrastructure-as-a-service) solutions on which you can build your own data warehouse, the cloud has very quickly proven to be fertile ground for managing increasingly large volumes of data.

Several SaaS or DaaS providers offer data analysis services. Using these services, you can employ powerful, full-stack data warehousing technologies with little effort and pay just for what you use. Companies such as 1010data, LogiXML and LucidEra offer focused solutions for everything from a really big spreadsheet in the cloud, to frameworks with built-in extract, transform and load (ETL) and dashboards, all the way to fully developed analysis tools customized for different verticals.

These solutions require no large outlay to get started. You can sign up using a Web browser, in some cases with a free trial, and start uploading data right away. These full-stack solutions include ETL tools for migrating your data and automatically build out visualizations that let you slice and dice it.

If you need full control over your data warehousing, or your data volumes are larger than SaaS providers can handle, you have the option of rolling your own. If building a data warehouse sounds daunting, building one in the cloud would seemingly only complicate matters. But, in fact, building in the cloud is simple compared with doing it in house.

PaaS providers, such as Google with App Engine, are still not ready to support large data sets and complex analysis. Google’s storage and query model supports basic data management and retrieval but can’t handle the complex analysis available in modern database management systems. But expect this capability to be offered in the future.
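To see the gap, consider the kind of query App Engine’s datastore supports. The sketch below uses its classic Python API with an invented Sale model; simple filters work, but GQL has no equivalent of a join or a server-side aggregate, so anything analytical must be computed in application code (and the code only runs inside an App Engine environment):

```python
# Classic Google App Engine Python datastore API; runs only inside an
# App Engine environment. The Sale model is invented for illustration.
from google.appengine.ext import db

class Sale(db.Model):
    region = db.StringProperty()
    amount = db.FloatProperty()

# Simple filtered retrieval is well supported:
east_sales = db.GqlQuery("SELECT * FROM Sale WHERE region = :1", "east")

# But GQL has no joins or server-side aggregates, so a warehouse-style
# "SUM(amount) GROUP BY region" must be computed in application code:
totals = {}
for sale in db.GqlQuery("SELECT * FROM Sale"):
    totals[sale.region] = totals.get(sale.region, 0.0) + sale.amount
```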

IaaS providers like Amazon, GoGrid and Rackspace offer on-demand scalable machine images from which you can build your own data warehouse in the cloud. Available by the hour, day or month, you rent machine time on private virtualized instances typically running Linux or Windows. Images may come preconfigured with Web servers, application servers and database servers, just as you would set up internally for your database-backed applications.
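Provisioning such an image can be done in a few lines of scripting. A minimal sketch, assuming Amazon EC2 and the boto library; the AMI ID and key pair name are placeholders you would replace with your own:

```python
# Boot a preconfigured machine image on EC2, assuming the boto library;
# credentials are read from the AWS_* environment variables.
import boto

conn = boto.connect_ec2()
reservation = conn.run_instances(
    "ami-12345678",              # placeholder: a preconfigured DB image
    instance_type="m1.large",
    key_name="my-keypair",       # placeholder: your SSH key pair
)
instance = reservation.instances[0]
print(instance.id, instance.state)   # e.g., i-abc123 pending
```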

Third-party providers, such as Elastra and RightScale, have created sophisticated dashboards and management tools around these services, offering world-class tools for deploying and managing your virtualized data center. These services cut the process of hardware provisioning, software installation and configuration of database applications from weeks to a few mouse clicks. Indeed, one of the biggest advantages of native cloud solutions is near-instant availability.
Cloud infrastructure providers offer you an opportunity to craft a best-of-breed solution for your data warehouse at a fraction of the cost of doing it in house, with availability, scalability and pricing at levels only achieved by the largest data centers in the world.

Best of Breed

With full-stack solutions already prepackaged and available for order, what can a best-of-breed solution offer? For small or medium-sized businesses looking at less than 100GB of data, prepackaged solutions may be sufficient. If you have already preaggregated your data and are looking for a high-level view of company health or broadly scoped analysis, then a SaaS solution gives you quick turnaround and high value.

If, on the other hand, your data volumes are starting to grow into the half-terabyte or even terabyte-plus range and your users are demanding more sophisticated capabilities, such as building their own dashboards and creating OLAP cubes with drill-down, then a best-of-breed stack can give you a much more powerful solution for little additional cost and effort. By combining a cloud-based analytic database with a flexible BI package, you can affordably manage multiterabyte data warehouses without the hefty price tag and build-out typically associated with complex data management.

Loading data into a cloud-based data warehouse is no different from loading any other database management system: you define a schema first. If your data exists in a database today – either an internal data warehouse or a transactional system – you can use the vendor’s migration tool to help extract and transfer the data. If you need to perform transformations, you can use commercial ETL tools for point-to-point data loading and transformation, with built-in support for any number of source and target systems. The instance in the cloud appears to your tools like any other database instance in your organization.
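A minimal sketch of that first step, assuming a PostgreSQL-compatible analytic database and the psycopg2 driver; the host, credentials, table and extract file are all placeholders:

```python
# Define a schema and bulk-load an extract, assuming a PostgreSQL-
# compatible analytic database and the psycopg2 driver.
import psycopg2

conn = psycopg2.connect(host="warehouse.example.com", dbname="dw",
                        user="loader", password="secret")
cur = conn.cursor()

# Step 1: define the schema, just as you would on premises.
cur.execute("""
    CREATE TABLE sales (
        sale_date DATE,
        region    VARCHAR(32),
        amount    NUMERIC(12, 2)
    )
""")

# Step 2: bulk-load from a local extract file.
with open("sales_extract.csv") as f:
    cur.copy_expert("COPY sales FROM STDIN WITH CSV", f)
conn.commit()
```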

The first time you load data into the cloud, you may choose to start with a small sample. As with any database management system, the database needs to be tuned and optimized. The database may include a database design tool that will analyze your schema, the sample data you have loaded and any sample query workloads you may already have. Using this information, you can create an optimized implementation of the database before fully loading it.
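Continuing the sketch above (it reuses cur and conn), loading a sample is a small change: cap the rows you send so the design tool has representative data without waiting for a full transfer. SAMPLE_ROWS is an invented knob:

```python
# Send only the first SAMPLE_ROWS rows of the extract for tuning.
from itertools import islice
import io

SAMPLE_ROWS = 100000
with open("sales_extract.csv") as f:
    sample = io.StringIO("".join(islice(f, SAMPLE_ROWS)))
cur.copy_expert("COPY sales FROM STDIN WITH CSV", sample)
conn.commit()
```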

When you are ready to complete uploading your data, keep in mind that large volumes of data may take some time to send out over your Internet connection. You can stage uploads in multiple phases, running mostly during off hours and starting with the most recent and relevant data. If your analytic database supports simultaneous load and query, you can start using the data you have uploaded immediately in order to design your dashboards, cubes and reports.
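One way to script such a phased upload, again reusing the connection from the earlier sketch; the file-naming scheme and the off-hours window are assumptions for illustration:

```python
# Load monthly extract files named sales_YYYYMM.csv, newest first,
# pausing until the local off-hours window before sending each batch.
import glob
import time
from datetime import datetime

def off_hours():
    """True between 10 p.m. and 6 a.m. local time (assumed window)."""
    hour = datetime.now().hour
    return hour >= 22 or hour < 6

for path in sorted(glob.glob("sales_*.csv"), reverse=True):
    while not off_hours():
        time.sleep(15 * 60)       # wait for the off-hours window
    with open(path) as f:
        cur.copy_expert("COPY sales FROM STDIN WITH CSV", f)
    conn.commit()                 # loaded batches are queryable now
```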

Securing and Scaling Your Data Warehouse

Managing data outside of the enterprise firewall is understandably a key concern. When the hardware is on your premises, access is strictly managed and closely guarded. But the bigger issue is securing your machines from external unauthorized access. As with any hosted environment, you should plan to deploy firewall and possibly virtual private network (VPN) solutions to safeguard your data. The safest approach is to lock down each host and grant access only to the external machines that use the database or application services.
You may also choose to configure a VPN between the cloud machines and your internal users. Since infrastructure providers give you full root access, you can lock down and secure the environment just as you would in a dedicated hosting environment. A VPN allows you to completely firewall your data warehousing solution, effectively creating an extension of your enterprise network while taking advantage of the flexibility and scalability of cloud computing.
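What locking down looks like in practice varies by provider. The sketch below uses Amazon EC2 security groups via boto, restricting the database port and SSH to a single office address; the IP and ports are placeholders:

```python
# Restrict the database port and SSH to one known external address,
# assuming EC2 security groups via boto.
import boto

conn = boto.connect_ec2()
group = conn.create_security_group("warehouse", "Data warehouse hosts")
group.authorize("tcp", 5432, 5432, cidr_ip="203.0.113.10/32")  # database
group.authorize("tcp", 22, 22, cidr_ip="203.0.113.10/32")      # SSH admin
```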

Now that you have built out your data warehouse, how do you add capacity to meet the inevitable increase in user demand? While SaaS providers have different models based on the amount of data or number of users, if you build your own you can scale each component incrementally. For example, if your user load has gone up but the database is still holding steady, you may choose to add an application server by booting up a new server and pointing it at your report repository and your database. You can save an image of your application server configuration so new instances boot up preconfigured.
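The same boto pattern from earlier works for the application tier. This hedged sketch boots a new server from a saved image and hands it the repository and database locations through user data; the AMI ID and host names are placeholders:

```python
# Scale the application tier by booting a saved, preconfigured image.
import boto

conn = boto.connect_ec2()
conn.run_instances(
    "ami-87654321",               # placeholder: your saved app-server image
    instance_type="m1.large",
    user_data="REPORT_REPO=reports.example.com\n"
              "DB_HOST=warehouse.example.com",
)
```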

When your database needs new capacity, native cloud-based database management systems that employ a shared-nothing architecture make the process just as easy. After booting up a new machine, you can add it to your database with a simple click. From there, you can choose which tables to replicate or resegment onto the new machines and let the database management system handle the transition in the background.
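What resegmenting means under the hood can be pictured with a toy model: when the node count grows, each row is rehashed and only the rows whose hash now maps to a new node move. The data and node counts here are invented:

```python
# Toy resegmentation: redistribute rows when nodes are added.
def resegment(old_nodes, new_count, key=lambda row: row[0]):
    """Rehash every row across the new, larger set of nodes."""
    new_nodes = [[] for _ in range(new_count)]
    for rows in old_nodes:
        for row in rows:
            new_nodes[hash(key(row)) % new_count].append(row)
    return new_nodes

old = [[("east", 100), ("north", 40)], [("west", 250), ("east", 75)]]
print([len(n) for n in resegment(old, new_count=3)])   # e.g., [2, 1, 1]
```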

Data warehousing is not as foreign to the cloud as it might appear. Managing large volumes of data on virtualized, dynamic instances is more like flying an airliner than drag racing a dump truck. With the high bar set by cloud computing services, many of the operations that are costly and time-consuming in house have been fully automated, while you still retain control over your data management and data security.
