Data clouds pose an interesting dilemma for enterprise IT organizations. On one hand, they promise to drastically reduce the cost and complexity of storing enterprise data. On the other, they create numerous migration challenges. When considering a data cloud implementation, enterprises currently have two primary options: deploy an internal data cloud or rely on an existing third-party, public data cloud such as Amazon Simple Storage Service (Amazon S3) or Rackspace Cloud. While some fundamental challenges, such as appropriate security and governance, potentially exist in both deployment scenarios, a deployment in a public cloud has an additional and critical limitation: moving a large amount of data into a public cloud can take months or even years because of the constraints imposed by insufficient network bandwidth. Werner Vogels, CTO of Amazon.com, describes this issue in his blog, "All Things Distributed," where he compares the number of days it would take to transfer a data set to Amazon at different network bandwidths. See Figure 1 for a partial set of his findings.

In other words, transferring one terabyte of enterprise data to a public cloud takes roughly 82 days over a T1 line and about 13 days at 10 Mbps. Considering that data volumes at many large enterprises reach several petabytes, one might conclude that utilizing a public cloud is far less practical than initially predicted.
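
The arithmetic behind these numbers is straightforward. Here is a minimal Python sketch, assuming a binary terabyte (2^40 bytes) and roughly 80 percent effective link utilization - an assumption chosen because it reproduces figures close to Vogels'; real-world protocol overhead and contention vary:

    # Back-of-envelope estimate of how long it takes to push a data set
    # over a sustained network link. The 80% utilization factor is an
    # assumption; actual effective throughput varies by network.

    def transfer_days(data_bytes: float, link_bps: float,
                      utilization: float = 0.8) -> float:
        """Days needed to move data_bytes over a link of link_bps."""
        effective_bps = link_bps * utilization
        return (data_bytes * 8) / effective_bps / 86_400  # 86,400 s/day

    ONE_TB = 2 ** 40  # one terabyte (binary), in bytes

    for label, bps in [("T1 (1.544 Mbps)", 1.544e6), ("10 Mbps", 10e6)]:
        print(f"{label}: ~{transfer_days(ONE_TB, bps):.0f} days per TB")
    # T1 (1.544 Mbps): ~82 days per TB
    # 10 Mbps: ~13 days per TB

Scaling those figures to a multi-petabyte estate is what turns days into the months or years cited above.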

Migrating Data into the Cloud

While some companies have standardized on a small number of data sources, the majority of medium and large enterprises use a wide variety of relational, nonrelational and packaged application data sources (see Figure 2). The number of data sources typically reaches into the hundreds, with new ones added monthly. In addition, the average enterprise runs multiple versions of the same data source (some companies have reportedly deployed up to three different versions of the same product from a single vendor) alongside a broad variety of data source types. The result is that migrating all enterprise data into a cloud in a meaningful way takes a Herculean effort.

There are two popular approaches to migrating enterprise data into the cloud effectively. The first is to batch-load data from the sources directly into the cloud. To determine whether this is the best choice, the IT organization should establish whether the data stored in the cloud will be shared among many applications or compartmentalized for use by individual applications. If multiple applications use the same data sets predominantly for read-only purposes, then sharing those data sets in the cloud is likely to be safe. However, because many enterprises copy original data into multiple locations, either to increase the performance of local applications or to combine it with other data as business users require, the better path may be for IT to create a consolidated model that can be ported into the cloud. This consolidated model is built through data discovery and analysis that identifies all copies and permutations of the data, as sketched below. If, on the other hand, multiple applications perform updates and writes against the data source, then compartmentalizing the data set for exclusive use by individual applications is probably the only viable option (see Figure 3).
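
One minimal sketch of that discovery step, assuming the suspected copies are reachable through standard database drivers: fingerprint candidate tables across sources so duplicate copies can be found before building the consolidated model. The connection strings and table names here are hypothetical placeholders:

    import hashlib
    import sqlite3  # stand-in for any DB-API driver (pyodbc, cx_Oracle, ...)

    def table_fingerprint(conn, table: str, order_by: str) -> str:
        """Digest of a table's contents, read in a stable order."""
        digest = hashlib.sha256()
        for row in conn.execute(f"SELECT * FROM {table} ORDER BY {order_by}"):
            digest.update(repr(row).encode("utf-8"))
        return digest.hexdigest()

    # Two sources suspected of holding copies of the same customer data.
    src_a = sqlite3.connect("crm_replica.db")     # hypothetical source
    src_b = sqlite3.connect("marketing_copy.db")  # hypothetical source

    if table_fingerprint(src_a, "customers", "customer_id") == \
       table_fingerprint(src_b, "customers", "customer_id"):
        print("customers: identical copies - safe to consolidate")
    else:
        print("customers: contents diverge - reconcile before consolidating")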

Once IT identifies the data to be migrated, it develops extract, transform and load (ETL) batch and transformation scripts to move the data into the cloud. This process can take several hours to several days, depending on the volume of data to be moved. Because the data in the originating sources typically continues to change throughout the migration, the IT organization must also develop synchronization scripts that run periodically until the enterprise applications are completely switched over to the cloud data. All in all, this is a time- and resource-intensive process from start to finish.
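
A sketch of such a periodic synchronization job, assuming the source tables carry an updated_at watermark column; push_to_cloud is a hypothetical stand-in for the cloud vendor's upload API, and a real job would also handle deletes and failures:

    import sqlite3
    from datetime import datetime, timezone

    def push_to_cloud(table: str, rows: list) -> None:
        """Placeholder for the vendor-specific upload (e.g., an S3 PUT)."""
        print(f"pushed {len(rows)} changed rows from {table}")

    def sync_changes(conn, table: str, last_sync: str) -> str:
        """Re-copy rows updated since last_sync; return the new watermark."""
        watermark = datetime.now(timezone.utc).isoformat()
        rows = conn.execute(
            f"SELECT * FROM {table} WHERE updated_at > ?", (last_sync,)
        ).fetchall()
        if rows:
            push_to_cloud(table, rows)
        return watermark  # persist for the next scheduled run

    # Demo against an in-memory source; a real job would connect to the
    # originating database and run from a scheduler until cutover.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
    conn.execute("INSERT INTO orders VALUES (1, '2024-01-02T00:00:00+00:00')")
    watermark = sync_changes(conn, "orders", "2024-01-01T00:00:00+00:00")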

The second option for migrating data into the cloud is data virtualization, which offers several key advantages over manual batch loading. First, data virtualization fully abstracts the data from both the sources and the accessing applications, so the data model put in place for the data virtualization layer can also serve as the initial data cloud model for the data sets it abstracts. Second, instead of batch-loading the entire data set, data virtualization allows the IT organization to load data into the cloud on demand; IT accomplishes this by configuring the data cloud to use the data virtualization layer as a single data source. Third, data virtualization removes the complexity of continuous changes to the data sets by allowing a phased migration in which some enterprise applications continue to access the data through the virtualized layer while others access the data in the cloud. Changes made in the cloud are automatically synchronized back to the originating data sources through the data virtualization middleware's pass-through capabilities (see Figure 4).
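
A toy sketch of the routing idea behind on-demand loading and pass-through synchronization, with dictionaries standing in for the legacy source and the cloud store; the class and method names are illustrative, not a vendor API:

    class VirtualizationLayer:
        def __init__(self, legacy_source: dict, cloud_store: dict):
            self.legacy = legacy_source  # stand-in for the original database
            self.cloud = cloud_store     # stand-in for the data cloud

        def read(self, key):
            if key not in self.cloud:    # on-demand load on first access
                self.cloud[key] = self.legacy[key]
            return self.cloud[key]

        def write(self, key, value):
            self.cloud[key] = value      # update the cloud copy
            self.legacy[key] = value     # pass-through sync to the source

    layer = VirtualizationLayer({"cust:42": "Acme Corp"}, {})
    print(layer.read("cust:42"))   # first read pulls the row into the cloud
    layer.write("cust:42", "Acme Corporation")  # change flows back to source

The point of the sketch is the split in responsibilities: applications see one data source, while the layer decides when data moves into the cloud and keeps the originating source in step.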

Migrating Applications to the Cloud

Enterprise data does not sit in a vacuum; it is created, updated and consumed by a variety of enterprise applications, both custom-coded and packaged. A large enterprise can have dozens or even hundreds of applications configured to access and work with particular data sources. Most of these applications were not designed to access data in the cloud; rather, the majority store data in some relational format and rely on SQL over a standardized interface such as ODBC or JDBC. This poses a major issue for enterprises planning and executing a migration to the cloud, because the IT department typically needs to develop proprietary integration components, which can be quite time- and resource-consuming. The alternative to developing proprietary integration components is to use data virtualization.

Data virtualization solves this problem by providing standardized client-side access through ODBC, JDBC and Web services interfaces, thereby abstracting access to the data cloud for the enterprise applications. Moreover, because many enterprises develop custom business logic that partially resides in the data layer in the form of stored procedures or PL/SQL code, the data virtualization layer can also federate access to these business logic elements while providing transparent access to the data cloud. This is particularly useful for a staged migration because it reduces the complexity of the migration process itself (see Figure 5).
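
From an application's perspective, the switch can amount to little more than a connection-string change: the same ODBC code path works whether the data source name resolves to the original database or to the virtualization layer. A hedged example using pyodbc, one common Python ODBC bridge; the DSN, credentials, query and stored procedure names are hypothetical:

    import pyodbc

    # Before migration this DSN pointed at the legacy database; now it
    # points at the data virtualization layer fronting the cloud.
    conn = pyodbc.connect("DSN=EnterpriseDataVirt;UID=app_user;PWD=***")

    cursor = conn.cursor()
    cursor.execute("SELECT order_id, total FROM orders WHERE region = ?",
                   "EMEA")
    for order_id, total in cursor.fetchall():
        print(order_id, total)

    # Stored-procedure logic federated by the virtualization layer
    # remains callable through the same ODBC escape syntax:
    cursor.execute("{CALL calc_quarterly_totals(?)}", "2024Q1")
    conn.close()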

As enterprises design and deploy data cloud architectures, they should consider how to migrate their existing enterprise data into the cloud efficiently while remaining transparent to existing enterprise applications and providing uninterrupted service to business users. Data virtualization addresses several fundamental challenges of data migration by abstracting access to data resources for the cloud and synchronizing data between the legacy sources and the cloud. At the same time, it provides transparent access to the data residing in the cloud for existing enterprise applications that were never designed to work with data clouds.
