The virtualization revolution is upon us: first storage, then servers and applications, now data itself. Data virtualization, also referred to as a virtualized data layer, an information grid or information fabric, brings together data from multiple, disparate sources - anywhere across the extended enterprise - into a unified, logical virtualized data layer for consumption by nearly any front-end business solution, including portals, reports, applications and more.

 

Data virtualization is increasingly being recognized as the better way to integrate data when the consuming solutions need real-time data from multiple silos and complex sources.

 

Enterprises are facing growing challenges in using disparate sources of data managed by different applications, including problems with integration, security, performance, availability, and quality. Business users need fast, real-time and reliable information to make business decisions, while IT wants to lower costs, minimize complexity and improve operational efficiency. New technology is emerging that Forrester has dubbed “information fabric,” defined as a virtualized data layer that integrates heterogeneous data and content repositories in real time.1

 

The Case for a New Approach to Data Integration

 

To keep pace with constant changes in the business, IT has been aggressively delivering new solutions that integrate existing data from a complex, ever-changing infrastructure. Enterprises that limit themselves to traditional data integration methods are less competitive than those that adopt newer approaches.

 

When accessing a few data sources with well-understood syntax and common structures, integrating data with custom code is effective. But the limitations of hand coding materialize quickly as data silos proliferate, as new structures (such as XML) and complex syntax (such as that of enterprise applications like SAP) appear, and as the data needs of consuming applications become more diverse.

 

Replication-based data integration methods, including file extracts, database replication, data marts and data warehouses, emerged as an alternative to hand coding. However, replication-based approaches have their own set of limitations:

  • Batch refreshes delay information delivery, standing in the way of real-time access.
  • Building and testing extracts and marts add development time to every project, delaying timely business decision-making.
  • Controlling replicated data and maintaining additional physical data stores are resource intensive, thereby exacerbating the data proliferation problem and adding business costs.
  • Only a subset of use cases requires the multi-dimensionality and other complex transformation capabilities that warehouses provide, yet every project bears the overhead of building them.
  • Typical replicated architectures don’t align easily with modern, real-time service-oriented architectures (SOAs).

Technology advances have opened the door for new data integration methods. Advanced query optimization techniques, combined with low-cost, high-performance server and network architectures, mitigate many of the performance issues that originally motivated replication. Furthermore, server and storage virtualization advancements have demonstrated dramatic cost savings while hiding the ever-increasing complexity of the IT factory.

 

What is Data Virtualization?

 

Data virtualization is a new approach to data integration based on virtualized or logical, rather than physical, integration. It leverages recent technology advances and overcomes many of the issues associated with hand coding and replication-based approaches. Enterprises are using data virtualization to gain dramatic time and cost savings for development projects where any or all of the following characteristics are important:

 

  • Time to solution and frequent change place a premium on agility.
  • The consuming business solution requires real-time insight from fast-changing sources.
  • Data volumes, transformation and cleansing workloads are supportable at run time.
  • Replication is constrained by data ownership or compliance rules.
  • Development and support costs must be reduced.

Figure 1: Key Data Virtualization Capabilities (Courtesy of Composite Software, Inc.)

 

Within the virtual data layer depicted above, virtualization serves up data as if it were available from one place, regardless of how it is physically distributed across data silos. Because virtualized data is accessed on demand from the underlying sources rather than copied, it is always up to date.

 

Abstraction simplifies complex data by transforming it from its native structure and syntax into reusable views and Web services that are easy for business solutions developers to understand and business solutions to consume.
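
To make abstraction concrete, here is a minimal Python sketch (purely illustrative; the feed, field names and function are hypothetical, not any vendor's API) that takes an XML order feed in its native structure and exposes it as flat, tabular rows that a report or portal could consume directly:

    import xml.etree.ElementTree as ET

    # Hypothetical native XML from a source system; structure and fields are illustrative.
    RAW_ORDER_FEED = """
    <orders>
      <order id="1001"><customer>ACME</customer><amount currency="USD">250.00</amount></order>
      <order id="1002"><customer>Globex</customer><amount currency="USD">975.50</amount></order>
    </orders>
    """

    def orders_view(xml_text):
        """Abstract the feed's native structure into flat, reusable rows (dicts)."""
        root = ET.fromstring(xml_text)
        for order in root.findall("order"):
            yield {
                "order_id": order.get("id"),
                "customer": order.findtext("customer"),
                "amount": float(order.findtext("amount")),
                "currency": order.find("amount").get("currency"),
            }

    for row in orders_view(RAW_ORDER_FEED):
        print(row)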

 

Federation securely accesses diverse operational and historical data, combining it into meaningful business information such as a single view of a customer or a “get_inventory_balances” composite service. Query optimization and other techniques such as caching enable high performance.
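
As a rough illustration of federation, the Python sketch below (the source systems, tables and field names are hypothetical, not a vendor implementation) joins customer master data held in one store with order records held in another, at query time, into a single view of a customer:

    import sqlite3

    # Source 1: a hypothetical CRM system, modeled here as an in-memory SQLite database.
    crm = sqlite3.connect(":memory:")
    crm.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT)")
    crm.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                    [(1, "ACME", "EMEA"), (2, "Globex", "APAC")])

    # Source 2: a hypothetical order system, represented as records returned by a service call.
    ORDERS = [
        {"customer_id": 1, "order_id": "A-17", "amount": 250.00},
        {"customer_id": 1, "order_id": "A-21", "amount": 120.00},
        {"customer_id": 2, "order_id": "B-03", "amount": 975.50},
    ]

    def single_view_of_customer(customer_id):
        """Federate both sources on demand into one logical customer record."""
        name, region = crm.execute(
            "SELECT name, region FROM customers WHERE id = ?", (customer_id,)).fetchone()
        orders = [o for o in ORDERS if o["customer_id"] == customer_id]
        return {
            "customer_id": customer_id,
            "name": name,
            "region": region,
            "orders": orders,
            "order_total": sum(o["amount"] for o in orders),
        }

    print(single_view_of_customer(1))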

 

At build time, data virtualization solutions provide an easy-to-use data modeler and code generator that abstract data in the form of relational views for reporting and other business intelligence (BI) uses or Web data services for service-oriented architecture (SOA) initiatives, portals, etc.
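
As a hypothetical sketch of the web data service case, the federated customer view from the previous example could be published to a portal as a simple JSON endpoint using only the Python standard library (no product's code generator is implied; the URL pattern and service name are made up):

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class CustomerViewService(BaseHTTPRequestHandler):
        """Serve the federated view at /customers/<id> as JSON.

        Assumes single_view_of_customer() from the federation sketch above is defined.
        """
        def do_GET(self):
            parts = self.path.strip("/").split("/")
            if len(parts) == 2 and parts[0] == "customers" and parts[1].isdigit():
                body = json.dumps(single_view_of_customer(int(parts[1]))).encode()
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.end_headers()
                self.wfile.write(body)
            else:
                self.send_error(404)

    if __name__ == "__main__":
        # e.g., GET http://localhost:8000/customers/1
        HTTPServer(("localhost", 8000), CustomerViewService).serve_forever()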

 

At run time, data virtualization solutions provide high-performance query capabilities that securely access, federate, transform and deliver data to consuming business solutions in real time.
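
The caching mentioned above as a run-time performance technique can be illustrated with a simple time-to-live (TTL) cache wrapped around the federated query from the earlier sketches (again purely hypothetical; the TTL value would be tuned to the consuming solution's freshness requirements):

    import time

    _CACHE = {}              # customer_id -> (timestamp, result)
    CACHE_TTL_SECONDS = 30   # illustrative freshness window; tune to the consumer's SLA

    def cached_customer_view(customer_id):
        """Answer from the cache while fresh; otherwise re-federate from the sources."""
        now = time.time()
        hit = _CACHE.get(customer_id)
        if hit and now - hit[0] < CACHE_TTL_SECONDS:
            return hit[1]
        result = single_view_of_customer(customer_id)  # defined in the federation sketch
        _CACHE[customer_id] = (now, result)
        return result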

 

Data virtualization can be applied to a single project or to a range of initiatives that require data integration, such as BI, reporting, SOA, customer data integration (CDI), master data management (MDM) and more.

 

Data Virtualization in Action

 

The following are three examples of data virtualization applied to real-life business situations.

A drug discovery portfolio management portal illustrates how a virtualized approach to data integration works. This project is just one of more than 20 data virtualization projects now in production at a leading pharmaceutical company. Senior management, project team leaders, business analysts and research scientists continuously review and evaluate their portfolio of in-process drug discovery and drug development projects. This collaboration requires a wide range of project status information, costs, resources, timelines and other data. While real-time integration is important, the breadth of data required and the ever-changing business needs of these diverse users were the key design considerations favoring a virtual over a replication-based approach. Although considered, duplicating data was ultimately rejected because it would have added too much cost and significantly slowed the rapid, iterative development required. Using virtualization, IT developers add, test and deliver data from new sources in minutes without involving the IT operations groups for data mart rebuilds and similar efforts.

 

In the financial services industry, investment managers responsible for large-equity portfolios are using data virtualization to help improve investment decision-making. A wide range of equity financial data from a number of financial research databases feeds information to their portfolio analysis models. To hide the complexity of the source data and make it easier and faster for the analysts to access the right data, this investment firm created a data virtualization layer that abstracts the data into a set of high-performance views that can be shared by a number of analysts and models. The loose coupling of the sources and consumers has also helped provide flexibility when several source systems were retired or replaced by outside financial information services.

 

In the third case, a top investment bank is using data virtualization to provide a single view of positions and trades for its key customers. Equity, debt and other financial instrument trades are transacted in different trading systems. But, customers need to see their consolidated positions and trades any time they want. And, at month end, these positions and trades must be reported in monthly statements. By virtualizing the data, the investment bank federates customer information on the fly without waiting for the next refresh. With high-performance query optimization, they produce the tens of thousands of monthly reports required well within their batch windows.

 

Data Virtualization Applied

 

Just as replication-based data integration approaches have strengths and limitations, so do virtualized approaches. When choosing between the different approaches, project teams need to consider a number of key factors with respect to the end-business solution and source data requirements, including:

  • Schema. How does data relate? And how should it? Normalized and/or denormalized models? How much to replicate? How much to virtualize?
  • Agility. How well understood are the requirements? Will they change? How “set in stone” should the schema be if there is the need to support rapid, iterative development?
  • Volatility. Does the user require up-to-the-minute data? How often does the source data change? How will you ensure the data is fresh - periodic ETL refreshes, change data capture or federated query on demand?
  • Performance. What is the service level agreement (SLA) for the consuming business solution? And the SLA for the source systems? How complex are the queries? How much volume? Pure replication, pure virtual or a mix along with some caching?
  • Transformation. How much transformation is required? Is the goal dimensioning the data for heavy-duty analysis? Or is it to combine disparate types (XML, relational, etc.) into an easy-to-understand, easy-to-use tabular form?
  • Quality. How much cleansing? And where will cleansing occur? Fix the source data? Cleanse during replication? Is virtualization an agreed-upon, best version of the truth?
  • Security. What are the field- and row-level access and authentication rules? Are there constraints on replication due to compliance rules or ownership boundaries? Is encryption needed?
  • Reuse. Will the source data be used by other consuming applications? How will the reuse be implemented? Provide reusable views and data services? Or create an all-inclusive warehouse?
  • Cost. How much budget is available for data integration on this project today? Will the extra costs required for replication be covered in the current budget? Will the greater flexibility inherent with virtualization provide extra value/cost savings in the future?

These criteria should be applied on a project-by-project basis early in the design cycle. The answers highlight which projects require replication and those best suited for virtualization. For projects that can go either way or require a combination, virtualization is almost always faster to build and less costly to run.

 

Companies with mature Integration Competency Centers are best positioned to help in this decision process because they have the broadest understanding of capabilities and tradeoffs. Further, these centers can be invaluable homes for data virtualization design, development and deployment expertise as this approach grows from project to enterprise deployment.

 

The Data Virtualization Bottom Line

 

Data virtualization, following the proven path of storage, server and application virtualization, is a nascent, enterprise-scale data integration approach that overcomes physical complexity to accelerate business initiatives and radically lower costs. When compared to replication-based approaches, data virtualization has a number of advantages for both business and IT, including:

  • Delivers real-time information without the need for refreshes or complex change data capture.
  • Builds new solutions in days, not weeks, so IT can provide value to the business sooner by avoiding development and testing of physical stores.
  • Reduces unnecessary data replication and silo proliferation thereby reducing business and IT costs, while improving data control.

Where the business requires agility to meet frequently changing demands or fast-changing sources; where data volumes, transformation and cleansing workloads are supportable at run time; and where development and maintenance costs must be reduced, data virtualization is a better way to integrate data than traditional methods.

 

Reference:

  1. Noel Yuhanna and Mike Gilpin. "Information Fabric 2.0: Enterprise Information Virtualization Gets Real." Forrester Research Inc., April 9, 2007.

 
