Using an Enterprise Information Integration Platform Approach
Enterprises spend significant resources keeping their data architectures manageable and useful for the people who need the assets they contain. For data architects, it can seem like one step forward and two steps back: they spend significant effort keeping track of OLTP systems, data warehouses and data marts, building metadata repositories, and developing (and sometimes even reverse engineering) data models, only to see more stovepipe databases and applications spring up.
As enterprise applications continue to evolve and their data requirements change, keeping tabs on the enterprise's data assets and avoiding costly replicated efforts and errors becomes a challenging task. Applications would benefit by leveraging existing data assets where possible, applying integration technologies to get a coherent picture of enterprise data. While there are several methods available for integrating data assets, many situations are too complex or require more flexibility than some traditional methods enable. This article introduces an enterprise information integration approach - how it works and how it addresses the critical issues involved in data integration.
Organizing the One Source of Truth
Most enterprises build reference data sources that are designed as the single source of truth for a particular data domain. For instance, a financial institution might build an authoritative source of market data and provide it as a service for the rest of the enterprise. Applications developed within different domains in the enterprise can then take advantage of these services to deliver higher business value.
In fact, individual applications and databases are built all the time to serve a parochial business need. However, these local data sources often overlap, or worse, replicate existing information sources. And frequently, these applications and data sources are invisible to the rest of the enterprise. Without enterprise data architects' vigilance, these applications end up creating their own silos of information - and varying versions of important enterprise data.
In a service-oriented data architecture, where data sources are designed to serve multiple areas of the company, the data architect's job becomes one of ensuring that the required applications can successfully leverage existing, multiple data sources in order to paint a true picture of the state of the enterprise. Not only must the data architect physically and logically connect these sources, the architect must also bridge disparate data models, query languages, programming interfaces and protocols to create the semantically consistent pool of data that the application needs. There are three primary means of achieving this goal.
Integrate via Custom Code
The first approach is to write custom integration code, which works if the integration required is both "simple" and "rigid." Simplicity implies no more than two data sources, preferably with homogeneous data models and compatible programming interfaces, and no semantic complexities. Rigidity implies that application requirements, underlying data source schemas and data semantics do not change. When both conditions hold, this approach is the least expensive alternative available. However, if the integration is not truly simple and rigid, the IT group pays the price in prolonged development and maintenance efforts.
Integrate via ETL
A second approach is to extract, transform and load (ETL) the data from multiple data sources into a local application database. This can be achieved through a commercial ETL tool or by scripting in a language of choice. An ETL approach is preferred when the application is read-only, does not require real-time data and, most importantly, is based on finite data domains. ETL enables architects to build a local application database, which can be tuned as required by the application. The ETL process transforms the data and resolves semantic inconsistencies at the time the data is extracted, not at query time. However, the application cannot update the original data sources, and because extractions are often scheduled, it does not have access to data in real time. Some ETL tools provide change data capture, which can narrow this window. Most ETL tools also deal mainly with relational data sources, making it difficult to support any data that cannot be seen as rows and columns.
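The ETL flow described above can be sketched in a few lines of Python. This is a minimal illustration rather than a recommended tool: the two sources, their field names and the cents-versus-dollars mismatch are all invented for the example, and an in-memory SQLite database stands in for the local application database.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: a flat-file export with amounts in cents.
ORDERS_CSV = "order_id,customer,amount_cents\n1,ACME,1250\n2,Globex,990\n"

# Hypothetical source 2: rows from another system, amounts already in dollars.
LEGACY_ORDERS = [{"id": 3, "cust_name": "acme", "total": 40.0}]


def extract():
    """Pull raw rows from both sources, tagging each with its origin."""
    for row in csv.DictReader(io.StringIO(ORDERS_CSV)):
        yield "csv", row
    for row in LEGACY_ORDERS:
        yield "legacy", row


def transform(origin, row):
    """Resolve naming and unit differences once, at extraction time."""
    if origin == "csv":
        return (int(row["order_id"]), row["customer"].upper(),
                int(row["amount_cents"]) / 100)
    return (row["id"], row["cust_name"].upper(), float(row["total"]))


def load(conn, rows):
    """Write the cleaned rows into the local application database."""
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, [transform(origin, row) for origin, row in extract()])

# The application queries only the local copy; the sources are not touched again.
totals = dict(conn.execute("SELECT customer, SUM(amount) FROM orders GROUP BY customer"))
print(totals)
```

Because the transformation runs before the load, queries against the local table never see the original inconsistencies. The trade-off is the one noted above: the local copy is only as fresh as the last extraction.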
A New Approach
A third approach is to use enterprise information integration (EII) platforms. An EII platform hides the complexity of the variety of data sources being integrated and exposes a single data model, query language and programming interface to the end user. An EII platform is the preferred approach when:
- There are a variety of data models, query languages and programming interfaces involved in the integration.
- There is a need for real-time access to operational information.
- The data architecture might be changing or in transition.
- Integration semantics are simple.
For instance, an EII platform can support a relational data model, SQL query language and ODBC/JDBC APIs for access to data. Under the hood, that EII platform might be accessing relational databases, Web services, flat files, message queues, IMS, CICS or custom applications. This significantly reduces application development complexity when the information sources and the ways to access them are varied. Also, because EII distributes queries in real time across the information sources, users receive the freshest information available at query time. An EII approach also shields applications from changes in the underlying data architecture, as those changes can be compensated for in the EII layer by changing mappings.
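To make the real-time point concrete, here is a small Python sketch of a virtual view that is assembled from its sources on every call instead of being copied ahead of time. It is an illustration under stated assumptions, not how any particular EII product works: the table, the `credit_service` dictionary standing in for a Web service, and all function names are hypothetical.

```python
import sqlite3

# Hypothetical relational source.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cust (id INTEGER, name TEXT)")
db.executemany("INSERT INTO cust VALUES (?, ?)", [(1, "ACME"), (2, "Globex")])

# Hypothetical stand-in for a Web service response, keyed by customer id.
credit_service = {1: "AA", 2: "B"}


def fetch_customers():
    """Adapter for the relational source."""
    return db.execute("SELECT id, name FROM cust ORDER BY id").fetchall()


def fetch_rating(customer_id):
    """Adapter for the (stubbed) service call."""
    return credit_service.get(customer_id)


def customer_view():
    """Virtual 'customer' table: assembled from live sources on every call."""
    return [{"id": cid, "name": name, "rating": fetch_rating(cid)}
            for cid, name in fetch_customers()]


before = customer_view()
credit_service[2] = "BBB"   # the source changes...
after = customer_view()     # ...and the next query sees it, with no reload step
print(before[1]["rating"], "->", after[1]["rating"])
```

The application calls only `customer_view()`. If the rating data later moved to a different kind of source, only the `fetch_rating` adapter would need to change.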
At its core, an EII solution is a translator. It translates from the language used by the application when submitting a query to all of the languages and interfaces supported by information sources that can answer that query. An EII solution also coordinates and optimizes interactions with these back-end information sources, resolves any semantic inconsistencies and delivers a consolidated result to the requesting application in a format it understands. Figure 1 illustrates the process. Individual EII platforms will differ in terms of the data model and query language they support, the nature and efficiency of optimizations they perform in distributed queries, or in terms of their coverage of information sources -- but their fundamental makeup is still the same.
Figure 1: EII from 10,000 Feet
Query Language and Data Model
Looking at an EII platform from the top down, the first layer represents the query languages and data models supported by the EII platform, which defines how applications built on top of the EII platform view underlying data assets. For example, the EII platform might support the SQL query language and the relational data model. In this case, the applications built on top of the EII platform see a single virtual relational database that represents all the enterprise information sources of interest. Alternatively, the EII platform might support the XQuery query language and the XML data model, and the application would see a virtual XML database.
The EII platform will also support the appropriate APIs such as ODBC, JDBC, XQJ, JAXP, DOM and/or SAX for accessing the virtual database and processing query results. The application architect and the data architect must decide which interface will be supported by the EII platform for that application, based on the nature of data the application produces and consumes and its place in a larger application architecture.
EII platforms supply modeling tools that help architects define a virtual database schema and map it to the schemas of the underlying information sources. Such a tool is usually a user interface that lets architects visualize the data models of the underlying data sources and map from those data models to the unified data model of the EII platform (relational, XML, object, etc.).
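Whatever the modeling tool looks like on screen, its output is essentially a set of mappings from the virtual schema to each source. The Python sketch below shows one possible shape for such a mapping. The source names, column names and helper functions are all hypothetical.

```python
# Virtual schema exposed to applications (hypothetical).
VIRTUAL_COLUMNS = ("customer_id", "customer_name", "country")

# Per-source mappings: virtual column -> field name in that source (all hypothetical).
MAPPINGS = {
    "crm_db": {
        "customer_id": "CUST_NO",
        "customer_name": "CUST_NM",
        "country": "CTRY_CD",
    },
    "billing_svc": {
        "customer_id": "accountId",
        "customer_name": "displayName",
        "country": "countryCode",
    },
}


def to_virtual(source, record):
    """Rename a source record's fields into the virtual schema."""
    mapping = MAPPINGS[source]
    return {col: record.get(mapping[col]) for col in VIRTUAL_COLUMNS}


def unmapped_columns(source):
    """Report virtual columns that a source mapping fails to cover."""
    return [col for col in VIRTUAL_COLUMNS if col not in MAPPINGS[source]]


row_a = to_virtual("crm_db", {"CUST_NO": 7, "CUST_NM": "ACME", "CTRY_CD": "US"})
row_b = to_virtual(
    "billing_svc", {"accountId": 7, "displayName": "Acme Corp", "countryCode": "US"}
)
print(row_a)
print(row_b)
```

Keeping the mapping as data rather than code is what allows a schema change in a source to be absorbed by editing one entry.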
Translation and Coordination
Applications pose queries against the virtual database schema defined for the EII layer. The EII platform's task is to take a query and translate it for the underlying data sources. For instance, a SQL query sent to the EII layer may require the EII platform to access relational databases, Web services and flat files. The platform must then develop a query plan that breaks the input query into its constituent parts and coordinates their execution across the multiple data sources. With the help of mapping information, queries are rewritten in a form that the underlying information sources understand. These new queries could be SQL queries, Web service calls, calls to custom application interfaces or brute-force scans, depending on the type of information source. Translation between query languages and interfaces is one of the important tasks that an EII platform performs.
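The rewriting step can be illustrated with a deliberately tiny logical query that is translated once into parameterized SQL and once into a service request. This is only a sketch: the `Query` class, the two mapping dictionaries, the `/accounts` endpoint and the `fields` parameter are assumptions made for the example, and a real platform would handle far richer queries.

```python
from dataclasses import dataclass, field


@dataclass
class Query:
    """A deliberately tiny logical query: projected columns plus equality filters."""
    columns: list
    filters: dict = field(default_factory=dict)


# Hypothetical mappings from virtual column names to source-specific names.
SQL_MAP = {"table": "CUST", "customer_id": "CUST_NO", "country": "CTRY_CD"}
SVC_MAP = {"endpoint": "/accounts", "customer_id": "accountId", "country": "countryCode"}


def to_sql(query):
    """Rewrite the logical query as parameterized SQL for a relational source."""
    cols = ", ".join(f"{SQL_MAP[c]} AS {c}" for c in query.columns)
    sql = f"SELECT {cols} FROM {SQL_MAP['table']}"
    if query.filters:
        sql += " WHERE " + " AND ".join(f"{SQL_MAP[c]} = ?" for c in query.filters)
    return sql, list(query.filters.values())


def to_service_call(query):
    """Rewrite the same query as an endpoint plus request parameters."""
    params = {SVC_MAP[c]: v for c, v in query.filters.items()}
    params["fields"] = ",".join(SVC_MAP[c] for c in query.columns)
    return SVC_MAP["endpoint"], params


q = Query(columns=["customer_id", "country"], filters={"country": "US"})
print(to_sql(q))
print(to_service_call(q))
```

The application states the query once, in virtual-schema terms. Each translator consults its own mapping, so the source-specific names never leak upward.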
The EII platform must also coordinate and optimize translated query execution across multiple data sources. Known as federated query processing, this effort requires optimization techniques that go beyond traditional SQL or XQuery optimizations. Capabilities vary across products in this area, with support for features such as rule-based or cost-based query optimization, caching and techniques for handling large result sets.
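One simple flavor of cost-based optimization in a federated join is to read the smaller source first and send only its join keys to the larger one. The sketch below shows that idea in Python. It is one possible strategy chosen for illustration, not a description of any product's optimizer; the `Source` class, its row-count estimate and the sample data are hypothetical.

```python
class Source:
    """Hypothetical source exposing a row-count estimate and a filtered scan."""

    def __init__(self, name, rows, key):
        self.name, self._rows, self.key = name, rows, key
        self.rows_shipped = 0  # rows returned to the platform, for illustration

    def estimated_rows(self):
        return len(self._rows)

    def scan(self, keys=None):
        """Return all rows, or only rows whose key is in `keys` (a pushed-down filter)."""
        rows = [r for r in self._rows if keys is None or r[self.key] in keys]
        self.rows_shipped += len(rows)
        return rows


def federated_join(a, b):
    """Join two sources on their keys, driving from the smaller estimate."""
    outer, inner = sorted((a, b), key=lambda s: s.estimated_rows())
    outer_rows = outer.scan()
    keys = {r[outer.key] for r in outer_rows}
    inner_by_key = {}
    for r in inner.scan(keys):  # only matching rows leave the larger source
        inner_by_key.setdefault(r[inner.key], []).append(r)
    return [{**o, **i} for o in outer_rows for i in inner_by_key.get(o[outer.key], [])]


vip = Source("vip_list", [{"cust": 2, "tier": "gold"}], key="cust")
orders = Source(
    "orders",
    [{"customer_id": n % 5, "order_id": n} for n in range(100)],
    key="customer_id",
)

result = federated_join(orders, vip)
print(len(result), "joined rows;", orders.rows_shipped, "rows shipped from orders")
```

Without the pushed-down key filter, all 100 order rows would cross the wire before the join. A source that cannot filter at all would force that full scan, which is one reason capabilities differ so much between sources.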
End-user application performance will depend greatly on the optimization techniques employed by the EII platform. The platform should shield application performance from the characteristics of the underlying data sources: whether a source is queryable, whether it can provide statistics for cost estimation and whether it provides any indexing. EII products will differentiate themselves on how well they optimize and orchestrate queries across disparate data sources.
When merging multiple data sources into a virtual whole, semantic inconsistencies are inevitable. They can range from mundane issues such as differing key values to vocabulary management and deeper semantic differences in how data attributes are used and understood. EII platforms should address this aspect of information integration by giving data architects the ability to identify these issues during data mapping and to apply rule checking that can spot inconsistencies and potentially remedy them at runtime. The platforms can also include correlation (or concordance) repositories that can help bridge the differences between individual information sources.
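A concordance repository can be pictured as a lookup from each source's local key to a shared enterprise key. The Python sketch below assumes a simple in-memory table and invented key formats. It also sets aside records it cannot resolve so they can be reviewed, which is one way to surface inconsistencies at runtime.

```python
# Hypothetical concordance table: (source, local key) -> canonical enterprise key.
CONCORDANCE = {
    ("crm", "C-0042"): "CUST-42",
    ("billing", "42"): "CUST-42",
    ("crm", "C-0077"): "CUST-77",
}


def canonical_key(source, local_key):
    """Look up the enterprise-wide key; None flags an unresolved record."""
    return CONCORDANCE.get((source, local_key))


def merge(records):
    """Group records from several sources under their canonical key.

    Records with no concordance entry are returned separately so a data
    architect can review them rather than having them silently dropped.
    """
    merged, unresolved = {}, []
    for source, local_key, attrs in records:
        key = canonical_key(source, local_key)
        if key is None:
            unresolved.append((source, local_key))
            continue
        merged.setdefault(key, {}).update(attrs)
    return merged, unresolved


records = [
    ("crm", "C-0042", {"name": "ACME"}),
    ("billing", "42", {"balance": 120.0}),
    ("billing", "99", {"balance": 5.0}),
]
merged, unresolved = merge(records)
print(merged)
print(unresolved)
```

Here the CRM and billing records for the same customer end up under one key even though the two systems identify that customer differently.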
The Same Enterprise Language
Each application's information needs are different. Yet being able to leverage the current set of enterprise information sources as much as possible enables users to make better decisions and base their work on a common source of enterprise truth. An EII approach offers a flexible platform for unifying the information sources that applications need while making access and semantic integration complexities invisible to users -- and enabling the integration of real-time data. EII is a useful approach for evolving data architectures to become more flexible in meeting dynamic enterprise information needs.