NOV 13, 2007 3:49pm ET


EII: Achieving Scalability and Cost-Efficiency


The Problem of Heavy Middleware

The 1990s saw considerable work on enterprise information integration (EII) technologies aimed at a ubiquitous problem: integrating data across multiple, distributed and possibly heterogeneous data sources in the enterprise. Software vendors, from startups to larger players, offered products built on a middleware architecture in which the middleware provides an access layer across the data sources being integrated. This contrasts with the data warehousing approach, in which all data is loaded into and centralized in one place, the warehouse, for further analysis. Toward the late 1990s XML gained prevalence, addressing syntactic heterogeneity across information sources but not semantic heterogeneity.

While functional, the middleware approach to data integration had a key problem: managing and reconciling schemas demanded significant effort and resources. This covered both the schemas describing data in the individual information sources and the linkages across schemas needed to form an integrated view of the information. The time and resources required for schema management became a key impediment to making EII technology scalable and cost-effective for large applications. Indeed, as observed in an EII technology review, a thread connecting the key impediments for EII is the need to address modeling and metadata management, which is the highest cost item in the first place.1 The original vision of intelligent information integration, nimbly achieving integrated access to information sources on demand, went awry. I trace this to some tacit, incorrect assumptions about how enterprise data should be managed and integrated. These assumptions, each followed by an alternative approach, are:

  • Data must always be stored and managed in a DBMS. In practice, application requirements vary greatly, ranging from data that can well be stored in document-oriented formats such as spreadsheets or text reports to data that does indeed require DBMS storage.
  • The database must always provide and manage the structure and semantics of the data through formal schemas. Alternatively, the database can be nothing more than intelligent storage. Data can be stored generically, and structure and semantics (schema) can be imposed by clients as needed.
  • Managing multiple schemas from several independent sources, and the interrelationships between them (schema chaos), is inevitable and unavoidable. Alternatively, any schema can be imposed by the clients, only as and when applications need it.
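The alternative in the last two points, generic storage with structure imposed by clients on read, can be sketched in a few lines. This is only an illustration under stated assumptions and not the design of any particular product: the `STORE` triples, the `impose_schema` helper and all field names are hypothetical.

```python
from collections import defaultdict

# Generic storage: each item is (record id, field name, raw text value).
# No schema is declared up front. All names and values are made up.
STORE = [
    ("r1", "project", "Alpha"),
    ("r1", "budget", "1200"),
    ("r1", "notes", "pending review"),
    ("r2", "project", "Beta"),
    ("r2", "budget", "850"),
]


def impose_schema(store, schema):
    """Build typed records on read, keeping only the fields a client asks for.

    `schema` maps a field name to a callable that converts the raw text.
    Records that lack any requested field are skipped.
    """
    raw = defaultdict(dict)
    for record_id, field, value in store:
        raw[record_id][field] = value

    records = []
    for record_id in sorted(raw):
        fields = raw[record_id]
        if all(name in fields for name in schema):
            records.append({name: cast(fields[name]) for name, cast in schema.items()})
    return records


# Two clients read the same storage through different, independent schemas.
budget_view = impose_schema(STORE, {"project": str, "budget": int})
notes_view = impose_schema(STORE, {"project": str, "notes": str})
```

Neither view requires the store to know about the other, so adding a client with a new schema does not involve reconciling it with existing ones.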

The Solution: Lean Middleware

Figure 1: Fragments in Unstructured Data

Middleware technology should be a cost-effective solution, not part of the problem as it is now. At the NASA Ames Research Center we have designed and developed a data management and integration system called NETMARK that achieves data integration across multiple structured and unstructured data sources in a highly scalable and cost-efficient manner. A key focus is querying and integrating (originally) unstructured data such as reports (in Word, PDF and other formats), spreadsheets (Excel) and presentations (PowerPoint), given that the bulk of enterprise data is indeed unstructured. A new paradigm we introduce is context-sensitive querying and search, which is best illustrated with examples. Consider the (Word) report in Figure 1. It comprises several fragments, i.e., sections and subsections such as the project summary section, the background subsection, etc. An Excel spreadsheet can likewise be fragmented into rows, cells, tables or sets thereof; similarly, a PowerPoint slide typically comprises a slide title and slide content. Each such section or subsection is considered a context. For instance, the report has a 'background' context, the example slide has a 'constellation spirals' context, etc. The actual material in a fragment, that is the text, graphics or other material, is referred to as content. For instance, the text in the background paragraph is the content associated with that context.

Figure 2: Context- and Content-Based Queries
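The context and content distinction can be made concrete with a small sketch. It is not NETMARK's implementation or API; the `Fragment` class, the `fragment` and `query` helpers, the sample report text and the convention that a heading line starts with "# " are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    document: str  # name of the source document
    context: str   # the heading the fragment sits under
    content: str   # the text found under that heading


def fragment(document, text):
    """Split plain text into (context, content) fragments.

    Assumption: a line starting with "# " is a heading. Text that appears
    before the first heading is dropped in this sketch.
    """
    fragments = []
    context, lines = None, []
    for line in text.splitlines():
        if line.startswith("# "):
            if context is not None:
                fragments.append(Fragment(document, context, " ".join(lines)))
            context, lines = line[2:].strip(), []
        elif line.strip():
            lines.append(line.strip())
    if context is not None:
        fragments.append(Fragment(document, context, " ".join(lines)))
    return fragments


def query(fragments, context=None, content=None):
    """Case-insensitive substring match on heading, body text, or both."""
    hits = []
    for f in fragments:
        if context is not None and context.lower() not in f.context.lower():
            continue
        if content is not None and content.lower() not in f.content.lower():
            continue
        hits.append(f)
    return hits


REPORT = """# Project Summary
This project studies data integration.
# Background
Earlier middleware required heavy schema management.
# Budget
Schema work was the largest cost item.
"""

fragments = fragment("report.doc", REPORT)
background = query(fragments, context="background")                  # context only
schema_hits = query(fragments, content="schema")                     # content only
budget_schema = query(fragments, context="budget", content="schema") # both
```

A context-only query returns whole sections by heading, a content-only query behaves like keyword search, and combining the two restricts a keyword search to sections of a given kind.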
