The Problem of Heavy Middleware
The 1990s witnessed significant activity in the development of enterprise information integration (EII) technologies, which aimed to address the ubiquitous problem of integrating data across multiple, distributed and possibly heterogeneous data sources in the enterprise. Software vendors, both startups and larger players, offered software based on a middleware architecture, in which the middleware provides an access layer across the various data sources being integrated. This contrasts with the data warehousing approach, where all data is loaded into and centralized in one place, the warehouse, for further analysis. Toward the late 1990s, XML gained prevalence, addressing the problem of syntactic heterogeneity but not semantic heterogeneity across different information sources.
While the middleware approach to data integration was functional, a key problem was that significant effort and resources were required for managing and reconciling schemas: schemas describing the data in the individual information sources, as well as linkages across schemas that form an integrated view of the information. The time and resources required for schema management became a key impediment to EII technology being scalable and cost-effective for large applications. Indeed, as observed in an EII technology review, a common thread among the key impediments for EII is modeling and metadata management, which is the highest-cost item in the first place.[1] The original vision of intelligent information integration, nimbly achieving integrated access to information sources on demand, went awry. I trace this to some tacit, incorrect assumptions regarding how enterprise data should be managed and integrated. These assumptions, each paired with an alternative approach, are:
- Data must always be stored and managed in DBMS systems. Actually, requirements of applications vary greatly, ranging from data that can well be stored in document-oriented formats such as spreadsheets or text reports to data that does indeed require DBMS storage.
- The database must always provide for and manage the structure and semantics of the data through formal schemas. Alternatively, the database can be nothing more than intelligent storage. Data could be stored generically, and the imposition of structure and semantics (schema) may be done by clients as needed.
- Managing multiple schemas from several independent sources, and the interrelationships between them (schema chaos), is inevitable and unavoidable. Alternatively, any imposition of schema can be done by the clients, only as and when needed by applications.
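The idea of clients imposing a schema only when needed can be sketched in a few lines of Python. This is an illustrative sketch, not NETMARK code: the record layout, the `BudgetLine` type and the `as_budget_lines` function are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical generic records: the store keeps only labelled text,
# with no types or application schema attached.
RECORDS = [
    {"context": "Budget", "content": "FY2005 total: 1200"},
    {"context": "Budget", "content": "FY2006 total: 1500"},
    {"context": "Background", "content": "Project history and goals."},
]


@dataclass
class BudgetLine:
    """Schema defined by one client application, not by the store."""
    year: int
    total: int


def as_budget_lines(records):
    """Impose the BudgetLine schema on just the records this client needs."""
    lines = []
    for rec in records:
        if rec["context"] != "Budget":
            continue  # other records stay untyped and untouched
        label, value = rec["content"].split(":")
        year = int(label.split()[0][2:])  # "FY2005" -> 2005
        lines.append(BudgetLine(year=year, total=int(value)))
    return lines


budget = as_budget_lines(RECORDS)
```

A second client could read the same records with a different schema, or with none at all, without any change to the stored data.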
The Solution: Lean Middleware
Figure 1: Fragments in Unstructured Data
Middleware technology should be a cost-effective solution, not part of the problem as it is now. At the NASA Ames Research Center we have designed and developed a data management and integration system, called NETMARK, that achieves data integration across multiple structured and unstructured data sources in a highly scalable and cost-efficient manner. The querying and integration of (originally) unstructured data such as reports (in formats such as Word, PDF and others), spreadsheets (Excel) and presentations (PowerPoint) is a key focus, given that the bulk of enterprise data is indeed unstructured. A new paradigm we introduce is that of context-sensitive querying and search. Let us illustrate this with examples. Consider the (Word) report in Figure 1. It comprises several fragments, i.e., sections and subsections such as the project summary section, background subsection, etc. A spreadsheet in Excel can also be fragmented into various rows, cells, tables or sets thereof; similarly, a PowerPoint slide typically comprises a slide title and slide content. Each such section or subsection is considered a context. For instance, in the report we have a background context, in the example slide we have a constellation spirals context, etc. The actual material in a fragment, that is, the text, graphics or other material, is referred to as content. For instance, the text in the background paragraph is the content associated with that context.
Figure 2: Context- and Content-Based Queries
Users pose queries in terms of context and content, which lets them search for and retrieve particular fragments of interest. We use an informal query syntax here for illustration. For instance, the query Context=Procurement returns all fragments, from a collection of documents, whose context contains the word procurement. Similarly, the query Context=Procurement & Content=Contract returns all fragments that contain the word contract within a context of Procurement.
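The two example queries can be mimicked over an in-memory list of fragments. The sketch below is only an approximation under stated assumptions: the `Fragment` type, the `query` function and the sample data are hypothetical, and matching is simplified to a case-insensitive substring test rather than whatever matching NETMARK actually performs.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    doc: str      # source document name
    context: str  # section or slide heading
    content: str  # text found under that heading


# Hypothetical sample collection spanning two documents.
FRAGMENTS = [
    Fragment("report1.doc", "Procurement Plan", "The contract will be awarded in Q3."),
    Fragment("report1.doc", "Background", "Prior contract history."),
    Fragment("slides.ppt", "Procurement", "Vendor selection criteria."),
]


def query(fragments, context=None, content=None):
    """Return fragments whose context and content contain the given terms.

    A term of None places no constraint on that field.
    """
    def has(term, text):
        return term is None or term.lower() in text.lower()

    return [f for f in fragments
            if has(context, f.context) and has(content, f.content)]


by_context = query(FRAGMENTS, context="Procurement")
by_both = query(FRAGMENTS, context="Procurement", content="Contract")
```

Here `by_context` corresponds to Context=Procurement and `by_both` to Context=Procurement & Content=Contract.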
Figure 3: The NETMARK Information Cycle
NETMARK supports such context- and content-oriented queries over a collection of unstructured data of literally any type common in the enterprise. This has proven to be a powerful and effective paradigm for retrieval in real applications. In addition to data management and integration, we have also considered other issues in the enterprise information lifecycle. Providing data to an integration application should be an easy process requiring minimal effort from the user. Yet many existing EII technologies require that any data to be integrated be massaged or marked up into a certain format, or wrapped for translation. NETMARK provides a capability where data can simply be provided as-is. Providers drag and drop their data (such as a folder with several reports, spreadsheets, etc.) into a NETMARK folder on their desktop, and the system then formats and structures it appropriately for integration. At the data consumer end, we further provide capabilities for quickly composing reports and presentations over the integrated data.
Finally, the system incorporates and interoperates with open and widely used data representation and exchange standards. All data is ultimately represented and stored in the XML format, and open protocols such as WebDAV are used for client-server communications.
The NETMARK architecture is illustrated in Figure 4 below. Clients, i.e., data producers and providers, data consumers, or both, access NETMARK through a Web interface; the context and content queries above illustrate this interface.
Figure 4: NETMARK Architecture
The NETMARK Daemon and the SGML Parser provide functionality for loading data (documents) into NETMARK. A continually running process (the daemon) reads in any new documents inserted into a NETMARK folder and then invokes an SGML parser to structure them and load them into the NETMARK XML data store. The data store is a relational DBMS.
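The loading loop described above can be approximated by a simple folder-polling process. This is a minimal sketch under assumptions, not the NETMARK daemon itself: `scan_once`, `run_daemon`, the polling interval and the `load` callback (standing in for the parse-and-store step) are all hypothetical.

```python
import time
from pathlib import Path


def scan_once(folder, seen, load):
    """Load every file in `folder` not seen before; return the new file names."""
    new = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.name not in seen:
            load(path)           # stand-in for parsing and storing the document
            seen.add(path.name)
            new.append(path.name)
    return new


def run_daemon(folder, load, interval=5.0):
    """Poll `folder` forever, loading new documents as they appear."""
    seen = set()
    while True:
        scan_once(folder, seen, load)
        time.sleep(interval)
```

Polling is chosen here only for simplicity; a production system might instead rely on file-system change notifications.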
Figure 5: Data Storage in NETMARK
The context- and content-oriented manner in which all data is modeled leads to a very efficient mechanism for storing the data, coupled with efficient retrieval. The data storage cycle is as follows (see Figure 5). Unstructured data is provided to NETMARK by placing the data in a NETMARK folder. NETMARK then automatically structures the data and converts it to XML. This conversion is based on heuristics that take into account the document format (titles, headings, etc.) to fragment a document. Each document gets marked up as context and content blocks of XML. Each block is then represented as a node. We will not go into the details, but a node is essentially the fundamental unit that captures the information in each context and content. The nodes are stored in a relational table (called XML). Information about the original documents is maintained as well, in a second table (called DOC).
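The fragmentation step can be illustrated on plain text. The sketch below is hypothetical throughout: the heading heuristic (a short line without closing punctuation), the function names and the XML element names `document`, `node`, `context` and `content` are assumptions made for illustration, and the real system works from document formats such as Word or PDF rather than from raw text.

```python
import xml.etree.ElementTree as ET


def looks_like_heading(line):
    """Assumed heuristic: a short line that does not end in punctuation."""
    line = line.strip()
    return bool(line) and len(line) <= 60 and line[-1] not in ".:;,!?"


def fragment_to_xml(text):
    """Split text into context/content nodes under a single root element."""
    root = ET.Element("document")
    content = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if looks_like_heading(line):
            node = ET.SubElement(root, "node")
            ET.SubElement(node, "context").text = line.strip()
            content = ET.SubElement(node, "content")
            content.text = ""
        elif content is not None:
            # Append body text to the most recent heading's content.
            content.text = (content.text + " " + line.strip()).strip()
        # Text before the first heading is ignored in this sketch.
    return root


SAMPLE = """Project Summary
This project studies data integration.

Background
Earlier systems relied on schemas.
They were costly to maintain.
"""

root = fragment_to_xml(SAMPLE)
xml_text = ET.tostring(root, encoding="unicode")
```

Each `node` element pairs one context with its content, which is the unit that would then be written to the store.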
Note that with this representation strategy, the information in any document is ultimately stored in the same two relational tables, XML and DOC. This representation is independent of any schema associated with the document and is thus termed schema-less.
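A two-table layout of this kind can be sketched with SQLite. Only the table names XML and DOC come from the description above; the column names, the helper functions and the use of SQLite with LIKE matching are assumptions made for this example.

```python
import sqlite3


def create_store():
    """Create an in-memory store with the two tables, XML and DOC."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE DOC (
            doc_id   INTEGER PRIMARY KEY,
            filename TEXT NOT NULL
        );
        CREATE TABLE XML (
            node_id INTEGER PRIMARY KEY,
            doc_id  INTEGER NOT NULL REFERENCES DOC(doc_id),
            context TEXT NOT NULL,
            content TEXT NOT NULL
        );
    """)
    return conn


def load_document(conn, filename, fragments):
    """Insert one document and its (context, content) fragments."""
    cur = conn.execute("INSERT INTO DOC (filename) VALUES (?)", (filename,))
    doc_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO XML (doc_id, context, content) VALUES (?, ?, ?)",
        [(doc_id, ctx, body) for ctx, body in fragments],
    )
    return doc_id


def find(conn, context, content=""):
    """Return (filename, context, content) rows matching both terms."""
    rows = conn.execute(
        "SELECT DOC.filename, XML.context, XML.content "
        "FROM XML JOIN DOC ON DOC.doc_id = XML.doc_id "
        "WHERE XML.context LIKE ? AND XML.content LIKE ? "
        "ORDER BY XML.node_id",
        (f"%{context}%", f"%{content}%"),
    )
    return rows.fetchall()


conn = create_store()
load_document(conn, "report.doc", [("Procurement", "Contract awarded."),
                                   ("Background", "History.")])
load_document(conn, "sheet.xls", [("Procurement", "Vendor list.")])
```

A report and a spreadsheet land in the same two tables, so adding a new kind of document requires no new tables or schema changes.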
Our approach, as we have described, is to essentially do away with schemas for the most part and impose them only on an if-required basis. Two questions arise. Is this schema-less paradigm where we mostly issue context and content queries adequate and effective for real-world integration applications? Do we achieve any scalability gains with the simplified representation?
To answer the latter, we conducted extensive experiments (over context and content queries) comparing NETMARK with a system such as Berkeley XML. While we do not present the results in detail, they validate that with the schema-less approach we are able to process most queries very efficiently, typically 30 to 40 times faster than a system such as Berkeley XML.[2] To answer the former, we describe some applications using NETMARK in the NASA enterprise.
A key feature of the NETMARK storage system is the simplicity with which applications and users can manage and retrieve data. X-Path and X-Search based systems rely on a complex query language, are challenging for new and nontechnical users to adopt, and are not well-suited to ad hoc queries of information in a semistructured data store such as NETMARK. NETMARK provides the ease of use of a full-text search system with the capabilities of a semistructured data store for information retrieval and analysis.
NETMARK has been deployed in several applications in the NASA enterprise; further, it serves as the integration engine of other, more expansive systems for information and process management. With NETMARK, we have been able to assemble new integration applications very quickly with minimal software development effort (zero in many cases), typically requiring only about two person-days for system setup and application assembly. One such application is the analysis of mishap reports for aviation safety in NASA. Such reports are typically text reports describing the analysis of a range of accidents involving NASA and non-NASA aircraft. Using NETMARK we are able to select particular sections and subsections from multiple reports and load this information into data analysis and visualization tools. The integrated access and analysis capabilities have proved invaluable to aviation safety analysts at NASA, and the application was assembled with minimal effort and time.
The NX system is the result of a strategic collaboration between NASA and XEROX Corp. where NETMARK has been integrated with many XEROX Docushare capabilities for text and document management. NX offers a suite of capabilities in 1) content management, including capabilities for content and document management and sharing, distribution and collaborative sharing, and 2) content process management, i.e., business process activities such as tracking and compliance.
NASA has implemented the NX technology at six centers and in various programs, including the following: The NASA ISS (International Space Station) uses NX to mine information for historical decisions and safety assurance information; NASA Program Analysis and Evaluation (PA&E) adopted NX in 2005, which led to adoption by NASA's strategic management council; and most NASA centers use the NX platform, including Ames, Langley Research Center, Goddard Space Flight Center, Dryden, Jet Propulsion Lab, Johnson Space Center and NASA Headquarters.
Another enhanced application is the Program Management Tool (PMT), a custom-built business intelligence solution for NASA to successfully manage large programs. NETMARK is the underlying data management and integration engine for PMT. PMT enables program and task managers to communicate success-critical information on the status and progress of all program levels in an efficient and always current manner. PMT keeps track of project goals, risks, milestones and deliverables, and assists the proper allocation of financial, material and human resources. It is well integrated with other agency-wide information systems. PMT supports all essential program management activities and corresponding documents, such as the creation and monitoring of annual task plans, monthly reporting of technical, schedule, management and budget status, tracking budget phasing plans, analyzing program risks and mitigation strategies, reporting and evaluating project lifecycle costs, accessing convenient aggregated views, and automatically creating earned value management assessments, quad-charts and other reports. PMT also provides integrated access to multiple distributed resources across the NASA agency, namely the ERASMUS reporting system. ERASMUS is an executive reporting system and project performance dashboard that includes performance metrics of all NASA centers, programs, projects, and safety and health activities. PMT also provides integrated access to NASA Technology Inventory Database (an inventory of technologies developed by or under development at NASA) and the Integrated Financial Management System IFMP (an agency-wide information system supporting NASA financial management activities), thereby significantly reducing cost and time for entering the same data multiple times into different systems.
NETMARK for Noncommercial Use
It is the NETMARK development team's intention to release the NETMARK system under an open source license for general availability. The license, capabilities and process for receiving the software have yet to be determined, but NASA Ames Research Center has a history of releasing software into the open source community, and we expect that the NETMARK system will be released in this manner soon. Interested groups may contact David Maluf at David.A.Maluf@nasa.gov. Also, information about PMT, including a system overview, demonstrations and other documentation, can be obtained at http://pmt.arc.nasa.gov/.
1. A. Halevy, N. Ashish, D. Bitton, M. Carey, D. Draper, J. Pollock, A. Rosenthal and V. Sikka. "Enterprise Information Integration: Successes, Challenges and Controversies." ACM SIGMOD, 2005.
2. D. Maluf, D. Bell, N. Ashish, C. Knight and P. Tran. "Semi-Structured Data Management in the Enterprise: A Nimble, High-Throughput, and Scalable Approach." International Database Engineering and Applications Symposium (IDEAS), 2005.