The challenges for today's enterprise information integration systems are well understood. In order to manage and use information effectively within the enterprise, three barriers that increase the complexity of managing information have to be overcome: the diverse formats of content, the disparate nature of content and the need to derive "intelligence" from this content. Current software tools that look at structuring content by leveraging syntactic search and even syntactic meta data are not sufficient to handle these problems. What is needed is actionable information from disparate sources that reveals non-obvious insights and allows timely decisions to be made. A new concept known as semantic meta data is paving the way to finally realize the full value of information. Indeed, Tim Berners-Lee's vision for the next generation of the Web is termed the "semantic Web," where semantic meta data plays the pivotal role. By annotating or enhancing documents with semantic meta data, software programs can automatically understand the full context and meaning of each document and can make correct decisions about who can use the documents and how these documents should be used. This article looks at how semantic meta data is created and used within the enterprise.

Definition: Semantic Meta Data

Meta data that describes contextually relevant or domain-specific information about content (in the right context) based on an industry-specific or enterprise-specific custom meta data model or ontology is known as semantic meta data. For example, if the content is from the business domain, the relevant semantic meta data could be company name, ticker symbol, industry, sector, executives, etc., whereas if the content is from the intelligence domain, the relevant semantic meta data could be terrorist name, event, location, organization, etc. Meta data elements that offer greater depth and more insight "about the document" fall under the semantic meta data category.

In contrast, syntactic meta data focuses on elements such as size of the document, location of a document or date of document creation that do not provide a level of understanding about what the document says or implies.
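To make the distinction concrete, here is a minimal Python sketch of a single meta data record that carries both kinds of elements. The class name, field names, file path and values are illustrative assumptions, not part of any particular product or standard.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DocumentMetaData:
    # Syntactic meta data: facts about the file, not about what it says.
    location: str
    size_bytes: int
    created: date
    # Semantic meta data: domain-specific elements mapped to their values.
    semantic: dict = field(default_factory=dict)


def describes(doc, element, value):
    """Return True if the document is tagged with the given semantic value."""
    return value in doc.semantic.get(element, [])


# Hypothetical business-domain document.
doc = DocumentMetaData(
    location="/news/oracle-bid.txt",   # hypothetical path
    size_bytes=4096,                   # hypothetical size
    created=date(2004, 1, 15),         # hypothetical date
    semantic={
        "company": ["Oracle", "PeopleSoft"],
        "ticker symbol": ["ORCL", "PSFT"],
    },
)

print(describes(doc, "ticker symbol", "ORCL"))
```

Only the semantic elements can answer a question such as "is this document about Oracle?"; the size, location and creation date cannot.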

Requirements for Next-Generation Enterprise Information Integration

Let us view the value of semantic meta data from the perspective of deriving business value via enterprise information integration. Semantic meta data can play a critical role in satisfying a number of requirements that customers are seeking from the next generation of information integration and analysis software:

  • Extract, organize and standardize (or normalize) information from many disparate and heterogeneous content sources (including structured, semi-structured and unstructured sources) and formats (database tables, XML feeds, PDF files, streaming media, internal documents), and static and dynamic (e.g., database-driven) sources that may be internal or external to the organization (including deep Web and open Web).
  • For a domain of choice, identify interesting and relevant knowledge (entities such as people's names, places, organizations, etc., and relationships between them) from heterogeneous sources and formats.
  • Analyze and correlate extracted information to discover previously unknown or non-obvious relationships between documents and/or entities based on semantics (not syntax) that can help in making business decisions.
  • Enable high levels of automation in the processes of extraction, normalization and maintenance of knowledge and content for improved efficiencies of scale.
  • Make efficient use of the extracted knowledge and content by providing tools that enable fast and high-quality (contextual) querying, browsing and analysis of relevant and actionable information.

Semantic meta data is a key enabler of text analytics to derive business value from information.

Creating Semantic Meta Data

In order to extract optimal value from a document and make it usable, it needs to be effectively tagged by analyzing and extracting relevant information of semantic interest. Many techniques can be used to achieve this based on extracting syntactic and semantic meta data from documents. These include:

Dictionaries and thesauri: Match words, phrases or parts of speech against a static or periodically maintained dictionary or thesaurus. Dictionaries such as WordNet can be used to identify and match terms in different directions, finding words that mean the same or are more general or more specific.

Document analysis: Look for patterns and co-occurrences, and apply predefined rules to find interesting patterns within and across documents.

Ontologies: Capture domain-specific (application or industry) knowledge including entities and relationships, both at a definitional level (e.g., a company has a CEO) and at an instance or assertional level that records real-world facts or knowledge (e.g., Meg Whitman is the CEO of eBay). If the ontology deployed is "one size fits all" and is not domain-specific, the full potential of this approach cannot be exploited.

The last option, also known as ontology-driven meta data extraction, is the most flexible (assuming the ontology is kept up to date to reflect changes in the real world) and comprehensive (since it allows modeling of fact-based domain-specific relationships between entities that are at the heart of semantic representations).
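As a rough illustration of the first technique, the following Python sketch matches the words in a text against a tiny hand-built thesaurus and looks up terms in the three directions mentioned above (same meaning, more general, more specific). The thesaurus entries and function names are assumptions made for this example; a real deployment would use a maintained resource such as WordNet.

```python
import re

# Hypothetical, hand-built thesaurus: head term -> related terms by direction.
THESAURUS = {
    "company": {
        "synonyms": {"firm", "corporation"},
        "broader": {"organization"},
        "narrower": {"bank"},
    },
    "acquisition": {
        "synonyms": {"takeover", "buyout"},
        "broader": {"transaction"},
        "narrower": set(),
    },
}


def expand(term, direction):
    """Return related terms in one direction: synonyms, broader or narrower."""
    entry = THESAURUS.get(term.lower(), {})
    return sorted(entry.get(direction, set()))


def match_terms(text):
    """Return head terms whose name or any single-word synonym occurs in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    hits = []
    for head, entry in THESAURUS.items():
        if head in words or entry["synonyms"] & words:
            hits.append(head)
    return sorted(hits)


print(match_terms("Oracle launched a takeover bid for the firm"))
print(expand("company", "broader"))
```

Matching of this kind normalizes vocabulary, but it yields no facts about specific entities or how they relate, which is the gap the ontology-driven approach is meant to fill.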

Definition: Ontology

An ontology is a shared conceptualization of the world as seen by the enterprise. Ontologies consist of definitional aspects such as high-level schemas and assertional aspects such as entities, attributes, interrelationships between entities, domain vocabulary and factual knowledge - all connected in a semantic manner. Ontologies and meta data provide the specific tools to organize and provide a useful description of heterogeneous content. The description incorporates as well as extends an automatic classification-supported approach of organizing content in a taxonomy.

In addition to the hierarchical relationship structure of typical taxonomies, ontologies enable cross-node horizontal relationships between entities, thus enabling easy modeling of real-world information requirements.
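The two levels of an ontology can be sketched in a few lines of Python. This is a simplified illustration under assumed names (SCHEMA, TAXONOMY, ENTITIES, FACTS), not a real ontology language: the schema says which classes a relationship may connect, the taxonomy holds the hierarchical structure, and the facts are horizontal, cross-node relationships between entity instances.

```python
# Definitional level: relationship name -> (subject class, object class).
SCHEMA = {
    "has CEO": ("Company", "Person"),
    "competes with": ("Company", "Company"),
}

# Hierarchical (taxonomy) structure: class -> parent class. Hypothetical.
TAXONOMY = {"Company": "Organization"}

# Assertional level: entity instances and facts about them.
ENTITIES = {
    "eBay": "Company",
    "Meg Whitman": "Person",
    "Oracle": "Company",
    "PeopleSoft": "Company",
}
FACTS = [
    ("eBay", "has CEO", "Meg Whitman"),
    ("PeopleSoft", "competes with", "Oracle"),
]


def is_valid(fact):
    """Check an asserted fact against the definitional schema."""
    subject, relationship, obj = fact
    if relationship not in SCHEMA:
        return False
    subject_class, object_class = SCHEMA[relationship]
    return ENTITIES.get(subject) == subject_class and ENTITIES.get(obj) == object_class


def is_a(entity, class_name):
    """Return True if the entity's class is class_name or one of its descendants."""
    current = ENTITIES.get(entity)
    while current is not None:
        if current == class_name:
            return True
        current = TAXONOMY.get(current)
    return False


print(all(is_valid(fact) for fact in FACTS))
print(is_a("eBay", "Organization"))
```

A pure taxonomy could answer only the second question; the named facts are what the ontology adds.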

Semantic Meta Data Extraction and Enhancement

Once the ontology is built and the document is classified into its domain, intelligent agents automatically extract semantic meta data from the document. Based on the classification of the document, contextually relevant semantic meta data (entities such as Microsoft and BEA Systems in Figure 1) are extracted from the ontology to enhance the existing meta data.

Figure 1: Ontology-Based Semantic Meta Data Extraction and Enhancement

The semantic meta data created for the document in Figure 1 would include both direct relationships and indirect relationships. The direct relationships extracted are: BEA Systems, Microsoft and PeopleSoft all engage in the "competes with" relationship with Oracle. An important characteristic of semantic meta data is that it includes named relationships (such as "competes with"); traditional statistical and co-occurrence analyses lead only to unnamed relationships. Named relationships tell us why entities are related, enabling more automation and deeper insight. Depending on the internal configuration, the expert domain agents can further enhance the extracted entities with semantically associated entities from the ontology. An example of this would be: HPQ and HD are traded on the NYSE, and BEAS, MSFT, ORCL and PSFT are components of the Nasdaq 100 index.
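The extraction and enhancement steps can be approximated with the following Python sketch. It is an illustration only: the ontology fragment is hand-written from the entities named above, the "has ticker" relationship and the sample sentence are assumptions, and simple substring matching stands in for whatever recognition the domain agents actually perform.

```python
# Hypothetical ontology fragment as (subject, relationship, object) triples.
ONTOLOGY = [
    ("BEA Systems", "competes with", "Oracle"),
    ("Microsoft", "competes with", "Oracle"),
    ("PeopleSoft", "competes with", "Oracle"),
    ("BEA Systems", "has ticker", "BEAS"),
    ("Microsoft", "has ticker", "MSFT"),
    ("Oracle", "has ticker", "ORCL"),
    ("PeopleSoft", "has ticker", "PSFT"),
]
COMPANIES = ["BEA Systems", "Microsoft", "Oracle", "PeopleSoft"]


def extract_entities(text):
    """Return the known companies that are mentioned in the text."""
    return [name for name in COMPANIES if name in text]


def direct_relationships(entities):
    """Return named relationships whose two ends were both extracted."""
    found = set(entities)
    return [t for t in ONTOLOGY if t[0] in found and t[2] in found]


def enhance(entities):
    """Add ticker symbols associated with the extracted companies."""
    found = set(entities)
    return [t for t in ONTOLOGY if t[0] in found and t[1] == "has ticker"]


# Hypothetical sentence, not the document shown in Figure 1.
text = "Oracle faces growing pressure from Microsoft and BEA Systems."
entities = extract_entities(text)
for subject, relationship, obj in direct_relationships(entities):
    print(subject, relationship, obj)
for subject, relationship, obj in enhance(entities):
    print(subject, relationship, obj)
```

Because every triple carries its relationship name, the output says why two entities are linked, which a co-occurrence count over the same sentence could not.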

The next stage is to identify the indirect relationships. The use of semantic associations allows entities not explicitly mentioned in the text to be inferred or linked to a document by incorporating such associated entities in the tagging of the document. This one-step-removed linking is referred to as "indirect relationships." The relationships that are retained are application specific and completely customizable, and their inclusion makes it possible to traverse relationship chains to more than one level from within the document.
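One way to picture indirect relationships is as a bounded traversal over the ontology, starting from the entities mentioned in the text. The Python sketch below assumes a small hand-written set of triples and hypothetical parameter names (retained, max_hops); which relationships are retained and how many levels are followed would be application-specific choices.

```python
from collections import deque

# Hypothetical ontology fragment as (subject, relationship, object) triples.
TRIPLES = [
    ("Microsoft", "competes with", "Oracle"),
    ("Oracle", "has ticker", "ORCL"),
    ("ORCL", "component of", "Nasdaq 100"),
    ("Microsoft", "has ticker", "MSFT"),
    ("MSFT", "component of", "Nasdaq 100"),
]


def indirect_entities(mentioned, triples, retained, max_hops=1):
    """Return {entity: hops} for entities linked to, but not among, the mentioned ones."""
    # Build an undirected neighbour map using only the retained relationships.
    neighbours = {}
    for subject, relationship, obj in triples:
        if relationship in retained:
            neighbours.setdefault(subject, set()).add(obj)
            neighbours.setdefault(obj, set()).add(subject)

    found = {}
    seen = set(mentioned)
    queue = deque((entity, 0) for entity in sorted(mentioned))
    while queue:
        entity, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not follow chains beyond the configured depth
        for neighbour in sorted(neighbours.get(entity, ())):
            if neighbour not in seen:
                seen.add(neighbour)
                found[neighbour] = hops + 1
                queue.append((neighbour, hops + 1))
    return found


all_relationships = {"competes with", "has ticker", "component of"}
print(indirect_entities({"Microsoft"}, TRIPLES, all_relationships, max_hops=1))
print(indirect_entities({"Microsoft"}, TRIPLES, all_relationships, max_hops=2))
```

With one hop, a document that mentions only Microsoft is also tagged with Oracle and MSFT; with two hops it additionally reaches ORCL and the Nasdaq 100.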

It is very important to keep in mind that semantic meta data is useful only because it is specific to a domain, in this case business. If there were an insignificant reference to Donald Rumsfeld as the chief guest for a business occasion, then extraction of his name as a politician would be of no value, because Donald Rumsfeld is not contextually relevant semantic meta data and, therefore, does little to describe the overall content. The context of the content item is business, and only business-specific semantic meta data elements (such as company names, ticker symbols, financial indices, etc.) can accurately serve as descriptors of the overall content. In other words, the domain-specificity of the semantic meta data elements is key to establishing the right context and relevance.
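The effect of domain-specificity can be sketched as a simple filter in Python. The mapping of domains to element types below is an assumption built from the examples in this article, and the candidate list is hypothetical.

```python
# Hypothetical mapping: domain -> semantic meta data element types relevant to it.
DOMAIN_ELEMENTS = {
    "business": {"company", "ticker symbol", "financial index"},
    "intelligence": {"terrorist name", "event", "location", "organization"},
}


def relevant_meta_data(candidates, domain):
    """Keep only the extracted (value, element type) pairs relevant to the domain."""
    allowed = DOMAIN_ELEMENTS[domain]
    return [(value, kind) for value, kind in candidates if kind in allowed]


# Hypothetical candidates found in a business news item.
candidates = [
    ("Oracle", "company"),
    ("ORCL", "ticker symbol"),
    ("Donald Rumsfeld", "politician"),
]

print(relevant_meta_data(candidates, "business"))
```

The politician is dropped because that element type is not part of the business domain model, even though the name does occur in the text.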

As an example, an equity sales manager using semantic meta data technology would be able to identify relevant content that provides more insight into companies and banks in his sector. Such content leverages knowledge in the ontology and goes beyond mere keyword search, so that the manager is able to retrieve content not only on a topic such as CRM (customer relationship management) but also on related technologies, companies, sectors, etc., giving a more comprehensive 360-degree view of CRM.

The use of ontologies provides the context for creating accurate semantic meta data, which is the key to providing actionable information and business insight within the framework of information integration. From data integration to application integration, the value of meta data has been long recognized. It is, however, only with the progression from syntax and structure to semantics (see Figure 2) that an increase in the control and the creation of business insight from documents will occur. It is through semantic meta data that both humans and software can start to associate meaning with documents.

Figure 2: Types of Meta Data Enabling Business Analytics