If you were to use a standard Web search engine and search for the word "sap," it is likely that most of the results would pertain to a particular software vendor whose products are used for enterprise resource planning (ERP) and business intelligence (BI). However, let's assume that you are a high school student doing a science report on maple syrup. You are less interested in tracking inventory and more interested in how maple tree sap is used to make food products. Or, perhaps you are interested in neither ERP nor botany, but rather in nineteenth-century police equipment. In fact, there are a number of other areas of interest that use the "sap" character string. In any of these cases, to get relevant results, you would have to add words (such as "tree" for the maple syrup, or "blackjack" for the police weaponry) to try to weed out inappropriate ones.

By providing additional phrases, you hope to limit the search to those pages whose information reflects the co-occurrence of the different words. By doing so, you are transcending the boundary between syntactic searching and semantic searching. Syntactic searching is based on finding all references embedding a particular string of characters bounded by white space; no meaning is implied at all. By providing two (or more) character strings, you are asserting a meaningful relationship between the concepts associated with those character strings, even though the distinction is irrelevant to the search engine. This simple introduction of meaning into the search can significantly narrow the results to those that are relevant, which ultimately improves the searching experience.
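
To make the distinction concrete, here is a minimal sketch of purely syntactic matching and of how a second term narrows it through co-occurrence. The three-page corpus and the `tokens` and `search` helpers are hypothetical, invented for illustration; a real engine does far more.

```python
import string

# Hypothetical three-page corpus; the page names and text are made up.
PAGES = {
    "vendor": "SAP sells ERP and business intelligence software.",
    "syrup": "Maple tree sap is boiled down to make syrup.",
    "police": "A sap, or blackjack, was carried by some police officers.",
}


def tokens(text):
    """Split on white space, strip surrounding punctuation, and lowercase."""
    return {word.strip(string.punctuation).lower() for word in text.split()} - {""}


def search(pages, *terms):
    """Return the names of pages that contain every term as a whole token."""
    wanted = {term.lower() for term in terms}
    return sorted(name for name, text in pages.items() if wanted <= tokens(text))


print(search(PAGES, "sap"))               # every page matches the bare string
print(search(PAGES, "sap", "tree"))       # only the maple syrup page remains
print(search(PAGES, "sap", "blackjack"))  # only the police equipment page remains
```

Note that the matcher never learns what "tree" means. The searcher supplies the meaning; the engine only checks that both strings appear on the same page.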

If an inadvertent introduction of semantics into the search process increases productivity, then it would be great to be able to incorporate semantics as an integral component of the application. This implies that:

  1. Semantic classification is incorporated into the document processing.
  2. Semantic hierarchies are included in the client application.

For the first task, let's recall that the job of the search engine's back end is to crawl through linked documents and generate an inverted index (sometimes called a reverse index) that maps relevant words found in a document back to the location of that document. Semantic classification is the process of assigning a set of meaningful concepts to the content within a document. As an example, a document containing the words "SAP," "ERP," "business" and "financials" would allow the back end to associate this document with the software vendor, as well as other concepts such as "management software," "resource planning," etc. On the other hand, a document containing the words "sap," "tree" and "botany" could be associated with the concept "trees."

To add semantic classification, the back end would have to not only archive the index, but also review the content within the document to determine whether specific concepts can be related to that content. The words that are extracted from the document need to be compared to a concept hierarchy that can help in document classification. Luckily, there are companies and organizations that are preparing taxonomies and ontologies that provide this functionality. Taxonomies are ordered classification systems consisting of hierarchies of words or concepts used to assign meta information about the topics covered within a document. An ontology can be loosely defined as a collection of taxonomies together with a framework for expressing relationships and assigning concepts to documents (or any entities) based on those taxonomies.
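
As a rough sketch of both back-end steps, the example below builds a word-to-document index and then tags each document with concepts from a toy taxonomy. Everything here is an assumption made for illustration: the two documents, the `TAXONOMY` table with its parent concepts and indicator words, and the `min_hits` threshold. Real taxonomies are much larger, and real classifiers are rarely this simple.

```python
from collections import defaultdict

# Hypothetical documents, reduced to the words mentioned in the text.
DOCS = {
    "doc1": "SAP ERP business financials",
    "doc2": "sap tree botany",
}

# Hypothetical taxonomy: concept -> (parent concept, indicator words).
TAXONOMY = {
    "management software": ("software", {"erp", "business", "financials"}),
    "trees": ("botany", {"tree", "botany", "maple"}),
}


def build_index(docs):
    """Map each lowercased word to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def classify(words, taxonomy, min_hits=2):
    """Return concepts sharing at least min_hits indicator words with words."""
    found = set(words)
    return sorted(
        concept
        for concept, (_parent, indicators) in taxonomy.items()
        if len(found & indicators) >= min_hits
    )


index = build_index(DOCS)
concepts = {
    doc_id: classify(text.lower().split(), TAXONOMY)
    for doc_id, text in DOCS.items()
}

print(sorted(index["sap"]))  # both documents contain the string "sap"
print(concepts)              # but each is tagged with a different concept
```

The index alone cannot tell the two documents apart for the query "sap"; the concept tags stored alongside it can.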

The second task is to impart the same kind of hierarchical breakdown to the searcher so that he or she can assist the search engine in narrowing the search. One way to do this is to treat the character string entered in the search field as its own virtual document and attempt to classify it using the same ontologies and taxonomies used for the corpus of searchable documents. By doing so, the search engine can attempt to narrow the field a priori and only present categories of documents based on concepts associated with the search terms provided.
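
One way to picture the "virtual document" idea is the sketch below, which classifies the query with the same indicator words used for the corpus and returns only documents that share a concept with it. The concept tags, the `CONCEPT_WORDS` table, and the `classify_query` and `narrow` helpers are all hypothetical, and falling back to the full corpus when no concept matches is just one possible design.

```python
# Hypothetical concept tags, as a back end might assign them at indexing time.
DOC_CONCEPTS = {
    "doc1": {"management software"},
    "doc2": {"trees"},
    "doc3": {"police equipment"},
}

# Hypothetical indicator words for each concept.
CONCEPT_WORDS = {
    "management software": {"erp", "business", "financials"},
    "trees": {"tree", "botany", "maple"},
    "police equipment": {"blackjack", "police"},
}


def classify_query(query):
    """Treat the query as a tiny document and return the concepts it touches."""
    words = set(query.lower().split())
    return {
        concept
        for concept, indicators in CONCEPT_WORDS.items()
        if words & indicators
    }


def narrow(query):
    """Return documents sharing a concept with the query, or all if none match."""
    wanted = classify_query(query)
    if not wanted:
        return sorted(DOC_CONCEPTS)  # ambiguous query: nothing to narrow by
    return sorted(
        doc_id for doc_id, tags in DOC_CONCEPTS.items() if tags & wanted
    )


print(narrow("sap tree"))       # narrowed to the document tagged "trees"
print(narrow("sap blackjack"))  # narrowed to the police equipment document
print(narrow("sap"))            # no concept found, so every document remains
```

For an ambiguous query such as "sap" alone, a client could instead list the available concepts and let the searcher pick one, which is the kind of assistance the second task describes.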

There are a number of benefits to using semantics, both for the host of the searching system and for the knowledge worker. The end client is provided with a potentially faster and more focused search. The search engine host benefits because the extra investment in computation at the start of the search can reduce the computation needed to complete a successful search, allowing the host to handle a greater volume of searches.

Are there problems with this idea? Absolutely. It is possible that words are used with completely proper meanings, yet the tone of the document deals with a very different set of concepts, resulting in improper classification. However, this kind of semantic approach should evolve over time and yield a more refined classification environment. That should give some comfort to the poor sap whose strength has been sapped as the result of that SAP migration.
