
Meta Data Crawlers

  • February 01 1999, 1:00am EST

The purpose of meta data is to help business analysts understand their decision support data, form better queries and get better answers. Like the data it describes, meta data is usually stored and accessed in an organized fashion, and its effectiveness is based on its quality, its quantity and especially on its accessibility.

Efficiently accessing or searching meta data can be challenging. Although keyword searches are simple, they are often inefficient, producing unpredictable and overwhelming results when the keyword has multiple meanings. Frequently, the analyst is left to pick through extraneous, irrelevant information or compose elaborate compound keyword searches (if the meta data search engine supports them).

For example, suppose I'm searching for information on "boxers." Do I mean dogs, pugilists or undergarments? If I can't specify the context, then I'll probably get information on all three. If only I could search my meta data by context. But how?
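To make the problem concrete, here is a minimal sketch of a plain keyword search over a tiny meta data catalog. The catalog entries, their descriptions and the function name are all made up for illustration; no particular meta data browser works exactly this way.

```python
# Hypothetical meta data catalog: entry name -> free-text description.
# Every name and description here is invented for illustration.
CATALOG = {
    "breed_registry": "boxer dog breed records with litter and gestation data",
    "fight_results": "boxer bout outcomes by weight class",
    "apparel_sales": "boxer shorts sales by store and season",
    "store_dim": "store dimension with region and manager",
}


def keyword_search(keyword, catalog=CATALOG):
    """Return the names of entries whose description contains the keyword."""
    keyword = keyword.lower()
    return sorted(
        name
        for name, text in catalog.items()
        if keyword in text.lower().split()
    )


# The keyword alone carries no context, so all three meanings come back.
print(keyword_search("boxer"))
# ['apparel_sales', 'breed_registry', 'fight_results']
```

The search does exactly what it was asked to do, and that is the trouble: dogs, fighters and undergarments are all legitimate matches for the single word.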

The other day, my daughter came home from first grade just ecstatic. She had learned a clever way to figure out the meaning of a new word: look at the other words around it. The answer, then, is to expand the scope of the meta data search criteria from a keyword to a question, and then evaluate the potential relationships between the words that form the question to determine the context.

Although most of today's meta data browsers don't support context-based searches, a few vendors are starting to offer tools with intelligent browsing capabilities. (See Figure 1 for examples.) Many of these new search tools can be found in electronic document management systems (EDMS) technology and utilize the previously mentioned logic.

Company Name    Product      Web Site
PCDocs          Fulcrum
Documentum      Relevance
Figure 1: Tools with Intelligent Browsing Capabilities

First, these tools break down a question into its components (individual words, compound words, phrases, etc.). Next, to get an understanding of the question's true meaning, each component is examined in context with the surrounding components, and each likely meaning of the component is tested against a stored lexicon. Finally, a context-based search is performed. This is equivalent to removing static: the unnecessary noise of irrelevant information. The result set is minimized to include only those items matching the context of the components. For PCDocs' Fulcrum, the process that reduces the search to relevant, in-context information is called (hang on to your hat...) "Semantic Disambiguation."

So, using the boxer example again, I could specify that I want meta data on "boxer gestation period and average litter sizes." Oh, now you get the idea that I am talking about dogs, not underwear or fighters. How? Because we know that in the context of litters and gestation, the other two meanings are inappropriate. So the search begins with a much clearer focus.
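The three steps can be sketched in a few lines of code. This is only a toy illustration under stated assumptions: the lexicon, its cue words, the catalog entries and every function name are hypothetical, and the scoring (counting cue words that appear in the question) is a stand-in I chose for simplicity, not the algorithm any vendor actually uses.

```python
import re

# Hypothetical lexicon: for each ambiguous term, its possible meanings and
# the context words that tend to accompany each meaning.
LEXICON = {
    "boxer": {
        "dog": {"litter", "gestation", "breed", "puppy", "kennel"},
        "fighter": {"bout", "ring", "knockout", "heavyweight"},
        "undergarment": {"cotton", "waistband", "shorts", "apparel"},
    }
}

# Hypothetical catalog: each entry is tagged with the meaning it refers to.
ENTRIES = [
    {"name": "breed_registry", "term": "boxer", "sense": "dog"},
    {"name": "fight_results", "term": "boxer", "sense": "fighter"},
    {"name": "apparel_sales", "term": "boxer", "sense": "undergarment"},
]


def components(question):
    """Step 1: break the question into lowercase word components."""
    return re.findall(r"[a-z]+", question.lower())


def disambiguate(question, lexicon=LEXICON):
    """Step 2: test each meaning of an ambiguous term against its neighbors.

    Returns {term: meaning}, with None when the context does not decide.
    """
    words = components(question)
    resolved = {}
    for term in words:
        senses = lexicon.get(term)
        if not senses:
            continue  # not an ambiguous term we know about
        context = set(words) - {term}
        scores = {sense: len(cues & context) for sense, cues in senses.items()}
        best = max(scores.values())
        winners = [sense for sense, n in scores.items() if n == best]
        resolved[term] = winners[0] if best > 0 and len(winners) == 1 else None
    return resolved


def context_search(question, entries=ENTRIES):
    """Step 3: keep only the entries that match the resolved meaning."""
    resolved = disambiguate(question)
    return [
        e["name"]
        for e in entries
        if e["term"] in resolved and resolved[e["term"]] in (None, e["sense"])
    ]


print(context_search("boxer gestation period and average litter sizes"))
# ['breed_registry']
print(context_search("boxer"))
# ['breed_registry', 'fight_results', 'apparel_sales']
```

With no surrounding words to go on, the sketch falls back to returning every meaning, which is exactly the behavior of a plain keyword search. It also matches words exactly, so a real tool would need to handle plurals, compound words and phrases.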

The strength of these tools is in the direction of their search algorithms, which focus less on full-text searches and more on smarter, inferential searches. These tools are able to understand the context in which the user of the Corporate Information Factory is working (the business process, task, mode of operation, etc.), thereby organizing and delivering only the information that is required and disregarding the rest.

Where the Corporate Information Factory will be in five years is hard to predict. Without a doubt, it will continue to expand in terms of the types of data captured: structured (e.g., calculations, derived fields, dimensions, and so on) and unstructured (e.g., e-mails, Web site information, comments and expertise from your analysts, etc.). Terabytes and terabytes of information will be made accessible to our business users. The ability to make sense of all this information will be greatly enhanced by the ability to reduce searches to what is relevant to the person using the environment.

Even more important, though, may be the future ability of these tools to unearth implicit concepts, relationships and content not immediately obvious or explicitly mentioned. For example, perhaps I am truly interested in raising boxers for show but didn't explicitly state it. By analyzing my questions about the litter, gestation and other factors in the query process, these tools will further refine and narrow the search, giving me just what I need: information about dog shows, winning boxers and their litter mates.
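One way such inference might work is to look across all the questions asked in a session rather than at each one alone. The sketch below is purely speculative: the topic profiles, the threshold and the function name are hypothetical, and counting cue words is a deliberately simple stand-in for whatever a real tool would do.

```python
from collections import Counter

# Hypothetical topic profiles: cue words that hint at an unstated interest.
TOPIC_CUES = {
    "dog shows": {"litter", "gestation", "pedigree", "conformation", "champion"},
    "veterinary care": {"vaccination", "diet", "gestation", "illness"},
}


def infer_interest(session_queries, topic_cues=TOPIC_CUES, min_hits=2):
    """Guess an implicit topic from the words used across a whole session.

    Returns the best-scoring topic, or None if the evidence is too thin.
    """
    seen = Counter()
    for query in session_queries:
        # Count each word once per query so a repeated word in a single
        # question does not dominate the score.
        seen.update(set(query.lower().split()))
    scores = {
        topic: sum(seen[cue] for cue in cues)
        for topic, cues in topic_cues.items()
    }
    topic, hits = max(scores.items(), key=lambda item: item[1])
    return topic if hits >= min_hits else None


session = [
    "boxer gestation period",
    "average boxer litter size",
    "boxer pedigree records",
]
print(infer_interest(session))  # dog shows
```

No single question mentions dog shows, but taken together the three questions point that way. The `min_hits` threshold keeps the sketch from guessing when a session offers only one weak hint.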

Certainly with usage comes a lot of sophistication and refinement from our business community, and a reluctance to change. Acceptance of this new technology won't happen until the performance of the tools overcomes the cultural resistance to using them. But I think we can envision a day when our tools are smart enough to cut through the superfluous amounts of information, delivering just the right stuff at the right time.

Perhaps in the not-so-distant future, these tools will start to anticipate our requests, based on our recent activities, creating queries and result sets while we sleep. It could happen.
