© 2019 SourceMedia. All rights reserved.

Understanding Content Collection and Indexing

The ability to find information is important for myriad reasons. Time spent looking for information is time we cannot spend on other tasks. An inability to find information might force us to make an uninformed or incorrect decision. In worse scenarios, an inability to locate information can cause regulatory problems or, in a hospital, lead to a fatal mistake.

Improving information "findability" isn't just a matter of buying search technology. It's a multivariate process, and a key part of enabling search technology to produce the right results at the right time is understanding how the technology collects and indexes content. There are different challenges and approaches to collecting and indexing information.

Content Collection

  1. Within a given search technology, the content collection subsystem can "obtain" content. Existing content must be identified, copied from its location to a processing folder, then moved to a "to be indexed" folder. This is the traditional approach available within most search systems.
  2. The system administrator configures the servers that hold content to run a script that identifies changed or new content, copies that content to a new folder, processes it to some degree, and then "pushes" the processed files to the indexing subsystem. This is an approach supported by systems such as Autonomy's IDOL and Microsoft's FAST.
  3. A script called a spider or crawler visits servers, folders or files on a scheduled basis. When a changed or new document is identified, the script copies the file to the index processing subsystem. This is an approach supported by virtually all search systems today, but it remains a surprisingly complicated exercise. Spiders can experience problems with session variables in URLs, JavaScript, Flash, video and forms. When improperly configured, a spider can chase its own trail through a series of infinitely recursive links.
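To make the spider pitfalls in the third approach concrete, here is a minimal Python sketch, not taken from any particular search product. It assumes a hypothetical set of session parameter names (SESSION_PARAMS) and a caller-supplied get_links function that returns the links found on a page. The sketch strips session variables from each URL and keeps a set of pages already seen, which is one simple way to stop a spider from chasing its own trail.

```python
from collections import deque
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical session-parameter names; real sites vary widely.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid"}


def normalize(url):
    """Drop session parameters and fragments so one page maps to one key."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k.lower() not in SESSION_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path or "/",
                       urlencode(sorted(query)), ""))


def crawl(start, get_links, max_pages=1000):
    """Breadth-first crawl that visits each normalized URL at most once."""
    seen = {normalize(start)}
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(normalize(url))
        for link in get_links(url):
            key = normalize(link)
            if key not in seen:  # already queued or visited: skip it
                seen.add(key)
                queue.append(link)
    return visited


# A tiny in-memory "site" whose pages link back to each other with
# changing session variables, the classic recursive-link trap.
SITE = {
    "http://example.com/": ["http://example.com/a?sid=1", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/?sid=2", "http://example.com/b"],
    "http://example.com/b": ["http://example.com/a?sid=3"],
}


def site_links(url):
    return SITE.get(normalize(url), [])


pages = crawl("http://example.com/", site_links)
```

Without the normalization step, each new sid value would look like a new page and the crawl would stop only at max_pages. A production spider would also need politeness delays, robots.txt handling and change detection, all of which are omitted here.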

Many systems let you mix and match these content collection techniques. When the system offers APIs or toolkit modules, you can develop highly customized content collection and indexing systems. For example, newsfeed content can be collected on a near real-time basis and incorporated into the system using specialized scripts. Today, hybrid content collection techniques are the norm, so you'll want to think about how you might apply them (and at what intervals) as you attempt to make information easier to locate.


Indexing

Indexing includes the processes of identifying the keywords in a document and processing document or content metadata. Once keywords and metadata are identified, pointers are created to them. Metadata is particularly important for indexing non-textual assets, such as audio and video. A few search tools have specialized techniques for processing audio and video, but generally, you'll need to have consistent, explicit metadata around those assets for them to be properly indexed.
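The idea of "pointers" from keywords and metadata to documents can be sketched as a small inverted index. The Python below is an illustration only: the document layout (a dict with "text" and "metadata" keys), the field:term convention and the sample documents are all assumptions made for this example, not the format of any specific search system.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric keywords."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each keyword (and each field:keyword pair) to the documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for term in tokenize(doc.get("text", "")):
            index[term].add(doc_id)
        for field, value in doc.get("metadata", {}).items():
            for term in tokenize(str(value)):
                index[term].add(doc_id)              # findable by plain keyword
                index[f"{field}:{term}"].add(doc_id)  # and by fielded query
    return index


def search(index, query):
    """Return the documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))


# Hypothetical content: the video has no text, only explicit metadata.
DOCS = {
    "memo.txt": {"text": "Quarterly budget review"},
    "townhall.mp4": {"text": "", "metadata": {"title": "Budget town hall",
                                              "speaker": "CFO"}},
}

index = build_index(DOCS)
budget_hits = search(index, "budget")       # both the memo and the video
cfo_hits = search(index, "speaker:cfo")     # only the video
```

The video file is found only because someone supplied a title and a speaker for it, which is the point the paragraph above makes about non-textual assets.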

Linguistic, Semantic or Natural Language Processing

This means that the system tries to interpret a document, not just look for words and phrases. Approaches include technology that "reads" and tries to understand a document in a way that allows the software to assign index terms, prepare a short, professionally written paragraph, or recognize a key date in a document and mark that date with a tag, just as a human indexer might. Though results have improved in recent years, these approaches are relatively immature.
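As a rough illustration of the "recognize a key date and mark it with a tag" task, the Python sketch below uses plain pattern matching. This is a deliberately simple stand-in, not real linguistic processing: it recognizes only dates written like "March 15, 2020", and the date tag format is a hypothetical choice made for this example.

```python
import re
from datetime import date

MONTH_NAMES = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

# Matches dates such as "March 15, 2020"; other formats are ignored.
DATE_RE = re.compile(r"\b(" + "|".join(MONTH_NAMES) + r") (\d{1,2}), (\d{4})\b")


def tag_dates(text):
    """Wrap each recognized date in a hypothetical <date> tag with an ISO value."""
    def replace(match):
        month = MONTH_NAMES.index(match.group(1)) + 1
        try:
            value = date(int(match.group(3)), month, int(match.group(2)))
        except ValueError:
            return match.group(0)  # e.g. "February 30": leave untouched
        return f'<date value="{value.isoformat()}">{match.group(0)}</date>'
    return DATE_RE.sub(replace, text)


tagged = tag_dates("The contract expires on March 15, 2020.")
```

A human indexer would also catch "the ides of March next year" or "15/3/20"; closing that gap is where the more ambitious linguistic and semantic techniques come in.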

Classification, Taxonomies, and Ontologies

The thrust of these features is twofold. First, the system automatically places each indexed document in a hierarchical tree of categories. Later on, when the results are presented, users not wanting to use a search-box query can browse a list of categories and drill down through deeper lists, where available.
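The placement and drill-down steps described above can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the RULES table, the category names and the keyword-overlap scoring are hypothetical, and commercial classifiers use far richer methods.

```python
# Hypothetical classification rules: category path -> trigger keywords.
RULES = {
    ("Finance", "Budgets"): {"budget", "forecast"},
    ("Finance", "Invoices"): {"invoice", "payment"},
    ("HR", "Policies"): {"vacation", "policy"},
}


def classify(text):
    """Pick the category path whose keywords overlap the text the most."""
    words = set(text.lower().split())
    best_path, best_hits = ("Uncategorized",), 0
    for path, keywords in RULES.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best_path, best_hits = path, hits
    return best_path


def new_node():
    return {"docs": [], "children": {}}


def build_tree(docs):
    """Place every document at the node named by its category path."""
    root = new_node()
    for doc_id, text in docs.items():
        node = root
        for name in classify(text):
            node = node["children"].setdefault(name, new_node())
        node["docs"].append(doc_id)
    return root


def browse(root, path=()):
    """Return (subcategories, documents) at the given point in the tree."""
    node = root
    for name in path:
        node = node["children"][name]
    return sorted(node["children"]), node["docs"]


DOCS = {
    "q3.txt": "Q3 budget forecast",
    "inv.txt": "Invoice payment due",
    "leave.txt": "Vacation policy update",
    "misc.txt": "Lunch menu",
}

tree = build_tree(DOCS)
top_level = browse(tree)                         # categories at the root
budgets = browse(tree, ("Finance", "Budgets"))   # drill down two levels
```

A user who never types a query can start at the top-level list and drill down from "Finance" to "Budgets" to reach q3.txt.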

Second, the system provides tools, lists, libraries and administrative functions to support the process of classification. The tools and functions are used to maintain taxonomies, extract ontologies from indexed content or apply some other value-adding process. When the system manages the process, the work costs less than when a human performs it, but the results are not necessarily as accurate.

Metasearch or "Federated" Search

In this scenario, a search and information access system can either access indexes from other search products or trigger those search engines to provide results, which are then aggregated into a single, enterprise-wide set. This sounds quite attractive, especially in an environment where enterprises have licensed multiple search products or employ various content management systems that embed their own search facilities.

In practice, however, comprehensive federated search is quite difficult. As with "regular" enterprise search, security is an important issue, particularly when multiple indexes or engines have different security models. Also, it is extremely hard to deduplicate, merge and provide comprehensive relevancy rankings in a reasonable time frame for the searcher. A bottleneck in one repository can gum up the overall results. Nevertheless, there is promise in federated technologies, especially those that rely on other native search engines to provide the core results and serve as basic aggregators.
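The merge and deduplication problem can be illustrated with a short Python sketch. It assumes each engine returns (document identifier, score) pairs with positive scores on its own scale, and it rescales each engine's scores by dividing by that engine's top score. That is a naive normalization chosen only for illustration; the engine names and identifiers are hypothetical, and security trimming and timeouts for slow repositories are left out entirely.

```python
def normalize_scores(hits):
    """Rescale one engine's scores to the 0..1 range (assumes positive scores)."""
    if not hits:
        return []
    top = max(score for _, score in hits)
    return [(doc, score / top if top else 0.0) for doc, score in hits]


def federate(results_by_engine):
    """Merge per-engine hit lists, keeping the best score for each document."""
    merged = {}
    for engine, hits in results_by_engine.items():
        for doc, score in normalize_scores(hits):
            if doc not in merged or score > merged[doc][0]:
                merged[doc] = (score, engine)
    rows = [(doc, score, engine) for doc, (score, engine) in merged.items()]
    # Highest score first; ties broken by document identifier for stability.
    return sorted(rows, key=lambda row: (-row[1], row[0]))


# Hypothetical results from two engines with incompatible score scales.
RESULTS = {
    "cms": [("doc://a", 12.0), ("doc://b", 6.0)],
    "wiki": [("doc://b", 0.9), ("doc://c", 0.3)],
}

ranked = federate(RESULTS)
```

Even in this toy case the weakness shows: doc://b ranks level with doc://a only because it happens to be the best hit in the wiki engine, which says nothing about how relevant it is across the enterprise.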

Document Warehousing

Processing subsystems in some search products can be configured to create and store versions of the indexed documents in a document warehouse. When a user queries a document or clicks on it from a taxonomy tree, the document is delivered directly from the repository, not from the computer on which the source document resides within the enterprise. The benefit from this "one repository" approach is that monitoring a document's access and delivery becomes easier and faster.

This collection of approaches just scratches the surface of content collection and indexing, and you should take the time to dig deeper. An understanding of the choices, benefits and downsides of each approach will allow you to better tailor your content and tune your technology to boost information "findability."
