New York Times online readers see metadata in action every day when they log on to navigate the Times Topics. By combining mostly automatic metadata generation with manual assignment of metatags, many large publishing companies like The New York Times Company have gotten a handle on the massive amounts of content that pass through their content management systems each day. During the course of a day, hundreds of beat reporters file stories on various topics such as the U.S. military, foreign affairs, arts and entertainment, sports and more. It would be extremely difficult to manually tag each article that is filed by a reporter at a large publisher. By deploying a semantic system that understands the relationships between words, publishers automate most of the tagging and use human checks as a secondary layer for accuracy. Similarly, enterprises can take a cue from the way in which publishers assign metadata, because information flows from multiple business units and requires uniformity across the organization.


The New York Times has a novel approach to achieving uniformity across all of its content, which takes metatagging a step further. By building in topics for its readers and subdividing these topics by major figures in the articles, historical context, companies and events, the publisher makes it simple for end users to find what they want. This, in turn, drives more users to the site and makes the offerings more attractive to advertisers that are flocking to new media as an alternative to traditional print advertising.


Let’s take a sports article on the U.S. Open as an example. Since there are two different events that take place under the title of “U.S. Open,” first the metatagging system must determine whether the article is referring to golf or tennis. This is fairly simple and determined by the presence of the word “golf” or “tennis” in the article, which is then tagged automatically by the metadata management system. Then, the article must be further categorized by the date, major figures in the tournament, previous winners and more. If Tiger Woods were leading the U.S. Open, the system would tag his name, along with a term such as “leader” to drive that article to the top of the list when an end user queries a phrase such as “2007 U.S. Open leader.” If searchers did not define whether they’re looking for golf or tennis, they’d receive results on both sports separated by topic. Then, they could narrow down the search results and find the exact news article they’re looking for. Similarly, within an enterprise, multiple product lines can be automatically tagged and categorized, easing the enterprise search process.
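The keyword rule described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names, the sample titles and the one-word keyword sets are hypothetical, and a production system would rely on a richer semantic model than a word lookup.

```python
import re

# Hypothetical keyword sets, limited to the two words named in the text.
SPORT_KEYWORDS = {
    "golf": {"golf"},
    "tennis": {"tennis"},
}


def tag_us_open_article(text):
    """Return the sport tags whose keywords appear as whole words in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {sport for sport, keywords in SPORT_KEYWORDS.items() if words & keywords}


def search(articles, query_sport=None):
    """Group article titles by sport tag, optionally narrowed to one sport."""
    results = {}
    for title, text in articles.items():
        for sport in sorted(tag_us_open_article(text)):
            if query_sport is None or sport == query_sport:
                results.setdefault(sport, []).append(title)
    return results
```

Calling search without a sport returns titles grouped under both topics, which mirrors the case of a searcher who did not specify golf or tennis; passing a sport narrows the results to that topic.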


Taking the Mainstream In House


The most important theme in the publishing example above is bringing structure to otherwise unstructured data. The same problem exists in the enterprise. In a step-by-step approach, knowledge workers can help enterprises maintain uniformity and ease the overall search and retrieval process. Deploying these processes can address the organization’s need to:

  1. Keep internal documents up to date and uniform across the enterprise.
  2. Drive more traffic to customer-facing Web sites.
  3. Reduce frustration of employees searching for enterprise documents.
  4. Manage the deletion/retention of documents as content is merged across different sources (people, divisions or even companies).
  5. Compare similar documents as content is created by different sources (reduce duplication).

The first step toward organizing enterprise data involves the tagging of each document with respect to the metadata or terms that are relevant to the enterprise. At first thought, the tags associated with the documents can be created in an unmanaged way or organically created by each of the document’s editors.


Manual tagging may have some appeal to the enterprise because it does not require any controlled environment; however, it yields very low recall. There is no guarantee that documents associated with a tag are also associated with other closely related tags. For example, in such an organically grown set of tags, the term “managed healthcare maintenance organization” may not be associated with the tag “HMO,” and, therefore, documents associated with one tag will not be associated with the other, yielding poor recall in enterprise searches. Although low recall may be tolerable in Internet applications or popularity-driven media like del.icio.us, it is not acceptable within an enterprise, where both the recall and the precision of document retrieval are important.
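The effect on recall can be made concrete with a small Python sketch. It assumes a hypothetical two-entry synonym table and two sample documents; the names CANONICAL, normalize, retrieve and recall are illustrative and do not come from any particular product.

```python
# Hypothetical synonym table mapping free-form tags to one controlled term.
CANONICAL = {
    "hmo": "HMO",
    "managed healthcare maintenance organization": "HMO",
}


def normalize(tag):
    """Map a tag to its controlled term, or return it unchanged if unknown."""
    cleaned = tag.strip()
    return CANONICAL.get(cleaned.lower(), cleaned)


def retrieve(doc_tags, query, use_vocabulary):
    """Return sorted ids of documents whose tags match the query."""
    key = normalize(query) if use_vocabulary else query
    hits = []
    for doc_id, tags in doc_tags.items():
        candidates = {normalize(t) for t in tags} if use_vocabulary else set(tags)
        if key in candidates:
            hits.append(doc_id)
    return sorted(hits)


def recall(retrieved, relevant):
    """Fraction of the relevant documents that were actually retrieved."""
    return len(set(retrieved) & set(relevant)) / len(relevant)
```

With one document tagged “HMO” and another tagged with the long form, an exact-match search for “HMO” finds only the first, for a recall of 0.5. Normalizing both tags and the query through the shared table finds both, for a recall of 1.0.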


Therefore, there is a need for a set of tags that are managed and controlled centrally at the enterprise-wide level. Specialized tools for term (ontology) management and taxonomy management are critical to the success of the implementation of enterprise-wide metadata. Such tools allow a knowledge officer to maintain, test and deploy a set of metadata tags, which need to be imposed on all documents within the enterprise, in parallel to or in conjunction with an enterprise-wide content management system.
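As a rough illustration of what central control means in practice, the Python sketch below checks proposed tags against an approved list and looks up broader terms. The TAXONOMY contents and both function names are hypothetical; commercial term and taxonomy management tools offer far more, such as versioning, testing and deployment.

```python
# Hypothetical centrally managed vocabulary: approved tag -> broader parent term.
TAXONOMY = {
    "Healthcare": None,
    "HMO": "Healthcare",
    "Sports": None,
    "Golf": "Sports",
    "Tennis": "Sports",
}


def validate_tags(tags):
    """Split proposed tags into approved and rejected lists."""
    approved = [t for t in tags if t in TAXONOMY]
    rejected = [t for t in tags if t not in TAXONOMY]
    return approved, rejected


def broader_terms(tag):
    """Walk up the taxonomy from a tag to its top-level term."""
    chain = []
    parent = TAXONOMY.get(tag)
    while parent is not None:
        chain.append(parent)
        parent = TAXONOMY.get(parent)
    return chain
```

Rejected tags could be routed to the knowledge officer for review rather than silently dropped, so that the vocabulary grows in a managed way.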


When the set of metadata tags has been defined by a metadata management tool, newly created documents go through the step of automatic metadata creation. The creation of metadata involves the use of a metadata generation server program that can be accessed via various programming interfaces such as a Java API or SOAP APIs. The metadata generation server API will pull documents from the document source (disk share, document management system or content management system) and produce the metadata automatically for each document. The document metadata can be stored in a metadata repository (a database associating a document identifier with the metadata), stored alongside the document in a document management system, or stored within the document itself in the case of a structured document (for example, XML, SGML, etc.). If the metadata methodology has not been implemented from the beginning, a one-time step of retrospective indexing is required. In this case, all existing documents are processed in a large batch, and metadata is created for each of the documents.
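Two of the storage options above, the separate repository and metadata embedded in a structured document, can be sketched with Python's standard library. The table layout, the element names and the function names are assumptions made for illustration; the metadata generation server and its Java or SOAP interfaces are not modeled here, so the tags are simply passed in.

```python
import sqlite3
import xml.etree.ElementTree as ET


def store_in_repository(conn, doc_id, tags):
    """Repository option: rows associating a document identifier with each tag."""
    conn.execute("CREATE TABLE IF NOT EXISTS metadata (doc_id TEXT, tag TEXT)")
    conn.executemany(
        "INSERT INTO metadata (doc_id, tag) VALUES (?, ?)",
        [(doc_id, tag) for tag in tags],
    )


def tags_for(conn, doc_id):
    """Look up the stored tags for one document, in alphabetical order."""
    rows = conn.execute(
        "SELECT tag FROM metadata WHERE doc_id = ? ORDER BY tag", (doc_id,)
    )
    return [row[0] for row in rows]


def embed_in_xml(xml_text, tags):
    """Structured-document option: write the tags into the XML itself."""
    root = ET.fromstring(xml_text)
    meta = ET.SubElement(root, "metadata")
    for tag in tags:
        ET.SubElement(meta, "tag").text = tag
    return ET.tostring(root, encoding="unicode")
```

The repository keeps documents untouched and makes tag queries cheap, while embedding keeps the metadata with the document wherever it travels. Which trade-off matters more depends on the enterprise's existing content systems.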


Once the initial retrospective indexing has been performed, the normal case of ongoing metadata generation as documents are created is used. In all cases, speed, scalability, accuracy and ease of maintenance of the metadata programs chosen are key criteria for choosing the right technology for your enterprise.
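The two modes, a one-time batch over the archive followed by tagging each new document as it arrives, might look like the Python sketch below. The generate_metadata function is a hypothetical stand-in for a real metadata generation server, and the in-memory dictionary stands in for a metadata repository.

```python
def generate_metadata(text):
    """Stand-in for a metadata generation server: simple substring matching."""
    vocabulary = ("golf", "tennis")  # hypothetical controlled terms
    lowered = text.lower()
    return [term for term in vocabulary if term in lowered]


class MetadataIndex:
    def __init__(self):
        self.repository = {}  # doc_id -> list of tags

    def index_document(self, doc_id, text):
        """Ongoing case: tag one document as it is created."""
        self.repository[doc_id] = generate_metadata(text)

    def retrospective_index(self, documents):
        """One-time batch over existing documents; returns how many were processed."""
        pending = {d: t for d, t in documents.items() if d not in self.repository}
        for doc_id, text in pending.items():
            self.index_document(doc_id, text)
        return len(pending)
```

Skipping documents that already have metadata means the batch can be safely rerun after an interruption, which matters when the archive is large.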


Metadata in the Limelight – Tagging, Social Media and Motivations for the Enterprise


Often within the enterprise, the question is – who will push the adoption of text analytics and business intelligence tools? The answer, in many cases, is the Generation Y population. As this younger generation moves into the workforce, mainstream technologies they deploy in their everyday lives will be looked at as a priority for enterprises. Instant information access has become a societal norm for Web surfers, and enterprises will soon find themselves being held to similar standards. Here are a few examples of metadata in the limelight.


Social bookmarking sites like del.icio.us allow users to tag and index online content as they see fit, putting the control into the hands of the masses and allowing them to drive traffic to the content that they find the most interesting. Many blogs and news Web sites allow users to automatically add their content to del.icio.us in hopes that it will reach a large number of people in a short amount of time. This is all done through the assignment of tags and the sorting of articles into categories or topics. The main difference from the publishing example is that del.icio.us stories are ranked by popularity within a category, since a major function of social bookmarking is to drive public opinion to the forefront.


Tagging in the del.icio.us arena has occurred very organically and spread quickly. Blogs, in turn, have adopted tagging to organize posts, drive topic-based search traffic to their sites and link to other popular blogs.


The mainstream adoption of metadata generation and text analytics tools can be considered an offshoot of the pop-culture need to get a handle on massive amounts of information. Taking lessons from high-traffic Web sites like The New York Times and del.icio.us can never hurt. Ultimately, if, as businesses, we are able to provide our customers and employees with the information they need in a timely manner, then we have done our jobs properly.

