In a world where organizations grapple with exponential growth in unstructured content, there is an increasing interest in structured content and its management. This article examines the differences between structured and unstructured content and makes the case for separate but cooperative content management systems. It also identifies a sequence for adopting a structured content environment that should counter any irrational exuberance for the nascent paradigm shift.

Structured Content Examined

Structured content, as a phrase, most often means text and graphics wrapped in eXtensible Markup Language (XML). Strictly speaking, it doesn't have to be XML - any markup language could be the wrapper. But today, XML is the lingua franca of structured content interchange between various systems, so the common definition is sufficient. Within the family of structured content are two subspecies: 1) XML content conforming to a structural specification and 2) any other XML content. The distinction implies that any XML content conforming to rules found in a document type definition (DTD) or an XML schema is truly structured content. Anything else is more or less text and objects (perhaps pages and pages of them) surrounded by one or more XML tags.
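
The distinction between the two subspecies can be illustrated with a minimal sketch. It assumes a hypothetical, hand-written rule table (ALLOWED_CHILDREN) standing in for a real DTD or XML schema, and the element names (topic, title, body, p) are chosen only for illustration. Python's standard library checks well-formedness but does not validate against a DTD, so the rule check here is a simplified stand-in, not a real validator.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule set standing in for a DTD or schema (an assumption for
# illustration): each element name maps to the child elements it may contain.
ALLOWED_CHILDREN = {
    "topic": {"title", "body"},
    "title": set(),
    "body": {"p"},
    "p": set(),
}


def is_well_formed(xml_text):
    """Return True if the text parses as XML at all."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def conforms(xml_text, root_name="topic"):
    """Return True if the XML parses AND obeys the rule table above."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    if root.tag != root_name:
        return False
    stack = [root]
    while stack:
        element = stack.pop()
        allowed = ALLOWED_CHILDREN.get(element.tag)
        if allowed is None:
            return False
        for child in element:
            if child.tag not in allowed:
                return False
            stack.append(child)
    return True


structured = "<topic><title>Install</title><body><p>Run the installer.</p></body></topic>"
loose = "<doc><anything>Pages and pages of text</anything></doc>"

print(is_well_formed(structured), conforms(structured))
print(is_well_formed(loose), conforms(loose))
```

Both samples parse as XML, but only the first obeys the rules. That is the gap between "wrapped in XML tags" and "truly structured content."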

The Darwin Information Typing Architecture (DITA) and DocBook are two examples of standards-based specifications for creating rule-driven structured content. DITA's foundation is topic-style authoring, akin to an application's help file where you select a subject or keyword and look at a step-by-step list of instructions. DocBook's foundation is book based. It has rules for chapters, sections and paragraphs. When writing structured content that conforms to DITA or DocBook, for example, there are restrictions as to what kinds of "elements" can be composed or incorporated at certain points within the text. The rules indicate the relationship between various elements and give a very good idea about how elements can be reused in other places. In this way, DITA and DocBook - and other structured content standards - promote content reuse by allowing pages of text to be broken into self-contained sentences, paragraphs or chapters that can be used in any number of finished documents.
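
As a rough illustration of the reuse idea, the sketch below keeps self-contained fragments in one place and builds finished documents by reference. The fragment identifiers, document names and helper functions are all hypothetical; neither DITA nor DocBook prescribes this data model.

```python
# Hypothetical fragment store: each self-contained piece of content has an id.
fragments = {
    "warn-power": "Disconnect power before servicing.",
    "step-open": "Open the access panel.",
    "step-close": "Close the access panel.",
}

# Finished documents are ordered lists of fragment ids, not copies of text.
documents = {
    "install-guide": ["warn-power", "step-open", "step-close"],
    "service-manual": ["warn-power", "step-open"],
}


def assemble(doc_name):
    """Build the finished text of a document from its referenced fragments."""
    return " ".join(fragments[fid] for fid in documents[doc_name])


def where_used(fragment_id):
    """List every document that reuses the given fragment."""
    return sorted(name for name, ids in documents.items() if fragment_id in ids)


print(assemble("service-manual"))
print(where_used("warn-power"))
```

Because documents hold references rather than copies, a correction to one fragment appears in every document that uses it the next time that document is assembled.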

Unstructured Content Examined

The reuse concept isn't practical with unstructured content, however. Unstructured content usually refers to whatever anyone writes, draws or otherwise composes using their own ideas about the rules for comprehensible writing, artistry or composition. This implies that email, text messages, instant messages, Web pages, phone conversations, graphics, word-processing documents and virtually anything that can be composed or assembled according to one's own "rules" is unstructured content. A Microsoft Word 2003 document of text and graphics is a good example of unstructured content. Aside from an author's own opinions and the limitations of the Word program itself, there are no other rules per se about what can be written in Word. To reuse unstructured content, the targeted information must be plucked from within the text and recomposed to fit the context of the destination document. Reusing content from many unstructured documents is very time-consuming.

Choosing the Right Content Management System Solution

As more content-authoring teams consider enterprise content management (ECM) solutions to manage a widening expanse of content, the question of which type of content is produced - structured or unstructured - becomes more central to the decision for selecting a content management system (CMS).

Consider the production of structured content. Structured content authoring forms the basis for high levels of content reuse. It takes less time to locate and reuse high-quality content than it does to create new, high-quality content. Because time affects both cost and schedule, extensive reuse of high-quality content raises overall content output while lowering overall cost. In other words, structured authoring drives down the cost of producing high-quality documentation, but only when the organization reuses content extensively. Naturally, in this case, you would want to select a CMS that promotes reuse through both its user interface and its underlying architecture. This most often means that to manage structured, XML-based content, you need a native-XML CMS.

Of course, someone will ask, "How granular does my object model need to be?" There must be some operations that favor reuse of entire documents and other operations that favor reuse of paragraphs and sentences. Answering this question indicates when a native-XML CMS solution returns the highest value.

For example, in older knowledge management paradigms that focus on publishing, posting or otherwise delivering static content, the granularity requirements are low. A whole document or set of pages from a document is sufficient to support the knowledge management system. Therefore, a document-based CMS is a good fit for these types of solutions. Conversely, progressive organizations managing highly dynamic content - for instance, a company that allows its customer support team to submit customer feedback directly into the product documentation - need a highly granular content model to track content changes at the paragraph or sentence level. Marketing communications departments, for example, would need sentence-level granularity because they constantly snip sentences from different sources to build data sheets, product brochures and similar nonnarrative collateral. They would need to know when those sentences change so that they could update their collateral accordingly. Companies that operate with this kind of efficiency really need an XML-based CMS solution because the architecture enables consistent performance at any level of content granularity. Overall, a decision to adopt structured authoring of dynamic product content augurs a paragraph- and sentence-level native-XML CMS.
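
The marketing communications scenario can be sketched as follows. This is a minimal illustration of sentence-level change tracking, assuming a hypothetical scheme in which each piece of collateral records a fingerprint of every sentence at the moment it is snipped. The identifiers and sample sentences are invented, and a real CMS would track this internally.

```python
import hashlib


def fingerprint(text):
    """Return a stable fingerprint for one sentence-level fragment."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical source sentences, keyed by fragment id.
source = {
    "s1": "The widget supports 10 users.",
    "s2": "Setup takes five minutes.",
}

# A data sheet records the fingerprint of each sentence when it is snipped.
datasheet = {fid: fingerprint(source[fid]) for fid in ("s1", "s2")}


def stale_fragments(collateral, current):
    """Return ids whose source sentence changed (or vanished) since reuse."""
    return sorted(
        fid
        for fid, recorded in collateral.items()
        if fid not in current or fingerprint(current[fid]) != recorded
    )


# The product team later edits one sentence in the source documentation.
source["s1"] = "The widget supports 25 users."

print(stale_fragments(datasheet, source))
```

The finer the granularity of the content model, the more precisely a check like this can tell the department which pieces of collateral to update.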

Authors or systems producing unstructured content - or content wrapped in XML without any governing DTD or schema - are aligned best with a document-based CMS. These CMS solutions offer basic content services: check-in, check-out, file-level versioning, highly effective access control, workflow routing and so on. Also, a document-based CMS is an ideal platform for plugging in a records management option, adding a form-based routing capability (such as moving an insurance claim form through an approval process) or connecting to a traditional data warehousing system. Vendors of these kinds of ECM systems have delivered enormous value to their customers by exploiting unstructured document techniques and technologies.

Beware of vendor hype that treats the ability to store XML content as if it implied support for highly granular content models and high levels of reuse. A document-based CMS simply does not have the architecture to support a structured content paradigm. Such systems are usually based on relational database technology, whose performance decreases exponentially as the number of content elements increases incrementally. Translation: a good fit for document-level and page-level management, and a very poor fit for paragraph- and sentence-level management.

A native-XML CMS has a fundamentally different architecture. A primary characteristic is that performance decreases only linearly as the number of elements increases incrementally. This kind of system is well suited to a structured content paradigm where millions of XML fragments are reused in all kinds of output documents because it suffers very little performance degradation in the process of traversing those fragment relationships. Furthermore, an XML CMS natively maintains content integrity and relationships regardless of the frequency of object reuse, whereas a relational database architecture must map those relationships into intermediary tables that grow exponentially with each new reuse.

Five Steps for Moving to a Structured Content Environment

There is a recommended, sequenced approach in shifting to a structured content environment:

1. Measure your organization's content creation process before adopting structured authoring. Some good metrics are readability (there are automated tools that generate a readability score), the number of factual errors (expressed as bugs against the documentation), and how much time the authors spend producing a document. These indicators are all numeric, so they can feed easily into your own calculations of trends and anomalies. They are also important to executives because they predict both customer satisfaction (derived from readability and accuracy) and cost (derived from time). In addition, they can be measured automatically (it might require a little help from the IT organization to set up automatic collection). Ideally, these numbers would go into a dashboard and be tracked with sufficient frequency to see the needles move as structured content authoring is incorporated into the business process.

2. "Up-skill" the staff. The authoring team will need a survey course in structured authoring (e.g., a seminar on DITA) followed by detailed training on an XML authoring tool. A few days on minimalism are also in order because content reuse is easier when the source text is concise. This training should precede any further steps toward adopting structured authoring or selecting a CMS solution. It takes time to acquire the discipline of minimalism, and it takes effort to develop the habits of content reuse. Also, structured authoring tools are not one-size-fits-all. A tool optimized for DocBook does not always work best with DITA, for instance. So pay some extra money to have a consultant help you select the right structured authoring tool. Once the tools and skills are in place, the selection of a structured content CMS vendor will be incalculably easier.

3. Choose a CMS optimized for structured content reuse and get the CMS vendor to agree in writing that it will integrate with your corporate ECM solution (if you have one). This point cannot be overstated. Structured and unstructured content live side by side in every organization, and the systems for managing the former are not that good at managing the latter. By choosing a solution that coexists with your ECM hardware, software and derived processes, you endorse the ECM's strengths and acknowledge its weaknesses by introducing a native-XML CMS. You get the best of both worlds with two systems working together - ideally through Web-based services - and you leverage a substantial investment that is already valuable to your company. If you must replace your ECM, never violate this dictum: XML fragments work best with native-XML content management systems. Anything else is a compromise, which will be expressed in disappointing performance metrics.

4. After six months of structured authoring, compare your organization's performance measurements before and after adopting structured authoring. Recent survey data indicates that best-in-class companies are 46 percent more likely to author structured documentation (see note 1). These companies meet documentation deadlines 92 percent of the time on average, take half as long as some others to translate product documentation content, and make two-thirds fewer documentation changes after a product release. Not surprisingly, best-in-class companies identify content reuse as a measure of readability. In general, simple metrics that are easily measured and tracked over time are compelling evidence of the payoff from structured content authoring.

5. Adopt standard DTDs and XML schemas. In the days when structured authoring was first practical to implement, the only option for developing a set of content rules was to build one. Large organizations, such as governments, military contractors and multinational corporations, could afford to take 18 months for this exercise, absorbing many missteps along the way. Today, DocBook, DITA and similar standards take much of the effort and guesswork out of building those rules. Furthermore, these standards are available for free or at nominal cost. They represent the accumulated experience of hundreds of structured content professionals and have been tested widely across disparate industries. Save yourself a lot of headaches and money - go with a standard DTD or XML schema for structured content.
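
The measurements in steps 1 and 4 can be sketched in a few lines. The example below uses the Flesch Reading Ease formula as one possible readability score; the article does not name a specific tool or formula, so that choice is an assumption. The syllable counter is a crude heuristic (it counts runs of vowels), and the sample figures are invented for illustration.

```python
import re


def count_syllables(word):
    """Crude heuristic: count runs of vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores indicate easier reading."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        raise ValueError("text contains no words")
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))


def percent_change(before, after):
    """Relative change of a metric between two measurement periods."""
    return (after - before) / before * 100.0


simple = "The cat sat. The dog ran."
dense = "Organizational documentation necessitates comprehensive standardization."

print(round(flesch_reading_ease(simple), 2))
print(flesch_reading_ease(simple) > flesch_reading_ease(dense))

# Invented example: authoring hours per document before and after adoption.
print(percent_change(40, 30))
```

Scores like these are only useful as trends, so the same formula and the same collection method should be used for the baseline and for the six-month comparison.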

As content environments evolve into a commingling of structured and unstructured content, evaluating a hybrid content management approach makes the most sense. For those organizations that adopt structured content creation to take advantage of the benefits of highly granular reuse, a native-XML solution is the best fit. Likewise, a document-based CMS is well suited to managing content with low levels of reuse. Nevertheless, always manage structured and unstructured content within a conjoined technology framework.


  1. "The Next Generation Product Documentation Report: Getting Past the 'Throw it over the Wall' Approach." Aberdeen Group, December 2006.
