The myth of an über repository is finally busted. For a long time, we hoped that there could be one repository that could hold all of our unstructured data, also known as content. Not anymore.
Realization of this fact comes as our view of what content is and what can be done with it is gradually transformed. For a long time, content was defined by what one cannot do with it. We knew that content could not be stored in a relational database. So content had to be stored elsewhere. But where?
The answer appeared simple. If you worked in an enterprise environment and you had content, you stored it in an enterprise content management system. This is where you could consolidate content, and once content was consolidated, it could be controlled. Controlling enterprise content is very important, chiefly for compliance reasons. You do not want your content to be lying around willy-nilly; this can get you into all sorts of trouble.
This consolidated ECM vision promoted the über repository notion. In those early days of ECM, I was often asked if all my content was stored in one repository. If the answer was no, I was given a look that clearly conveyed I was in deep trouble and needed urgent help. Many other people got the same look. This help we received, and there is now an entire separate ECM industry dedicated to managing stuff that cannot be stored in the database.
We are now gradually learning that über repositories do not really work all that well. Yes, content must be controlled, and we definitely need ECM. Control, however, does not necessarily imply consolidation. There are several reasons why consolidating content into one repository may be foolish, impractical, or even impossible.
First, there is just too much of it. The amount of content in the world grows exponentially. In the next two years, we will generate as much content as all of humanity managed to create during the entire history of humankind.
Second, we now know that there are many different types of content. There are documents that you share internally, content that you put on your Web site, high-resolution audio and video content that you stream to TVs or play on iPods, and user-generated content in the form of short videos, photos, tweets, and comments.
Third, we now know that the way people and applications use content can also be dramatically different. In some cases, content is generated at a torrid rate, stored and almost never read. In other cases, content is created once, stored and then sent via millions of concurrent streams to people all over the world. There are an infinite number of variations between these two extremes.
For every different combination of content volume, type and usage pattern, one needs a different content repository. HDFS, for instance, is great at efficiently storing and retrieving very large files. Documentum and SharePoint are good at managing documents in a collaborative environment. Cassandra is good for managing user-generated content where a sustainable rate of updates is much more important than data consistency. When dynamic delivery of content to a Web browser or mobile devices is desired, a content management system such as FatWire or Drupal can be used.
The past few years show that specialized repositories have prospered, with more of them appearing every day, while über repositories that try to be all things to all people have withered.
We accept the diverse multirepository world as a fact. This still leaves us with the problem of managing and controlling enterprise content that is distributed across disparate repositories. Several approaches to solving this problem exist.
One school of thought suggests that even if you cannot consolidate content, you still must be able to consolidate access to this content. First, you deploy a single content integration hub. Then you use adaptors to connect the various content repositories to this hub. Once the repositories are connected, you should be able to access and control content stored in all of them via a single interface exposed by the hub. This is the approach promoted by standards such as JSR-170 and JSR-283.
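The hub-and-adaptor arrangement can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the API of any JSR or product: the names `RepositoryAdapter`, `InMemoryAdapter` and `IntegrationHub` are hypothetical, and the repositories are plain dictionaries.

```python
from abc import ABC, abstractmethod


class RepositoryAdapter(ABC):
    """Hypothetical adaptor contract: one adaptor per connected repository."""

    @abstractmethod
    def read(self, content_id: str) -> bytes:
        ...


class InMemoryAdapter(RepositoryAdapter):
    """Stand-in for a real repository; holds content in a dict."""

    def __init__(self, items: dict[str, bytes]) -> None:
        self._items = items

    def read(self, content_id: str) -> bytes:
        return self._items[content_id]


class IntegrationHub:
    """Single access point: every request passes through this object."""

    def __init__(self) -> None:
        self._adapters: dict[str, RepositoryAdapter] = {}
        self.requests_served = 0

    def connect(self, name: str, adapter: RepositoryAdapter) -> None:
        self._adapters[name] = adapter

    def read(self, repository: str, content_id: str) -> bytes:
        self.requests_served += 1  # all traffic funnels through the hub
        return self._adapters[repository].read(content_id)


hub = IntegrationHub()
hub.connect("documents", InMemoryAdapter({"contract-1": b"signed contract"}))
hub.connect("media", InMemoryAdapter({"clip-7": b"video bytes"}))

print(hub.read("documents", "contract-1"))
print(hub.read("media", "clip-7"))
print(hub.requests_served)  # 2
```

Note that the counter lives in one place: whatever the access pattern, every read goes through the same object.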
In practice, access consolidation turns out to be as difficult to implement as content consolidation. The integration hub becomes the single bottleneck that does not support all access patterns equally well. The broadly accepted conclusion now is that distributed content requires distributed access.
Two architectural patterns have become particularly popular for building distributed applications: service-oriented architecture and representational state transfer.
SOA says that the world is populated by services. Each service implements a well-defined interface, also known as the service contract. Applications can discover services. They can also invoke services by sending messages to their so-called endpoints. One can replace service A that has certain characteristics with service B that has a totally different set of characteristics. All that matters is that A and B implement the same service contract. In this sense, applications and services are said to be loosely coupled.
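A small Python sketch may make the loose coupling concrete. It is a simplification under stated assumptions: the contract `ContentService`, the two services and the dictionary used as a registry are all hypothetical, and real endpoints exchange messages over a network rather than in-process method calls.

```python
from typing import Protocol


class ContentService(Protocol):
    """Hypothetical service contract: the only thing applications depend on."""

    def store(self, name: str, data: bytes) -> None: ...

    def fetch(self, name: str) -> bytes: ...


class FastVolatileService:
    """Service A: keeps content in memory and nothing else."""

    def __init__(self) -> None:
        self._items: dict[str, bytes] = {}

    def store(self, name: str, data: bytes) -> None:
        self._items[name] = data

    def fetch(self, name: str) -> bytes:
        return self._items[name]


class AuditedService:
    """Service B: same contract, different characteristics (keeps a write log)."""

    def __init__(self) -> None:
        self._items: dict[str, bytes] = {}
        self.log: list[str] = []

    def store(self, name: str, data: bytes) -> None:
        self.log.append(f"store {name}")
        self._items[name] = data

    def fetch(self, name: str) -> bytes:
        return self._items[name]


# A toy registry standing in for service discovery.
registry: dict[str, ContentService] = {"content": FastVolatileService()}


def publish_report(services: dict[str, ContentService]) -> bytes:
    """Application code: finds the service by name and knows only the contract."""
    service = services["content"]
    service.store("q3-report", b"quarterly numbers")
    return service.fetch("q3-report")


print(publish_report(registry))
registry["content"] = AuditedService()  # swap A for B; the application is unchanged
print(publish_report(registry))
```

The function `publish_report` never learns which implementation it is talking to, which is the point of the contract.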
From the REST point of view, the world consists of resources. Each resource is identified by a uniform resource identifier (URI). Resources can be subject to a few well-defined operations, such as GET, PUT, POST, DELETE, and possibly others. These operations are interpreted by servers. Applications refer to resources via URIs and may not be aware of their physical location. HTTP is often the protocol used to submit requests to servers and receive responses.
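To illustrate, here is a toy server in Python that interprets operations such as GET, PUT and DELETE on URIs. It is a sketch, not a real HTTP server: `ResourceServer` and its `handle` method are hypothetical names, resources live in a dictionary, and the status codes follow common HTTP conventions.

```python
class ResourceServer:
    """Toy server that interprets a few uniform operations on URIs."""

    def __init__(self) -> None:
        self._resources: dict[str, bytes] = {}

    def handle(self, method: str, uri: str, body: bytes = b"") -> tuple[int, bytes]:
        """Return an HTTP-style (status code, body) pair."""
        if method == "PUT":
            created = uri not in self._resources
            self._resources[uri] = body
            return (201 if created else 200, b"")
        if method not in ("GET", "DELETE"):
            return (405, b"")  # operation not supported by this sketch
        if uri not in self._resources:
            return (404, b"")  # no such resource
        if method == "GET":
            return (200, self._resources[uri])
        del self._resources[uri]
        return (204, b"")


server = ResourceServer()
uri = "/assets/logo.png"
print(server.handle("PUT", uri, b"png bytes"))  # (201, b'')
print(server.handle("GET", uri))                # (200, b'png bytes')
print(server.handle("DELETE", uri))             # (204, b'')
print(server.handle("GET", uri))                # (404, b'')
```

The application only ever names a URI and an operation; where the bytes are kept is the server's business.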
An application conforming to either SOA or REST principles can be used to effectively manage and control distributed content while requiring neither repository nor access consolidation.
A content repository can publish a service endpoint. This endpoint can be used by an application to manage content. Later this repository can be replaced with another repository conforming to the same service contract. The runtime characteristics of the application may change, but not its interface. With this approach, an application can also access multiple repositories, each with a different set of characteristics, in a consistent manner.
Similarly, digital assets can be mapped to REST resources that are managed by servers that map to content repositories. The location and type of the repositories are transparent to the application that is accessing the resources. Different repositories can be used to manage different types of content and to implement efficient support for the various access patterns required by the application: read mostly, write mostly, large file streaming and others.
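The mapping from resources to repositories can be sketched as a routing table keyed by URI prefix. The names below (`Repository`, `ResourceRouter`, the `/video/` and `/logs/` prefixes) are hypothetical. In a real deployment this mapping would be spread across Web servers rather than held in one in-process object.

```python
class Repository:
    """Stand-in for a real repository; `kind` labels what it is tuned for."""

    def __init__(self, kind: str) -> None:
        self.kind = kind
        self.items: dict[str, bytes] = {}


class ResourceRouter:
    """Maps URI prefixes to repositories so that callers only ever see URIs."""

    def __init__(self, routes: dict[str, Repository]) -> None:
        self._routes = routes

    def _locate(self, uri: str) -> Repository:
        # The longest matching prefix wins.
        matches = [prefix for prefix in self._routes if uri.startswith(prefix)]
        if not matches:
            raise KeyError(uri)
        return self._routes[max(matches, key=len)]

    def put(self, uri: str, data: bytes) -> None:
        self._locate(uri).items[uri] = data

    def get(self, uri: str) -> bytes:
        return self._locate(uri).items[uri]


video = Repository("large file streaming")
logs = Repository("write mostly")
router = ResourceRouter({"/video/": video, "/logs/": logs})

router.put("/video/launch.mp4", b"video bytes")
router.put("/logs/today", b"event")

print(router.get("/video/launch.mp4"))
print(sorted(video.items), sorted(logs.items))
```

The calling code uses the same `put` and `get` for both URIs, while each asset ends up in the repository suited to its access pattern.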
The content management industry has enthusiastically embraced this vision of distributed and diverse content management. IBM, EMC and Microsoft bootstrapped an effort to develop a new standard for content interoperability based on these principles. The standard is called CMIS, which stands for Content Management Interoperability Services. It is currently undergoing active review and development in OASIS and is expected to be finalized within the next six to eight months. The number of companies participating in this work has grown dramatically.
CMIS defines a unified model for describing content resources and repositories that manage them. It also offers bindings of this model to the SOA and REST architectural patterns. There is a lot to like in this new standard.
To users of a Web content management (WCM) product, it promises transparent access to content stored in the various content repositories directly from a Web site authoring tool. An image can be placed on a Web page and served to billions of Web site visitors regardless of where the original copy of this image was physically stored: in a document management system, in a collaboration workspace or in a digital asset management system. In this case, a WCM system acts as a CMIS client with transparent access to disparate content repositories deployed in the enterprise.
A WCM can also be a CMIS server. This is particularly useful when content published on a Web site is syndicated to other Web properties and applications. With CMIS, every image, video and news story published on a Web site becomes a Web resource that can be found, retrieved, manipulated and reused in numerous gadgets and mashups. This is where the distributed nature of the standard and its native support for the popular Web protocols are especially valuable.
I encourage every Web application developer and every enterprise architect to read up on CMIS and do some experimentation. An early draft of the standard has been posted for public review. There are a number of open source implementations available. Support for the standard is also likely to start popping up in a fair number of established enterprise content management products in the near future.