I recently heard a worrying statistic: it is predicted that by 2012, the amount of data being stored will double every 11 hours. I have to say that I view this with a healthy dose of scepticism, but whichever way you look at it, there’s still going to be a lot of data around.

Standard approaches to dealing with rapidly expanding data volumes have met with relatively little success. Trying to get people to stop copying files to multiple storage devices, to stop sending files to large distribution lists via email and to use only centralized storage simply does not match people’s own ways of working, so we tend to end up with large amounts of duplicate data being stored.

Data deduplication has been around for a while, but it has yet to reach a reasonable degree of mainstream use. Even where it is in use, it is often misunderstood and poorly implemented.

So, what can we do with data deduplication? At the most basic level, we can remove all the copies of the files that we have in the system. Here, files are compared against each other, and where there is an exact match, the files are collapsed down to a single physical file, with virtual pointers pointing to this file from other places. This can save the average company somewhere between five and 15 percent of its storage volumes.
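
To make that concrete, here is a minimal sketch of file-level deduplication in Python. It assumes a POSIX-style file system where duplicate copies can be replaced with symbolic links acting as the “virtual pointers” described above; the root path and choice of hash are illustrative only, not any particular product’s implementation.

    # Minimal sketch of file-level deduplication (illustrative, not a product's method).
    import hashlib
    import os

    def file_digest(path, chunk_size=1 << 20):
        """Hash a file's full contents so exact duplicates can be detected."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def deduplicate_tree(root):
        """Collapse exact duplicates under `root` to one physical copy plus pointers."""
        seen = {}  # digest -> path of the single physical copy we keep
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.islink(path):
                    continue  # already a pointer, not a physical copy
                digest = file_digest(path)
                if digest in seen:
                    os.remove(path)                 # drop the duplicate copy
                    os.symlink(seen[digest], path)  # leave a virtual pointer in its place
                else:
                    seen[digest] = path

    if __name__ == "__main__":
        deduplicate_tree("/tmp/shared-drive")  # hypothetical shared storage path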

This is, in reality, only scratching the surface. Let’s take a much deeper look at how an organization deals with its data and identify the sort of savings that are possible.

In creating a document, a user will get it to a level where it becomes a working draft. At this point, it needs to be sent out for review by others. Although workflow and document management systems are well-suited to this, very few documents at the initial draft stage will get into such systems, due to perceptions of high cost and overly formalized work practices.

Instead, the majority of users will tend to use email to send the document to a group of possible reviewers. For the sake of argument, let’s assume that there are four people in this group and that each participates in the review. Unless the review process is tightly controlled, each person will tend to review in isolation, saving a copy of the document to his or her drive and working on it. After the review is complete, each will send the document back to the original owner, who will then aggregate the comments.

So, we start with one document. This document is copied as a single email attachment and, depending on how the organization’s email system works, it will then become four additional documents, one in each recipient’s inbox. There will be four more copies when each person saves the attachment to his or her own file system, and four modified files sent back to the original owner, leaving four more copies in the owner’s inbox. The original owner will then make a copy of each of these and will create a new overall document reflecting the comments made.

In one review, there could be 23 similar or identical versions of the one document. If we have three iterations of the review, we end up with 70 versions of the document being stored in different places around the organization. If we assume a file size of 200KB, we suddenly have 14MB of data being stored.
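
As a rough sketch of that arithmetic (a hypothetical tally that assumes each review round adds the same 23 copies on top of the original document):

    # Back-of-the-envelope arithmetic for the review scenario above.
    copies_per_review = 23        # similar or identical versions created by one review round
    iterations = 3
    file_size_kb = 200

    total_versions = 1 + copies_per_review * iterations   # original plus three rounds = 70
    total_kb = total_versions * file_size_kb               # 14,000 KB, roughly 14MB of storage
    print(total_versions, total_kb)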

But if block-level data deduplication is used, the content of files is compared at a more granular level, with blocks of data compared against each other. Therefore, if there are two documents where only some text has been changed, only those changes will be physically stored, along with a small amount of data that describes how the actual document needs to be rebuilt.
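
Here is a minimal sketch of the idea in Python, assuming fixed-size 4KB blocks and an in-memory block store. Production systems typically use content-defined chunking and persistent indexes, but the principle is the same: unchanged blocks are stored once, and each file keeps only a small “recipe” describing how to rebuild itself.

    # Minimal sketch of fixed-size block-level deduplication (illustrative assumptions).
    import hashlib

    BLOCK_SIZE = 4096
    block_store = {}   # digest -> block bytes (each unique block stored once)

    def store_file(data: bytes):
        """Split `data` into blocks, keep only unseen blocks, return a rebuild recipe."""
        recipe = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            block_store.setdefault(digest, block)  # store the block only if it is new
            recipe.append(digest)
        return recipe

    def rebuild_file(recipe):
        """Reassemble the original bytes from the stored blocks."""
        return b"".join(block_store[d] for d in recipe)

    # Two drafts that differ only by a short in-place edit share almost all of their
    # blocks, so the second draft costs little more than one changed block plus its recipe.
    draft_v1 = b"A" * 20000
    draft_v2 = b"A" * 12000 + b"edited paragraph" + b"A" * 7984
    r1, r2 = store_file(draft_v1), store_file(draft_v2)
    assert rebuild_file(r1) == draft_v1 and rebuild_file(r2) == draft_v2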

Block-level deduplication can also work against other types of data, such as image, encrypted, video and voice. An organization can expect to save around 60 to 80 percent of its storage volumes by taking such an approach.

Sound like a silver bullet? In many ways it is, but I also recommend a degree of caution. Unless business continuity is taken into account, data deduplication can lead to massive problems. Let’s assume that there is a disk failure or data corruption somewhere in the system. Many parts of the storage will be made up of data fragments and pointers, and it will be almost impossible to rebuild these so that the data can be successfully rescued. You may have been thanked for cutting down on capital and operating costs for storage while it was all going well; now you’re suddenly the villain of the piece for not foreseeing what could happen.

But if you are saving 70 percent of your storage volumes, mirroring the data still means a saving of 40 percent, and you will have created a solid data business continuity capability at the same time. Data deduplication is not all smoke and mirrors; it really does work, and newer approaches apply the technique across all data stores. Indeed, companies can capture the data before it hits the storage itself, deduplicating it on the fly while enabling intelligent filing. Here, the data can be tagged as it is deduplicated, and the system can then make sure that it is stored in the right place, whether for governance and audit reasons or for cost reasons, using older storage as near-line storage for less important data. Further, as the data is tagged automatically and indexed against the text contained in documents, you get a fully searchable, organization-wide data store.
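
The mirroring arithmetic works out like this, using the 70 percent figure above as an illustrative baseline:

    # Net saving when deduplicated data is fully mirrored for business continuity.
    original = 100.0                      # baseline storage units before deduplication
    after_dedup = original * (1 - 0.70)   # 70 percent saved leaves 30 units
    mirrored = after_dedup * 2            # keeping a complete second copy = 60 units
    net_saving = original - mirrored      # 40 units, i.e. a 40 percent net saving
    print(net_saving)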

If organizations do continue to see storage growth, and if the statement at the beginning of this article is anywhere near the reality, then organizations have to do something. You can try educating the users to be less profligate in their storage use, but it’s probably a lot smarter to take control through automated means and finally grab the problem by the horns.
