MAY 6, 2009 3:20am ET


Too Much Data? Deal With It ...


I recently heard a worrying statistic: it is predicted that by 2012, the amount of data being stored will double every 11 hours. I have to say that I view this with a healthy dose of scepticism, but whichever way you look at it, there’s still going to be a lot of data around.

Standard approaches to dealing with rapidly expanding data volumes have met with relatively little success. Trying to get people to stop copying files to multiple storage devices, to stop sending files to large email distribution lists and to use only centralized storage simply does not match the way people work, so we tend to end up storing large amounts of duplicate data.

Data deduplication has been around for a while, but it has yet to reach a reasonable degree of use in the mainstream. Even where it is in use, it is often misunderstood and poorly implemented.

So, what can we do with data deduplication? At the most basic level, we can remove all the copies of the files that we have in the system. Here, files are compared against each other, and where there is an exact match, the files are collapsed down to a single physical file, with virtual pointers referring to this file from the other locations. This can save the average company somewhere between 5 and 15 percent of its storage volumes.
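As a rough illustration of the file-level approach just described, the sketch below collapses byte-identical files into one stored copy and keeps a pointer for each original location. It assumes that comparing SHA-256 digests is an acceptable stand-in for comparing files directly; the function name `dedupe_files` and the sample paths are hypothetical and not taken from any particular product.

```python
import hashlib


def dedupe_files(files):
    """Collapse byte-identical files into one stored copy plus pointers.

    `files` maps a path to its content (bytes). Returns (store, pointers):
    store maps a content digest to the single physical copy, and pointers
    maps every original path to the digest of its content.
    """
    store = {}      # digest -> the one physical copy that is kept
    pointers = {}   # path   -> digest (the "virtual pointer")
    for path, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        store.setdefault(digest, content)  # keep only the first copy seen
        pointers[path] = digest
    return store, pointers


# Hypothetical example: two users hold the same draft, a third holds notes.
files = {
    "alice/report.doc": b"quarterly report draft",
    "bob/report.doc": b"quarterly report draft",
    "carol/notes.txt": b"meeting notes",
}
store, pointers = dedupe_files(files)
print(len(files), "logical files,", len(store), "physical copies")
```

Note that a single changed byte produces a different digest, so this approach only catches exact duplicates.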

This is, in reality, only scratching the surface. Let’s take a much deeper look at how an organization deals with its data and identify the sort of savings that are possible.

In creating a document, a user will get it to a level where it becomes a working draft. At this point, it needs to be sent out for review by others. Although workflow and document management systems are well-suited to this, very few documents at initial draft level will get into such systems, due to perceptions of high cost and overly formalized work practices.

Instead, the majority of users will tend to use email to send the document to a group of possible reviewers. For the sake of argument, let’s assume that there are four people in this group and that each participates in the review. Unless the review process is tightly controlled, each person will tend to review in isolation, saving a copy of the document to his/her drive and working on it. After the review is complete, each will then send the document back to the original owner, who will then aggregate the comments.

So, we start with one document. This document is copied as a single email attachment and, depending on how the organization’s email system works, it will then become four additional documents, one in each recipient’s inbox. There will be four more copies when each person saves the attachment to his/her own file system, as well as four modified files being sent back to the original owner, leaving four more copies in the owner’s inbox. The original owner will then make a copy of each of these and will create a new overall document reflecting the comments made.

In one review, there could be 23 similar or identical versions of the one document. If we have three iterations of the review, we end up with 70 versions of the document being stored in different places around the organization. If we assume a file size of 200KB, we suddenly have 14MB of data being stored.
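The arithmetic behind these figures can be tallied as follows. This is one reading of the walkthrough above that happens to add up to 23; the category labels are my own, and the 70-copy figure is taken from the text as stated (three rounds of 23 gives 69, so 70 looks like a rounded figure).

```python
REVIEWERS = 4  # group size assumed in the example above

# One possible breakdown of the copies created in a single review round.
copies_per_round = {
    "original document": 1,
    "attachment on the outgoing email": 1,
    "attachments in reviewers' inboxes": REVIEWERS,
    "copies saved to reviewers' own file systems": REVIEWERS,
    "modified files sent back": REVIEWERS,
    "returned attachments in the owner's inbox": REVIEWERS,
    "owner's saved copies of the returned files": REVIEWERS,
    "new consolidated document": 1,
}

per_round = sum(copies_per_round.values())  # 23
three_rounds = 3 * per_round                # 69, roughly the 70 quoted

FILE_SIZE_KB = 200
total_mb = 70 * FILE_SIZE_KB / 1000         # 14.0 (decimal megabytes)

print(per_round, three_rounds, total_mb)
```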

But, if block-level data deduplication is used, the content of files is compared at a more granular level, comparing blocks of data against each other. Therefore, if there are two documents where only some text has been changed, only these changes will be physically stored, along with a small amount of data that shows how the actual document itself needs to be rebuilt.
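To make the block-level idea concrete, here is a minimal sketch that assumes fixed-size 4 KB blocks identified by their SHA-256 digests. The `BlockStore` class and its method names are hypothetical, and the fixed-size scheme is a simplification chosen for brevity: it copes well with in-place changes but not with insertions that shift later content, which is why real systems may chunk data differently.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, chosen only for illustration


class BlockStore:
    """Stores each unique block once, plus a per-file rebuild recipe."""

    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.blocks = {}    # digest -> block bytes (stored once)
        self.recipes = {}   # file name -> ordered list of block digests

    def put(self, name, data):
        recipe = []
        for start in range(0, len(data), self.block_size):
            block = data[start:start + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # skip blocks already held
            recipe.append(digest)
        self.recipes[name] = recipe

    def get(self, name):
        # Rebuild the file by following its recipe through the block table.
        return b"".join(self.blocks[d] for d in self.recipes[name])

    def stored_bytes(self):
        return sum(len(block) for block in self.blocks.values())


# A 200 KB draft made of 50 distinct blocks, and a reviewed copy in which
# only one block has been overwritten.
draft = b"".join(bytes([i]) * BLOCK_SIZE for i in range(50))
reviewed = draft[:10 * BLOCK_SIZE] + b"x" * BLOCK_SIZE + draft[11 * BLOCK_SIZE:]

store = BlockStore()
store.put("draft.doc", draft)
store.put("reviewed.doc", reviewed)

logical = len(draft) + len(reviewed)
print(logical, "logical bytes,", store.stored_bytes(), "bytes physically stored")
```

In this example the second file adds one new block plus its recipe, rather than another full 200 KB.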

Block-level deduplication can also work against other types of data – image, encrypted, video and voice. An organization can expect to save around 60 to 80 percent of its storage volumes by taking such an approach.

Sound like a silver bullet? In many ways it is, but I also recommend a degree of caution. Unless business continuity is taken into account, data deduplication can lead to massive problems. Let’s assume that there is a disk failure or data corruption somewhere in the system. Many parts of the storage will be made up of data fragments and pointers, and it will be almost impossible to rebuild these so that data can be successfully rescued. You may have been thanked for cutting down on capital and operating costs for storage while it was all going well. Now you’re suddenly the villain of the piece for not foreseeing what could happen.

But if you are saving 70 percent of your storage volumes, you are storing only 30 percent of the original volume. Mirroring that data doubles it to 60 percent, which still means a saving of 40 percent, and you will have created a solid data business continuity capability at the same time. Data deduplication is not all smoke and mirrors. It really does work, and newer approaches apply it across all data stores. Indeed, some companies capture the data before it hits the storage itself, deduplicating it on the fly while enabling intelligent filing. Here, the data can be tagged as it is deduplicated, and the system can then make sure that it is stored in the right place, whether for governance and audit reasons or for cost reasons, using old
