(Part one of a two-part article)

Recently we introduced the phrase “extreme archiving.” It may sound like a nice marketing term, but what it actually describes is the growing set of extremes you must consider when managing structured and unstructured data.

Here are four reasons to consider extreme archiving today:


1. Increasingly strict regulations

Highly regulated industries such as financial services, healthcare and life sciences face increasingly strict regulations around data collection.

For example, Dodd-Frank in the U.S. and MiFID II in the EU require financial services firms that provide clients with financial instruments such as shares and bonds to collect all transactional and communications data, including email, instant messaging threads, documents and voice recordings, in an archive that must be indexed and discoverable within a reasonable timeframe.

Banks in the U.S. alone have faced hundreds of billions of dollars in fines for not being compliant with Dodd-Frank, and they are still struggling to keep up with ever-increasing data volumes and the cost of maintaining archives.


2. Exploding data volumes

Data volumes within the enterprise are exploding, and managing volume at this scale leads to many challenges.

Imagine a global bank with 200,000 employees falling under SEC regulations, which require emails to be retained for seven years. Assuming an average of 200 emails per employee per day, archiving email alone leads to 7 x 365 x 200,000 x 200 ≈ 1E11 emails under management. Average email sizes, largely due to attachments, are about 200KB, leading to a raw data volume of 1E11 x 200,000 = 2E16 bytes, or 20PB of data.
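As a sanity check, the back-of-envelope estimate above can be reproduced in a few lines (all figures are the assumptions stated in the text, not measurements):

```python
# Capacity estimate for a regulated email archive,
# using the article's assumptions.

EMPLOYEES = 200_000          # global bank headcount
EMAILS_PER_DAY = 200         # average emails per employee per day
RETENTION_DAYS = 7 * 365     # SEC seven-year retention
AVG_EMAIL_BYTES = 200_000    # ~200KB average, driven by attachments

emails_under_management = EMPLOYEES * EMAILS_PER_DAY * RETENTION_DAYS
raw_bytes = emails_under_management * AVG_EMAIL_BYTES

print(f"{emails_under_management:.1e} emails")       # ~1.0e+11
print(f"{raw_bytes / 1e15:.1f} PB of raw data")      # ~20.4 PB
```

The exact figure comes out slightly above 1E11 emails and 20PB; the article rounds down.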

Couple this with instant messaging, which has grown exponentially over the last couple of years. Instant messages may not be as large as emails, but in a regulated industry each message is considered a record and must be saved, including its metadata, and the metadata can be larger than the message itself.
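To make the metadata point concrete, here is a sketch of what an archived IM record might look like. The field names and values are hypothetical, not a real archive schema; the point is that the compliance metadata easily outweighs the message body:

```python
import json

# Hypothetical archived IM record (illustrative fields, not a real schema):
# in a regulated archive each message is a record, and the compliance
# metadata often exceeds the size of the message body itself.
record = {
    "body": "ok, send the term sheet",
    "metadata": {
        "message_id": "im-000001",
        "thread_id": "thr-4711",
        "sender": "trader.a@bank.example",
        "recipients": ["trader.b@bank.example"],
        "sent_at": "2017-03-01T14:05:22Z",
        "retention_class": "SEC-17a-4",   # drives the seven-year hold
        "legal_hold": False,
        "ingest_checksum": "sha256:9c56cc51b374c3ba",
    },
}

body_bytes = len(record["body"].encode("utf-8"))
meta_bytes = len(json.dumps(record["metadata"]).encode("utf-8"))
print(body_bytes, meta_bytes)  # the metadata dominates the payload
```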

You can see how enterprises can easily have tens of petabytes of unstructured data and hundreds of billions of transactions, and that these transactions may need to be retained for long periods of time due to the increasingly strict regulations mentioned above.

In financial services, retention periods vary between seven and 12 years, depending on the country and data type. In industries such as healthcare and defense, this can easily go up to 100 years. Consider the extreme impact of petabytes of data for hundreds of years.


3. Archive footprint and cost

Although somewhat counterintuitive given the previous point, the size (cost), or footprint, of an archive solution is important. A lightweight architecture is becoming increasingly important as new privacy laws force global enterprises to set up highly distributed archives with geo-fencing of data in specific countries.

Where in the past you could run a global archive with only three server locations (e.g. Americas, EMEA and APJ), customers now need many instances in different countries. The cost of managing such a global archive rises significantly unless your system is lightweight and easy to deploy, with the ability to handle elastic scaling.
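The geo-fencing requirement can be sketched as a routing rule: every record is written only to an archive instance inside its own jurisdiction, and there is no fallback region. The country codes and instance names below are hypothetical:

```python
# Sketch of geo-fenced routing (hypothetical regions and instance names):
# a record must land in an archive instance within its own jurisdiction,
# so regulated data never leaves the country that governs it.

ARCHIVE_INSTANCES = {
    "DE": "archive-frankfurt",
    "CH": "archive-zurich",
    "SG": "archive-singapore",
    "US": "archive-virginia",
}

def route(record_country: str) -> str:
    """Return the only archive instance allowed to store this record."""
    try:
        return ARCHIVE_INSTANCES[record_country]
    except KeyError:
        # Fail closed: never silently fall back to a default region,
        # since that would move data out of its jurisdiction.
        raise ValueError(f"no compliant archive instance for {record_country!r}")

print(route("DE"))  # archive-frankfurt
```

Each new privacy law adds entries to this map, which is why a heavyweight per-instance footprint multiplies cost so quickly.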


4. Untapped analytics potential

More and more organizations are discovering the untapped potential of their historical data. With new analytics tools and machine learning capabilities, you can discover hidden gems in your archive.

The success of machine learning over unstructured data (text) depends heavily on the quality of the training set. Most machine learning algorithms are trained on Wikipedia, since they need a corpus of 10-100 billion documents to become more accurate. That can be challenging, since the number of documents within even a large enterprise rarely exceeds 10 billion. Training your machine learning algorithm on Wikipedia means it doesn’t contain the corporate DNA, which results in more false positives.

In my opinion, we will soon discover that email and IM archives are the natural source for training machine learning algorithms on unstructured documents, turning your compliant archives into a source of smartness and actionable intelligence.

Extreme archives not only manage complex compliance requirements but also give business users and data scientists access in a controlled and compliant way. Not only do we see a growing need for analytics over historical data, we also see a need to give end users 360-degree views over all data. This significantly changes the access load on your archive. Smart archives can scale dynamically as access patterns change over time.

Next (Extreme) Steps

In my next article, I’ll share four features you need to look for in an extreme archiving solution to address many of these challenges.

(About the author: Jeroen van Rotterdam is chief technology officer, vice president and distinguished engineer in the Enterprise Content Division at EMC)
