Recent research shows that one in four companies, regardless of size, will be involved in legal proceedings each year. One in four. The explosion in electronic documents over the last decade, combined with increasingly aggressive legal discovery practices, is creating a mountain of data that needs to be exchanged between parties during this litigation. Obviously, this is a tremendous load on the legal department (and their budgets), but at some point in the process, that mountain of data falls squarely onto the shoulders of IT managers, as corporate legal counsel turns to them and asks, “How do we get a terabyte of data into the courtroom?”


At which point the IT manager gets to smugly answer: you don’t.


Nowadays, of course, it’s possible to bring a terabyte (TB) of data anywhere, and rather easily. While you probably can’t fit a TB in your pocket, chances are you can carry it under your arm or in a briefcase. But that’s not the point. It’s not the cost of the hardware that matters, but the cost of the data; or, more specifically, it’s the cost of the data review.


Many companies that are pulled into e-discovery for the first time have the impression that they need to search, save, gather and bring with them copies of literally everything they have. And where does that data come from? The sources that feed into e-discovery can potentially be any data, electronic or otherwise, that’s owned or controlled by an entity, whether that entity is an individual or a corporation. (These entities are often referred to as custodians in legal circles, as they are in “custody” of the data.) And that data can be anywhere: in the data center, in off-site archives, on desktop hard drives, CDs, flash drives – or even in file cabinets.


In today’s email-centric world, the vast majority of that data resides in employee mailboxes. While corporations often limit the size of an individual mailbox to prevent “packrat-itis” on the server, employees are extremely adept at getting around those limitations by archiving data on the desktop, oftentimes for years. Even if a company has a strong retention policy in place, that policy is rarely enforced in a comprehensive manner. Once a suit is pending, it’s far too late to go back and suddenly “enforce” the policy without landing in serious legal hot water. As a result, if an employee has been at a company for a number of years, they most likely have email from each of those years in their possession – outside the control of the IT group, but still within that group’s responsibility from a legal standpoint.


In most instances, the IT department will have much of this data stored somewhere in an archive. Perhaps it’s a sophisticated and searchable email archiving environment. Or, just as likely, the data is stored on a multitude of archived tapes that have been cut every month, quarter or year. Thus, a single email message from several years back could reside on the email server, on each of the last four daily tapes, on the last weekly tape, on each of the last quarterly tapes and on the preceding several annual tapes, as well as on the desktops of any employees that were involved in the mail chain.


That’s a lot of data. The paranoia that has been drilled into IT groups about the potential for a catastrophic loss of data actually works to their disadvantage because in a litigation setting each one of those copies needs to be identified, reviewed and categorized. This could add up to hundreds of gigabytes of data per employee; consider it the dark side to backup and archiving.


So back to the original question: how do you fit a TB into the courtroom? The true answer is that you don’t need to.


The strategy instead is to effectively cull that information so that the legal decisions regarding what is relevant or not are focused on useful data. In other words, the goal is not just to limit the amount of data, but also to limit the number of decisions that have to be made regarding that data, thus minimizing the time, effort and legal expense around discovery.


In Search of the Duplicate


There are a number of tricks and tools an IT department or forensic consultant can use to identify and procure only the relevant information for review. The central concept in this process is the art of deduplication: finding the one instance of a file per custodian that is appropriate for review. (Note the “per custodian” part of that statement – the importance of a file can vary greatly depending on who had it. Was it just the CFO, or an entire department, for instance?) In short, we want to be able to identify a single email that is in 15 different archives and on a dozen desktops, and review it just once, while ignoring all of the files that have nothing to do with the questions at hand.
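
As a rough illustration of per-custodian deduplication, the sketch below fingerprints each collected item with a hash and keeps only the first instance seen for each custodian. The function names, the sample data and the choice of SHA-256 are assumptions made for this example; commercial tools typically hash selected email fields rather than raw bytes.

```python
import hashlib
from collections import defaultdict


def content_hash(data: bytes) -> str:
    """Fingerprint an item by its bytes (SHA-256 is an assumption here)."""
    return hashlib.sha256(data).hexdigest()


def dedupe_per_custodian(items):
    """Keep the first instance of each distinct item for every custodian.

    items: iterable of (custodian, location, data) tuples.
    Returns {custodian: [(location, digest), ...]}.
    """
    seen = defaultdict(set)
    kept = defaultdict(list)
    for custodian, location, data in items:
        digest = content_hash(data)
        if digest in seen[custodian]:
            continue  # same custodian already has this exact content
        seen[custodian].add(digest)
        kept[custodian].append((location, digest))
    return dict(kept)


# Hypothetical collection: the same message found in several places.
collected = [
    ("cfo", "mail-server", b"Q3 forecast attached"),
    ("cfo", "weekly-tape", b"Q3 forecast attached"),
    ("cfo", "desktop-archive", b"Q3 forecast attached"),
    ("sales-mgr", "desktop-archive", b"Q3 forecast attached"),
    ("sales-mgr", "mail-server", b"Picnic is on Friday"),
]
review_set = dedupe_per_custodian(collected)
for custodian, entries in review_set.items():
    print(custodian, len(entries))
```

Note that the message survives once for the CFO and once for the sales manager: duplicates are collapsed within a custodian, not across custodians, because who held the file is itself part of the evidence.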


The first step in this process is a technique called known file exclusion: using an authoritative source to define files that can safely be ignored. One of the standards in this instance is the U.S. Government’s National Software Reference Library. This database contains the electronic signatures of millions of files that are known to be part of software applications or related sources, such as help files, documentation and executables. In most cases, these files can all be ignored.


The next strategy is to take this same technique and apply it to the files that are specific to a corporation (internal applications, help files, etc.). The IT department can compile its own database of these files to be excluded.
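
Both exclusion passes come down to the same operation: hash every collected file and drop those whose signature appears in a list of known files. The sketch below assumes the reference signatures have already been loaded into Python sets of lowercase SHA-1 hex strings; the file names, contents and helper names are hypothetical, and loading an actual reference data set is left out.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the lowercase SHA-1 hex digest of a file's contents."""
    return hashlib.sha1(data).hexdigest()


def exclude_known_files(files, *known_hash_sets):
    """Drop every file whose hash appears in any of the known-file sets.

    files: {name: bytes}. known_hash_sets: sets of lowercase hex digests.
    """
    known = set().union(*known_hash_sets)
    return {
        name: data
        for name, data in files.items()
        if sha1_hex(data) not in known
    }


# Toy stand-ins for an industry reference list and a company-built list.
reference_hashes = {sha1_hex(b"stock help file contents")}
company_hashes = {sha1_hex(b"internal timesheet app binary")}

files = {
    "winhelp.hlp": b"stock help file contents",
    "timesheet.exe": b"internal timesheet app binary",
    "merger_memo.doc": b"Draft terms for the merger",
}
remaining = exclude_known_files(files, reference_hashes, company_hashes)
print(sorted(remaining))
```

Matching on content hashes rather than file names means a renamed copy of a known executable is still excluded, while a document that merely shares a name with a known file is not.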


So now we’ve eliminated the known industry- and company-specific files from consideration, slowly whittling down the vast mountain of data that will need to be presented to attorneys for review.


From this point forward, the IT group will need to consider specialized e-discovery tools to get the amount of data down even further. One of the most effective techniques these tools can apply is a process known as near-deduplication, basically the art of excluding email chains. In near-deduplication, the tool reviews an email exchange between two or more people, dissects each message instance, and determines if all of the preceding messages in the string are wholly contained in that one email. This process identifies the last message in a given chain of, say, 20 emails and, if that final message contains the entire context of the conversation, excludes the previous 19 messages from review. When applied to long email threads that have a large number of recipients, this technique can be very effective in limiting the amount of data for review.
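
The containment test at the heart of this technique can be sketched in a few lines. The example below is a deliberately simplified stand-in for what commercial tools do: it assumes replies quote earlier messages with ">" markers, normalizes each body, and drops any message whose full text reappears inside a later message in the same thread. All names and the sample thread are hypothetical.

```python
def normalize(body: str) -> str:
    """Strip quote markers, blank lines and case so quoted text compares equal."""
    lines = []
    for line in body.splitlines():
        line = line.lstrip("> ").strip()
        if line:
            lines.append(line.lower())
    return "\n".join(lines)


def contains(outer: str, inner: str) -> bool:
    """True if inner appears in outer as a run of whole lines."""
    return f"\n{inner}\n" in f"\n{outer}\n"


def suppress_contained(thread):
    """Return ids of messages not wholly contained in a later message.

    thread: list of (message_id, body) in chronological order.
    """
    normalized = [(mid, normalize(body)) for mid, body in thread]
    keep = []
    for i, (mid, text) in enumerate(normalized):
        later = normalized[i + 1:]
        if any(contains(other, text) for _, other in later):
            continue  # a later message already carries this one in full
        keep.append(mid)
    return keep


thread = [
    ("m1", "Can we meet Tuesday?"),
    ("m2", "Tuesday works.\n> Can we meet Tuesday?"),
    ("m3", "Booked room 4.\n> Tuesday works.\n> > Can we meet Tuesday?"),
    ("m4", "I can't make it.\n> Can we meet Tuesday?"),
]
print(suppress_contained(thread))
```

Here m1 and m2 are suppressed because m3 carries both in full, while m4 is kept: it branches off the conversation and adds text that appears nowhere else.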


Once the data is as clean as the IT group can reasonably make it, the time is at hand for human legal review. Fortunately, technology can still help at this stage; the focus now turns to helping organize the data for review. The objective is to help the attorneys better understand or recognize the context regarding document contents so that they can make faster and more accurate decisions, thus reducing the total cost of the project.


Reading the Documents


The only way to effectively organize a large body of documents is to read them first. That’s where content analysis comes in: extracting information from a document to determine what the document is about. Of course, software tools don’t know (or care) what any given file is about, but these tools can certainly look at the nouns, noun phrases, etc. and identify the main topics of the document. This allows the software to locate other documents about the same or similar topics and organize them so that the human reviewer making decisions about them is doing so in the most effective way possible.


Thus, the tools will analyze all of the documents in a case to identify how they’re related and organize them for efficient review. Instead of reviewing a sales forecast followed by a software bug report followed by a memo about the company picnic, the attorneys can now review all of the documents about sales or software development or the company picnic. This creates context and, thus, efficiency.
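
To make the idea concrete, here is a toy sketch of topic grouping. It is not how any particular product works: instead of real noun-phrase extraction it simply treats the non-trivial words in each document as its key terms, then places a document into the first existing stack whose terms overlap enough (measured by Jaccard similarity). The stopword list, the 0.2 threshold and the sample documents are all assumptions for illustration.

```python
import re

STOPWORDS = {
    "the", "a", "an", "and", "of", "to", "for", "in", "on", "is",
    "are", "we", "our", "at", "this", "that", "with", "will", "be",
}


def key_terms(text: str) -> set:
    """Crude stand-in for noun extraction: lowercase words minus stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if len(w) > 2 and w not in STOPWORDS}


def jaccard(a: set, b: set) -> float:
    """Share of terms two sets have in common (0.0 when either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_by_topic(docs, threshold=0.2):
    """Greedily assign each document to the first sufficiently similar stack.

    docs: {name: text}. Returns a list of stacks, each a list of names.
    """
    groups = []
    for name, text in docs.items():
        terms = key_terms(text)
        for group in groups:
            if jaccard(terms, group["terms"]) >= threshold:
                group["docs"].append(name)
                group["terms"] |= terms
                break
        else:
            groups.append({"terms": set(terms), "docs": [name]})
    return [g["docs"] for g in groups]


docs = {
    "forecast_q3.xls": "Sales forecast for Q3 shows sales growth in the western region",
    "bug_1423.txt": "Software bug report: login page crashes after software update",
    "picnic.doc": "Company picnic memo: picnic starts at noon at the park",
    "forecast_q4.xls": "Revised sales forecast for Q4 western region",
    "bug_1431.txt": "Bug report: software crashes on login after update",
}
stacks = group_by_topic(docs)
for stack in stacks:
    print(stack)
```

The documents arrive interleaved, as they might from a mailbox, and come out as three stacks: the two forecasts, the two bug reports, and the picnic memo on its own.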


Think of it as creating the virtual equivalent of a stack of papers for review. Unlike paper, however, technology allows the reviewer to follow a chain of thought through electronic copies. The reviewer can then dynamically reorganize the documents into different or more detailed stacks.


Slowly but surely, these tools and techniques can whittle down the amount of data that needs to be taken into a courtroom – often by a factor of 10 or more. Certainly this can all be done manually, and, in fact, was always done by hand in the past. With the exponential increase in the amount of email and other content in the world, technology is obviously the only way to address increasingly detailed and burdensome review requirements.


“Talk to me, Goose. Talk to me.”


You might think you’re Maverick, but don’t try to go it alone on these projects. Document everything and make sure your legal department understands how you are culling and filtering data (and get their approval in writing). You might not be court-martialed, but a “flame out” on a project like this won’t launch your career like Top Gun did for Tom Cruise.


So in a very real sense, the IT group might be able to use technology to help win the case or even prevent it from ever going to trial. And that is certainly the very best way to keep a TB of data out of the courtroom.

