Discovering information relevant to the subject of an inquiry or dispute is one of the core elements of the investigative and litigation process. Today, that discovery is primarily of the electronic kind. In fact, “e-discovery” has become one of the prominent concerns in the legal community and, as most attorneys could attest, is far from easy.


The explosion in the amount of email and electronic documents that the average company generates and the wide distribution of such documents across the corporate enterprise make it challenging to identify potentially relevant information. Most e-discovery searches today use software to “machine read” or index the text in a document, and a set of simple keywords - devised and agreed to by the parties in a dispute - to sort a company’s cache of digital documents into those containing search terms (i.e., “hits”). The problem is, while this may be a major improvement over a manual review of every document, it’s less effective than commonly believed.


For instance, in studying a case involving a San Francisco subway accident, Blair and Maron (as reported in the August 2007 Sedona Conference Journal) revealed that attorneys’ e-discovery efforts found only 20 to 25 percent of relevant documents. This is far below the 75 percent that the attorneys themselves believed they had found. Similarly, the Text Retrieval Conference (TREC), sponsored by the National Institute of Standards and Technology (NIST), found that Boolean searches - as tested by participating computer scientists and mathematicians - returned just a fraction of the relevant information available.


The primary reason for these discrepancies is that even experienced legal-search professionals cannot come up with an all-encompassing list of the words and phrases that parties involved in cases may use to describe events or things (e.g., referring to accidents as “disasters,” “events” or “unfortunate incidents”).


Fortunately, e-discovery technology has kept pace with the growth in electronic document volume. New and advanced technologies and methods now enable attorneys to not only identify a greater percentage of potentially relevant documents, but also to present (i.e., visually organize) those documents in a way that promotes more efficient and accurate review - thus, substantially reducing the probability that information useful to one’s case is overlooked.


These methods endeavor to overcome search term shortcomings through algebra, linguistics, statistics and iterative learning. Applied scientifically, these methods lend themselves to the quality assurance step of estimating error rates (i.e., false positives and negatives) and are designed to build upon and enhance the foundational work of attorneys brainstorming search terms at a whiteboard.
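
As a rough illustration of that quality assurance step, the two error types map onto the standard measures of precision (the share of hits that are relevant) and recall (the share of relevant documents that were hit). The sketch below is only an example; the function name and the counts are hypothetical and do not come from any of the studies cited above.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) computed from review counts.

    true_pos:  relevant documents the search retrieved
    false_pos: irrelevant documents the search retrieved
    false_neg: relevant documents the search missed
    """
    retrieved = true_pos + false_pos
    relevant = true_pos + false_neg
    precision = true_pos / retrieved if retrieved else 0.0
    recall = true_pos / relevant if relevant else 0.0
    return precision, recall


# Hypothetical counts: 1,000 hits, 400 of them relevant, and 1,600 relevant
# documents that the search terms never touched.
precision, recall = precision_recall(true_pos=400, false_pos=600, false_neg=1600)
print(f"precision={precision:.0%} recall={recall:.0%}")
```

With these made-up counts, the search looks respectable from the inside (four in ten hits are relevant) while still missing four out of five relevant documents, which is the kind of gap the studies above describe.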


Based on more advanced mathematical models, these tools help overcome the deficiencies of search terms limited by attorneys’ inability to know every possible colloquialism and variation that people may use to describe something. Some of these methods are as follows:


  • Algebraic methods include extended Boolean and proximity operators (for example, big w/5 red w/5 (car OR automobile)) that try to better emulate human language and improve search precision and recall. The TREC study found that, in some cases, extended Boolean outperformed other Boolean approaches twofold.
  • Linguistic methods select, classify or categorize documents based on a supplied taxonomy or thesaurus. For example, a search for the word “car” may be automatically extended to include many synonyms, such as automobile, bus, hatchback, ride, subcompact or wheels.
  • Statistical methods may select, classify or categorize documents based on frequency or probability theories. A conceptual search, for example, may return documents about the people, places or things statistically correlated to a document with a search term - even though the document may not contain the search term.
  • Iterative learning methods may be employed by expert researchers in a variety of ways, including building a keyword or phrase list from important words and phrases identified during a pilot review of key custodians’ documents. These key terms may then be applied to retrieve the documents of secondary or peripheral custodians.
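
To make the first two methods concrete, here is a minimal sketch of a proximity match combined with thesaurus expansion. It assumes that "w/5" means "within five words, in either order" (commercial tools differ on the exact semantics), and the thesaurus, function names and sample sentences are all hypothetical.

```python
import re

# Hypothetical thesaurus; in practice the review team would supply one.
THESAURUS = {"car": {"car", "automobile", "hatchback", "subcompact"}}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def expand(term):
    """Return the term together with any synonyms listed in the thesaurus."""
    return THESAURUS.get(term, {term})


def within(text, left_terms, right_terms, distance):
    """True if any left term occurs within `distance` words of any right term."""
    tokens = tokenize(text)
    left = [i for i, tok in enumerate(tokens) if tok in left_terms]
    right = [i for i, tok in enumerate(tokens) if tok in right_terms]
    return any(abs(i - j) <= distance for i in left for j in right)


docs = [
    "The red automobile left the depot at noon.",
    "Red ink on the ledger; the car was sold years later at auction.",
]

# Roughly: red w/5 (car OR automobile OR ...), with the thesaurus "turned on".
hits = [d for d in docs if within(d, {"red"}, expand("car"), 5)]
print(hits)
```

Only the first sentence is returned: it never uses the word "car", but the thesaurus catches "automobile", while the second sentence contains both "red" and "car" too far apart to satisfy the proximity condition.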

An example illustrates how these methods can be combined in an effort to improve precision and recall over a simple keyword-based approach. Assume there were 20 people involved in a particular case, four of whom were considered key - the “inner circle,” or people who would intuitively be highly involved in the issue at hand. An attorney could read all of the emails of the top four people and identify the words or terms these people actually use to describe the people, places, things and events involved in the case. The attorney then could convert this list of terms into extended Boolean (e.g., proximity, truncation, wildcards and the like), perhaps with thesauri “turned on,” to search the emails of the remaining 16 custodians who may have some tangential involvement in the case. If warranted, the attorney could use a concept search tool to return additional documents that are conceptually related to the highly relevant documents. Lastly, as part of the quality assurance plan, statistical sampling could be used to estimate how well the search and retrieval strategy finds sought-after documents, as well as how accurately reviewers mark them. The results may call for a supplemental strategy, but they will also help make the strategy defensible - especially where parties disagree on the approach.
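
The sampling step at the end of that workflow can be sketched as follows. This is one simple way to do it, assuming a simple random sample of the documents the search did not retrieve and a normal-approximation margin of error; the function names, sample size and counts are hypothetical.

```python
import math
import random


def sample_unretrieved(doc_ids, sample_size, seed=0):
    """Draw a reproducible simple random sample of documents for re-review."""
    rng = random.Random(seed)
    return rng.sample(doc_ids, sample_size)


def estimate_rate(sample_flags, z=1.96):
    """Estimate a proportion and its approximate 95 percent margin of error.

    sample_flags holds one boolean per sampled document
    (True = a reviewer marked it relevant).
    """
    n = len(sample_flags)
    p = sum(sample_flags) / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, margin


# Hypothetical outcome: reviewers re-read 400 sampled non-hit documents
# and mark 20 of them as relevant.
to_review = sample_unretrieved(list(range(100_000)), 400)
flags = [True] * 20 + [False] * 380
rate, margin = estimate_rate(flags)
print(f"estimated miss rate: {rate:.1%} +/- {margin:.1%}")
```

In this made-up case, roughly 5 percent of the unretrieved documents appear to be relevant, give or take about two percentage points. Whether that is acceptable is a judgment for counsel, but the figure gives the parties something measurable to discuss.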


Once identified and retrieved, documents must be presented in a way that optimizes reviewers’ time and expertise (i.e., puts the documents most likely to be important in front of the experts first). There are many ways to accomplish such “visualization” - from a simple list of documents and a timeline (which arrays documents over time) to more complex methods that give reviewers more flexibility in how they can access and review documents. Different visualizations suit different review objectives - and there are tradeoffs.


For example, a folder structure much like that used in Microsoft’s Windows Explorer can speed review by organizing documents under meaningful labels and folders: documents concerning Client A go into Folder A, those concerning Client B go into Folder B, and so on. Technological tools can perform this organization automatically. This organization drives efficiency because it helps reviewers identify which folders likely contain nonresponsive information (and, thus, can be moved through quickly) and which likely contain information possibly related to the case (and, thus, require close scrutiny). The drawback of the folder technique is that, given the size of typical document databases, the thousands of folders and subfolders are unwieldy and difficult to navigate quickly. In other words, a researcher can get lost.


In my experience, the most effective visualization technique for a huge repository of email is one that is based on an approach used for many years by accountants and CFOs to aggregate massive amounts of financial data (such as tens or even hundreds of millions of sales transactions) in a structured way. This “row and column” structure, commonly found in Excel spreadsheet pivot tables and business intelligence software, enables finance personnel to get a high-level look at their organizations’ finances, as well as slice and dice the data to get more detailed financial performance data by business unit, product line, geography or some other attribute.


Recently, this method has been adapted to manage emails and other such unstructured data - where, for instance, the rows could hold custodians and the columns could hold key search terms (or other variables such as review status or concepts, depending on what the reviewer specifies). This method of visualization overcomes the main shortcoming of the folder technique because it displays on one page the documents related to an entire case - thus making it very easy for reviewers to determine, at a glance, who the key custodians are in a case and which of those custodians have documents containing search terms of interest (i.e., the people exchanging communications most relevant to the litigation or investigation at hand).


This intelligence would enable attorneys to quickly determine which people to interview first and which documents should be handled by the most expert reviewers, as well as identify those custodians who likely have only peripheral involvement in the matter at hand and, thus, merit less attention. And like a financial spreadsheet, this technique enables reviewers to drill down into any one area to get more details (for example, by clicking the cell at the intersection of custodian “John Doe” and “Company A” to read all the documents in which John Doe is talking about Company A). The result is less wasted time sifting through irrelevant documents and more time spent on documents that could make or break a case.
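
A minimal sketch of this custodian-by-term grid, including the drill-down into a single cell, might look like the following. It is not any particular product's implementation; the document records, field names and function names are hypothetical, and the naive substring match stands in for whatever search method the review actually uses.

```python
from collections import defaultdict


def build_pivot(documents, terms):
    """Map (custodian, term) -> ids of that custodian's documents containing the term."""
    cells = defaultdict(list)
    for doc in documents:
        text = doc["text"].lower()
        for term in terms:
            if term.lower() in text:  # naive substring match, for illustration only
                cells[(doc["custodian"], term)].append(doc["id"])
    return cells


def print_grid(cells, custodians, terms):
    """Print one row per custodian and one column of hit counts per term."""
    print("custodian".ljust(12) + "".join(t.ljust(12) for t in terms))
    for person in custodians:
        counts = [str(len(cells.get((person, t), []))) for t in terms]
        print(person.ljust(12) + "".join(c.ljust(12) for c in counts))


# Hypothetical documents.
documents = [
    {"id": 1, "custodian": "John Doe", "text": "Call with Company A about the accident"},
    {"id": 2, "custodian": "John Doe", "text": "Company A invoice"},
    {"id": 3, "custodian": "Jane Roe", "text": "Lunch on Friday?"},
    {"id": 4, "custodian": "Jane Roe", "text": "Accident report attached"},
]
terms = ["Company A", "accident"]

cells = build_pivot(documents, terms)
print_grid(cells, ["John Doe", "Jane Roe"], terms)

# "Clicking" the John Doe / Company A cell returns the underlying documents.
drill_down = cells.get(("John Doe", "Company A"), [])
print(drill_down)
```

The grid shows at a glance that John Doe has two documents mentioning Company A and Jane Roe has none, and the drill-down returns the ids of the documents behind that one cell.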


To be sure, more scientific approaches to e-discovery are not without error. However, their error rates can be measured and they compare favorably with historic simple keyword search approaches. Indeed, scientific methods should more than enable counsel to fulfill their obligation to make a good-faith, reasonable effort to identify and produce relevant documents. In addition, more advanced methods for organizing and presenting search results can help reviewers find the answers more quickly and, consequently, save their clients substantial time and money.


The views expressed in the article are held by the author and are not necessarily representative of FTI Consulting, Inc.
