Like many fast-paced industries, e-discovery seems to create and abandon trends with every news cycle. Observers can find it difficult to stay abreast of the latest terms and technologies as e-discovery remains a highly specialized segment of the technology world. Yet the popularity of “big data” has cast a spotlight on how truly massive collections of data can be better managed and understood.

To be clear: The presence of big data is undeniable, and it is here to stay. Those who dismiss big data as a flavor of the month are often unaware of its evolution; indeed, the entire field of e-discovery can be considered a prime progenitor of the kind of data analysis we now know as big data. E-discovery was one of the first areas outside the realm of Internet search engines in which users had to quantify, examine, scale and make sense of truly massive amounts of data. The technology and players in Internet search and e-discovery continue to cross-pollinate, and big data has begun to deliver massive benefits to e-discovery.

Big Data and Data Science

Some of the confusion surrounding big data can be attributed to the fact that the term means different things to different people. Gartner helped popularize the term, and in 2012 Gartner analyst Douglas Laney refined the definition in the report “The Importance of 'Big Data': A Definition,” as follows: “Big data are high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.”

This definition allows ample room for interpretation, which explains why so many organizations have rolled out big data offerings or initiatives. One aspect of big data, however, the need to derive meaning from it, has given rise to its own discipline: data science. Data science applies predictive analytics, machine learning, crowdsourcing, sentiment analysis, supervised learning and a host of other powerful tools to extract meaning or knowledge from data.

Data science is, indeed, a science. Like all sciences, it relies on the scientific method – the most powerful problem-solving tool yet devised. The scientific method mandates that a hypothesis be supported by evidence and that other researchers be able to independently reproduce findings. Reliance on measurement, testing and iteration is central to the scientific method.

How a Scientific Approach Compels the Industry to Improve

Robert Pirsig wrote, “The real purpose of the scientific method is to make sure nature hasn’t misled you into thinking you know something you actually don’t know.”

While data science was born in universities, labs and research groups, e-discovery was born of a practical need to adapt the legal discovery process to the world of electronic information. The early days of e-discovery can charitably be characterized as ad hoc. In the early 2000s, e-discovery practitioners simply did their best to adapt traditional (i.e., paper) discovery best practices to the rising volume of data. The earliest attempts involved printing all documents and “backing up the truck” – a method still occasionally used today. But this approach was quickly abandoned as unsustainable against ever-growing volumes of data. In response, litigators and vendors continued to adapt paper processes to the electronic realm, a strategy reinforced by the conservative nature of the legal industry. The effectiveness of these technologies and processes was rarely measured; they were believed to be tried and true and therefore required a minimum of complicated technology justification in court.

In the last five years, however, the landscape has shifted dramatically. Some long-held assumptions (or hypotheses) about the reliability of past data analysis techniques have not withstood scrutiny. Indeed, as the Pirsig quotation above warns, the e-discovery industry appears to have been misled into thinking it knew something it actually did not, and some of the field's bedrock principles and practices have been upended. Numerous intensive studies have revealed that ad hoc keyword searching misses most of the expected documents, and that traditional linear document review is not nearly as accurate as previously assumed. While these revelations have been a blow to service providers that relied on those techniques for years, they have pushed the industry to re-examine itself from the ground up and to evolve according to data-driven hard science rather than the pseudo-science of custom, received wisdom and general assumptions. In short: The scientific approach, when applied to e-discovery, compels the industry to improve.

A perfect example is the case of keyword searches in e-discovery. For years, conventional wisdom held that keyword searches would locate the documents most likely to be responsive. Culling was achieved by working from a list of keywords created by subject matter experts and/or attorneys to determine which documents should be flagged for review. Since keyword lists were created by those familiar with the documents and the matter’s issues, they were assumed to be the best method for gleaning the most relevant material from the data. However, once this assumption was tested, unvalidated search terms were found to be exceedingly inaccurate. TREC Legal Track studies have routinely shown that such searches return as little as 10-20 percent of the relevant materials. These studies confirmed what information scientists had reported more than 20 years earlier: in their 1985 study, “An Evaluation of Retrieval Effectiveness for a Full-Text Document Retrieval System,” David Blair and M.E. Maron found that participants believed they had located 75 percent of the relevant materials using keyword searches when they had actually retrieved only about 20 percent.
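The arithmetic behind those findings is worth making concrete. Recall is the share of truly relevant documents a search actually retrieves; precision is the share of retrieved documents that are truly relevant. The short Python sketch below computes both against a small hand-labeled sample; the keywords and documents are invented purely for illustration.

    # Minimal sketch: estimating the recall and precision of a keyword
    # search against a hand-reviewed sample. All data is illustrative.

    KEYWORDS = {"contract", "breach", "termination"}  # hypothetical terms

    def keyword_hit(text: str) -> bool:
        """Return True if any search term appears in the document text."""
        return bool(KEYWORDS & set(text.lower().split()))

    # Each tuple: (document text, relevant? as judged by a human reviewer)
    sample = [
        ("the contract was signed in march", True),
        ("notice of termination attached", True),
        ("they walked away from the deal", True),   # relevant, no keyword
        ("lunch menu for the quarterly offsite", False),
        ("breach of the agreement alleged", True),
    ]

    retrieved = [doc for doc, _ in sample if keyword_hit(doc)]
    relevant = [doc for doc, rel in sample if rel]
    true_hits = [doc for doc, rel in sample if rel and keyword_hit(doc)]

    recall = len(true_hits) / len(relevant)      # relevant docs found
    precision = len(true_hits) / len(retrieved)  # hits that are relevant

    print(f"recall: {recall:.0%}, precision: {precision:.0%}")

Note that the keyword list scores perfectly on precision yet still misses the relevant document that happens to use different vocabulary – exactly the gap Blair and Maron measured.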

But testing is only half the battle. Knowing that keyword searches are deficient does not by itself improve accuracy; that knowledge must be tied to concrete action. For example, validating search results through sampling, then using an iterative process to feed what the samples reveal back into the search terms, has proven to be a highly effective way to improve results.
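As a rough sketch of what that iterative loop looks like, the following Python example simulates it on a toy corpus: run a keyword search, have reviewers judge a random sample to estimate recall, and expand the term list when sampling surfaces relevant documents the search missed. The corpus, the simulated reviewer judgments and the 80 percent recall target are all illustrative assumptions, not a prescription from any court or protocol.

    import random

    # Toy sketch of the iterative validate-and-refine loop. Each document
    # is (text, truly_relevant); in a real matter the relevance flag is
    # unknown and approximated by human review of the sampled documents.
    random.seed(7)
    corpus = (
        [(f"breach of contract memo {i}", True) for i in range(40)]
        + [(f"deal fell through, we walked away {i}", True) for i in range(40)]
        + [(f"cafeteria newsletter {i}", False) for i in range(920)]
    )

    def search(terms):
        """Return documents containing any of the search terms."""
        return [doc for doc in corpus if any(t in doc[0] for t in terms)]

    def sampled_recall(hits, sample_size=200):
        """Review a random sample of the whole corpus and estimate recall:
        of the relevant documents in the sample, how many did we find?"""
        sample = random.sample(corpus, sample_size)
        relevant = [d for d in sample if d[1]]   # simulated human judgment
        if not relevant:
            return 1.0
        return sum(d in hits for d in relevant) / len(relevant)

    terms = ["breach", "contract"]          # initial attorney keyword list
    for round_no in range(1, 4):
        hits = search(terms)
        recall = sampled_recall(hits)
        print(f"round {round_no}: terms={terms}, recall={recall:.0%}")
        if recall >= 0.80:                  # illustrative validation target
            break
        # Sampling surfaced relevant documents the terms missed; a reviewer
        # would add that vocabulary and run the search again.
        terms.append("walked away")

On the first pass, the estimated recall falls well short of the target because documents describing the deal in other words never match; once sampling exposes that vocabulary and the term list is expanded, recall rises and the loop ends.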

Data science techniques are being successfully adopted and adapted by the e-discovery industry. Electronic discovery is not new to the scientific method, as the 1985 Blair and Maron report attests, and advocates of data sampling to test results have been active for many years. But these methods are now gaining traction, and the rate of adoption is increasing. In the last five years, The Sedona Conference and EDRM have both published detailed reports and guides on how to validate e-discovery work. TREC, EDI and even the courts have underscored the need to measure search and review efforts in terms of recall, precision and accuracy. The scrutiny now applied to technology-assisted review tools is clear evidence of the positive impact big data and data science are having on e-discovery. The ultimate validation of this more scientific approach will be reduced risk, expedited timelines and, most important, lower overall e-discovery costs. Of course, as with all scientific speculation, this hypothesis must be tested as well.