Relying on keyword search for e-discovery? It may harm your case: important pitfalls and how to escape them.

December 5, 2017 | Committees

By: Charles-Theodore Zerner

Many attorneys rely exclusively on “Boolean” keyword searches to identify relevant documents during e-discovery. For most of us, this is a familiar technology.[1] We learned how to draft Boolean search queries in law school, or on the job, performing legal research. And the process feels comfortable and quotidian: we are accustomed to typing words into search boxes and receiving results that efficiently meet our information needs. Perhaps for this reason, we trust ourselves to draft effective Boolean search queries. And we sincerely believe that our queries are capturing the vast majority of the relevant material.

The problem is that they often don’t. How often? How deficient are they? Formal studies suggest that attorneys using keyword search tools grossly overestimate the percentage of relevant documents they have found. For example, in a well-known early study, experienced attorneys and paralegals were instructed to use Boolean keyword searches to identify at least 75% of the documents relevant to a set of discovery requests. When the searchers believed they had met this goal, they had in fact retrieved only about 20% of the relevant documents.[2] More recent studies have reached similar conclusions. The 2007 TREC Legal Track study found that attorneys relying on Boolean search queries captured only 22% of the universe of relevant documents identified using a broader array of tools.[3] How is this possible?

The reality is that Boolean search technology is significantly less effective and more difficult to use than many attorneys realize. And one needs to think about its limitations to use it correctly. If you want to improve your search queries, read on.

 

The Boolean keyword search Catch-22.

Boolean search suffers from a basic problem: unlike modern search engines, it only returns matching documents. That is, it returns documents whose text contains the exact words and phrases provided by the user, within the particular conditions specified.[4] In other words, the attorney must already know what terms are contained in the relevant documents (and in what configurations they appear) to create a search that returns them.
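To make the limitation concrete, here is a minimal Python sketch of exact-match Boolean retrieval with a proximity operator. It is an illustration only and does not reflect how any particular e-discovery platform is implemented; the helper names (tokenize, positions, within) and the three sample documents are hypothetical, and distance is measured between the starting words of each term as a simplifying assumption.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def positions(tokens, term):
    """Start positions at which the exact term (one or more words) occurs."""
    words = tokenize(term)
    n = len(words)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]

def within(tokens, left_terms, right_terms, distance):
    """True if any left term starts within `distance` words of any right term."""
    left = [p for t in left_terms for p in positions(tokens, t)]
    right = [p for t in right_terms for p in positions(tokens, t)]
    return any(abs(a - b) <= distance for a in left for b in right)

# Hypothetical three-document collection.
docs = {
    "A": "The dog stole a steak from the grill.",
    "B": "Our puppy got into the ribeye again.",
    "C": "The dog barked all night.",
}

# Equivalent of: (dog OR "Mr. Sniffs") w/10 (steak OR "filet mignon")
hits = [
    name for name, text in docs.items()
    if within(tokenize(text), ["dog", "Mr. Sniffs"], ["steak", "filet mignon"], 10)
]
print(hits)  # ['A']
```

Document B describes the same kind of event, but because its author wrote "puppy" and "ribeye," the query never sees it. Nothing in the result list signals that anything was missed.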

This is a vicious circle. You only find what you already know to look for. As a result, attorneys who rely primarily on their intuition and knowledge of the case to select keywords often miss large numbers of relevant documents without ever knowing it.

So how should one determine what to search for? No matter how small a case, an attorney relying on keyword search should, at minimum, (i) speak to the relevant documents’ authors or custodians to get information about the terms, abbreviations, and other identifying information they contain; (ii) cooperate with opposing counsel to ensure the search accurately reflects the proper scope of the discovery request; and (iii) test the queries to ensure they properly capture the relevant documents.[5] This is the minimum. Unless you meet it, don’t expect your search to hold up in court.[6]
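One way to approach the testing step is to review a random sample of the documents the query did not return and use it to estimate how much relevant material was left behind, an approach sometimes described as elusion testing. The short Python sketch below shows the arithmetic. The function name and every number in it are hypothetical and chosen only for illustration, and a real protocol would also need to account for sampling error.

```python
def estimate_recall(relevant_in_hits, null_set_size, sample_size, relevant_in_sample):
    """Estimate recall from a reviewed random sample of the unreturned documents."""
    # Share of the sample (drawn from documents the query missed) that was relevant.
    missed_rate = relevant_in_sample / sample_size
    # Project that share onto the whole set of unreturned documents.
    estimated_missed = missed_rate * null_set_size
    # Recall = relevant documents found / all relevant documents (found + missed).
    return relevant_in_hits / (relevant_in_hits + estimated_missed)

# Hypothetical figures: the query returned 2,000 relevant documents and left
# 90,000 documents unreturned; a 500-document sample of those contained 40
# relevant documents.
recall = estimate_recall(
    relevant_in_hits=2000,
    null_set_size=90000,
    sample_size=500,
    relevant_in_sample=40,
)
print(f"Estimated recall: {recall:.0%}")  # Estimated recall: 22%
```

In this made-up example, a query that looked productive would have captured only about one in five relevant documents.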

 

To draft effective queries, you must account for synonymy, polysemy, and contextual meaning.

In addition to investigating what to search for, an attorney must draft search queries that will capture the relevant documents. Drafting effective Boolean queries is extremely difficult. This is so for three reasons: (1) synonymy, (2) polysemy, and (3) contextual meaning.

First, a Boolean search will not return a document if the author uses a synonym, alternate expression, colloquialism, or abbreviation rather than the keyword. Thus, to capture a higher percentage of relevant documents, one must carefully draft the queries to include the keyword and its functional synonyms.
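As a rough illustration of what drafting for synonymy involves, the Python sketch below expands each keyword into an OR-group of functional synonyms. The synonym lists are hypothetical; in practice they would come from custodian interviews and sample documents, not from the drafter's intuition. The output syntax is generic and may need adjusting for a given review platform.

```python
# Hypothetical synonym lists; real ones should come from the custodians.
SYNONYMS = {
    "contract": ["contract", "agreement", "deal", "K"],
    "terminate": ["terminate", "cancel", "walk away", "pull the plug"],
}

def expand(keyword, synonyms):
    """Build an OR-group from a keyword and its synonyms, quoting phrases."""
    variants = synonyms.get(keyword, [keyword])
    quoted = [f'"{v}"' if " " in v else v for v in variants]
    return "(" + " OR ".join(quoted) + ")"

query = expand("contract", SYNONYMS) + " AND " + expand("terminate", SYNONYMS)
print(query)
# (contract OR agreement OR deal OR K) AND (terminate OR cancel OR "walk away" OR "pull the plug")
```

Even with two concepts and four variants each, the query is already harder to read than the two-word search it replaces.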

Second, because words often have multiple meanings, keyword search has poor precision. Highly relevant keywords may return large numbers of irrelevant items—requiring complex Boolean restrictions that try to reduce these alternate uses without eliminating relevant material.[7] This problem is only exacerbated by the need to include functional synonyms. Indeed, effective Boolean queries are often long and difficult to interpret.
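The trade-off can be shown with a small worked example. The Python sketch below scores two queries on a hypothetical four-document collection in which "reserve" appears in several senses. The documents, relevance labels, and exclusion terms are all invented for illustration.

```python
import re

def words(text):
    """Set of lowercase words appearing in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def score(hits, corpus):
    """Return (precision, recall) for a list of hits against a labeled corpus."""
    relevant_total = sum(1 for _, relevant in corpus if relevant)
    relevant_hits = sum(1 for _, relevant in hits if relevant)
    precision = relevant_hits / len(hits) if hits else 0.0
    recall = relevant_hits / relevant_total
    return precision, recall

# Hypothetical corpus: (text, is_relevant). The case concerns loss reserves.
corpus = [
    ("Increase the loss reserve before the audit.", True),
    ("Please reserve a table for Friday.", False),
    ("He is a reserve on the team this season.", False),
    ("Reserve the conference table to discuss the loss reserve.", True),
]

# Query 1: reserve
broad = [d for d in corpus if "reserve" in words(d[0])]
# Query 2: reserve AND NOT (table OR team)
narrow = [d for d in broad if not ({"table", "team"} & words(d[0]))]

print(score(broad, corpus))   # (0.5, 1.0)
print(score(narrow, corpus))  # (1.0, 0.5)
```

The restriction removes both irrelevant documents, but it also removes a relevant one that happens to mention a conference table.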

Third, and most problematic, a Boolean keyword search does not capture contextual meaning. Thus, for example, a “smoking gun” email might simply state: “Yes. Do it tonight. Best, Bob.” The meaning of this document does not come from keywords, but from the context, including the identity of the sender and recipient, the date and time sent, as well as the text and context of other documents.[8]

To avoid eliminating relevant documents, it may be necessary to include all documents from important date ranges, all documents sent to or from certain email addresses, or all documents that contain relatively general terms, and then to proceed by process of elimination: searching for and removing batches of material that can be easily identified as irrelevant. As with identifying the relevant material, however, eliminating irrelevant data requires a priori knowledge of the documents.
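The include-broadly-then-eliminate approach can be sketched in a few lines of Python. Everything here is hypothetical: the email records, the addresses, the date window, and the phrases used to identify irrelevant batches. It is meant to show the order of operations, not a production workflow.

```python
from datetime import date

# Hypothetical email records.
emails = [
    {"id": 1, "sender": "bob@example.com", "sent": date(2016, 3, 14),
     "body": "Yes. Do it tonight. Best, Bob."},
    {"id": 2, "sender": "bob@example.com", "sent": date(2016, 3, 15),
     "body": "Lunch menu for Friday attached."},
    {"id": 3, "sender": "news@example.org", "sent": date(2016, 3, 14),
     "body": "Weekly newsletter: industry updates."},
    {"id": 4, "sender": "alice@example.com", "sent": date(2015, 1, 2),
     "body": "Happy new year!"},
]

KEY_SENDERS = {"bob@example.com"}                   # assumed key custodian
WINDOW = (date(2016, 3, 1), date(2016, 3, 31))      # assumed critical period
IRRELEVANT_MARKERS = ("lunch menu", "newsletter")   # assumed junk indicators

# Step 1: include by context (key sender or critical dates), not by keyword.
included = [
    e for e in emails
    if e["sender"] in KEY_SENDERS or WINDOW[0] <= e["sent"] <= WINDOW[1]
]

# Step 2: remove batches that can be confidently identified as irrelevant.
remaining = [
    e for e in included
    if not any(marker in e["body"].lower() for marker in IRRELEVANT_MARKERS)
]

print([e["id"] for e in remaining])  # [1]
```

The email that survives contains no case-specific keyword at all. It is retained only because of who sent it and when.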

Addressing these three concerns is particularly important when searching email communications, which (unlike most contracts or court opinions on Westlaw, for example) routinely use informal abbreviations, alternate phrasing, colloquialisms, referential language, and omissions that rely on context for their meaning.

 

Obtain an early data assessment, use a combination of tools, or both, to get better results at lower cost.

Keyword search is a vital tool. It is extremely effective if you know exactly what you are looking for. But for most e-discovery purposes, it is better used in combination with other e-discovery tools. Relying solely on search tools to identify relevant documents was the only option in 1990. But doing so in 2017 makes little sense. Many classification tools (such as email threading or near-duplicate detection) are intuitive and easy to use. Consider adding them to your e-discovery toolbox.
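For readers curious about what a classification tool does under the hood, the Python sketch below shows one common textbook technique for near-duplicate detection: comparing documents by the overlap of their three-word sequences ("shingles"). This is an assumption-laden illustration, not a description of how any commercial product works; the function names, sample texts, and any threshold you might apply to the score are hypothetical.

```python
import re

def shingles(text, k=3):
    """Set of overlapping k-word sequences in the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 1))}

def jaccard(a, b):
    """Shared shingles divided by total distinct shingles (0.0 to 1.0)."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

# Hypothetical documents.
draft_1 = "Please review the attached draft agreement before our call on Monday."
draft_2 = "Please review the attached draft agreement before our call on Tuesday."
other = "The quarterly sales figures are in the shared folder."

print(jaccard(draft_1, draft_2))  # 0.8
print(jaccard(draft_1, other))    # 0.0
```

Grouping documents that score above a chosen cutoff lets a reviewer code one version and apply the decision to its near-copies, with no keyword required.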

If the case implicates a large corpus of electronic documents (e.g., at least five custodians’ email over several years) and you need to cost-effectively identify the relevant material, consider speaking to an e-discovery attorney about the benefits of an early data assessment. Depending on your goals, an early data assessment can help you (i) more accurately estimate the costs of e-discovery review; (ii) significantly reduce review costs by safely culling large numbers of irrelevant documents without review; (iii) gain early insights into the merits by identifying relevant documents; (iv) identify the most relevant custodians or keywords; and, perhaps most importantly, (v) obtain the information you need to fashion an effective and reasonable e-discovery plan proportional to the needs of the case.

Wading into the waters of e-discovery can be daunting, not to mention time-consuming, frustrating, and costly. There is a lot to learn and it changes fast. And the simple fact is that we cannot all invest the time necessary to keep up. Instead, you and your client may both be better off if you turn to an attorney with advanced e-discovery experience as a resource when you face an e-discovery challenge.

 


[1]        A Boolean search connects keywords with Boolean operators, such as “AND,” “OR,” or “NOT.” For example: Contract AND (“promissory estoppel” OR “detrimental reliance”) AND “stipulation pour autrui”.

[2]        See, e.g., Moore v. Publicis Groupe, 287 F.R.D. 182, 190-191 (S.D.N.Y. 2012) (Peck, M.J.) (citing David L. Blair & M.E. Maron, An Evaluation of Retrieval Effectiveness for a Full–Text Document–Retrieval System, 28 Comm. ACM 289 (1985)); see also Maura R. Grossman & Gordon V. Cormack, Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, 17 Rich. J.L. & Tech. 11, 21 (2011).

[3]        The Sedona Conference, The Sedona Conference Best Practices Commentary on the Use of Search & Information Retrieval Methods in E-Discovery: A Project of the Sedona Conference Working Group on Electronic Document Retention & Production (WG1), 15 Sedona Conf. J. 217, 263 (2014) (citing Stephen Tomlinson et al., Overview of the TREC 2007 Legal Track, http://trec.nist.gov/pubs/trec16/t16_proceedings.html) (baseline Boolean search captured only 22% of universe of all relevant documents found by all combined search methods); see generally TREC Legal Track Overview Papers, 2006-2011, http://treclegal.umiacs.umd.edu/.

[4]        Thus, for example, a search for (Dog OR “Mr. Sniffs”) w/10 (steak OR “filet mignon”) will find every document that contains the keyword “dog” or the term “Mr. Sniffs,” but only if that keyword or term appears within ten words of the term “steak” or “filet mignon.”

[5]        William A. Gross Const. Associates, Inc. v. Am. Mfrs. Mut. Ins. Co., 256 F.R.D. 134, 134 (S.D.N.Y. 2009) (Peck, M.J.) (“[W]here counsel are using keyword searches for retrieval of ESI, they at a minimum must carefully craft the appropriate keywords, with input from the ESI’s custodians as to the words and abbreviations they use, and the proposed methodology must be quality control tested to assure accuracy in retrieval and elimination of ‘false positives.’ It is time that the Bar—even those lawyers who did not come of age in the computer era—understand this.”) (emphasis added); id. at 136; see also Moore v. Publicis Groupe, 287 F.R.D. 182 (S.D.N.Y. 2012) (Peck, M.J.).

[6]        See In re Direct Sw., Inc., Fair Labor Standards Act (FLSA) Litig., 2009 WL 2461716, at *1–2 (E.D. La. Aug. 7, 2009) (citing William A. Gross Const. Associates, Inc. v. Am. Mfrs. Mut. Ins. Co., 256 F.R.D. 134, 134 (S.D.N.Y. 2009) (Peck, M.J.)) (“The court said the decision ‘should serve as a wake-up call to the Bar in this District about the need for careful thought, quality control, testing, and cooperation with opposing counsel in designing search terms or “keywords” to be used to produce emails or other electronically stored information....’ The court described the case as: ‘[T]he latest example of lawyers designing keyword searches in the dark, by the seat of the pants, without adequate (indeed, here, apparently without any) discussion with those who wrote the emails.’ After reviewing some of the cases and commentators discussing the issue, the court said ‘the best solution in the entire area of electronic discovery is cooperation among counsel,’ and cited the Sedona Conference Cooperation Proclamation. The undersigned echos this statement.”).

 

[7]        E.g., the term “reserve” can be used as a noun, to designate an alternate athlete (bench warmer, pinch hitter), modesty or propriety, or a backlog or stockpile of something; as a verb, to earmark or set something aside, or to allow or allot something; or as an adjective, to denote something kept for the future, something inactive, or someone reticent.

[8]        Shannon Brown, Esq., MA, JD, Peeking Inside the Black Box: A Preliminary Survey of Technology Assisted Review (TAR) and Predictive Coding Algorithms for Ediscovery, 21 Suffolk J. Trial & App. Advoc. 221, 257–58 (2016) (“While seemingly obvious, the objective of search is to return ‘matching’ documents and assumes that the words alone in the document, and not necessarily the contextual meaning, denote relevance and thus matches.”).


