Can your in-house search tools do this or do you have to rely on external vendors?
When lawyers negotiate what data is relevant to exchange under the Federal Rules of Civil Procedure, they determine not only the custodians, file locations, and file extensions, but also a so-called negotiated Boolean query. The content of such a negotiated Boolean is extremely important for both parties: the claiming party typically wants to get as much data as possible, often resulting in a legal fishing expedition. The disclosing party aims to disclose as little information as possible to avoid costly legal reviews and to limit legal exposure and other legal risks.
According to Wikipedia, a Boolean query is defined as: “a query language for a search engine that supports Boolean operators (AND, OR, NOT) and parentheses. A user who is looking for documents that cover several topics or facets may want to describe each of them by a disjunction of characteristic words, such as vehicles OR cars OR automobiles. A faceted query is a conjunction of such facets; e.g. a query such as (electronic OR computerized OR DRE) AND (voting OR elections OR election OR balloting OR electoral) is likely to find documents about electronic voting even if they omit one of the words "electronic" and "voting", or even both.”
If a negotiated Boolean consists of long lists of keywords and the parties agree that there is only a disjunction (an OR relation) between them, the query will retrieve many irrelevant documents as well. If there are too many conjunctions, it may retrieve too few documents. So there is often a combination of AND and OR queries, nested with parentheses, such as: (funds OR cash) AND (transfer OR sweep).
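As a minimal sketch of how such a nested query behaves, the Python below evaluates (funds OR cash) AND (transfer OR sweep) against three made-up documents. The helper names (tokenize, any_of, all_of) and the sample texts are assumptions for illustration only, not part of any particular search product.

```python
import re

def tokenize(text):
    """Lowercase a text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def any_of(*words):
    """One facet: matches if at least one of the words occurs (OR)."""
    return lambda tokens: any(w in tokens for w in words)

def all_of(*facets):
    """Conjunction of facets: matches only if every facet matches (AND)."""
    return lambda tokens: all(f(tokens) for f in facets)

# (funds OR cash) AND (transfer OR sweep)
query = all_of(any_of("funds", "cash"), any_of("transfer", "sweep"))

# Hypothetical sample documents
docs = {
    "a": "Please sweep the cash into the holding account.",
    "b": "The funds are listed in the annual report.",
    "c": "We will transfer the files tomorrow.",
}

hits = [name for name, text in docs.items() if query(tokenize(text))]
print(hits)  # ['a']
```

Only document "a" satisfies both facets. A pure OR over all four keywords would have returned all three documents, which is the over-retrieval described above.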
Words joined by AND can still occur out of context: one may appear at the beginning of a document and the other at the end, so that there is no real relation between them as intended in the original Boolean query. To avoid retrieving such documents, the so-called PROXIMITY operator is popular. It finds words that occur within, for instance, 5 words of each other (often denoted as W/5), or cases where one word precedes another within 5 words (DIRECTED PROXIMITY, often denoted as P/5).
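The difference between the two proximity operators can be sketched in a few lines of Python. This is an illustration under assumptions: "within n words" is taken to mean that the token positions differ by at most n, and the function name within and the sample sentence are made up. Real products may count distance differently.

```python
import re

def positions(tokens, word):
    """All positions at which `word` occurs in the token list."""
    return [i for i, t in enumerate(tokens) if t == word]

def within(text, left, right, n, directed=False):
    """True if `left` and `right` occur at most n token positions apart.

    With directed=True, `left` must also come before `right`.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    for i in positions(tokens, left):
        for j in positions(tokens, right):
            distance = j - i
            if directed and 0 < distance <= n:
                return True
            if not directed and 0 < abs(distance) <= n:
                return True
    return False

# Hypothetical sample text: "transfer" is token 1, "funds" is token 9
text = "The transfer was approved on Monday. Weeks later, unrelated funds arrived."

print(within(text, "funds", "transfer", 5))                  # False
print(within(text, "funds", "transfer", 10))                 # True
print(within(text, "funds", "transfer", 10, directed=True))  # False
print(within(text, "transfer", "funds", 10, directed=True))  # True
```

A plain AND query would match this text even though the two words are unrelated. The window size is what the parties end up negotiating.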
One of the problems in negotiating such a Boolean query is to define the various forms in which nouns and verbs occur, such as inflections, plurals, prefixes, suffixes, compound nouns (very common in languages such as German or Dutch), abbreviations, named entity variations, synonyms, spelling errors, spelling variations (e.g. US and UK English, or pharmaceutical or chemical names), and there is more! In order to address this, lawyers use so-called WILDCARDS and FUZZY searches. These operators allow word variations to be found. A typical example of such a wildcard search is SCHOOL*, which will find SCHOOL and SCHOOLS. Wildcards can also be used at the beginning of a word and/or in the middle of a word, and several wildcards can be combined.
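To make the wildcard positions concrete, here is a small Python sketch that expands a wildcard pattern against the vocabulary of a sample sentence. It uses the standard library's fnmatch module for the matching. The function name expand_wildcard and the sample sentence are assumptions for illustration. Note that this sketch simply scans the whole vocabulary, which is fine for a demonstration but is not how a large index would do it.

```python
import re
from fnmatch import fnmatchcase

def expand_wildcard(pattern, vocabulary):
    """Return every vocabulary word matched by a wildcard pattern.

    `*` stands for any run of characters (including none).
    """
    pattern = pattern.lower()
    return sorted(w for w in vocabulary if fnmatchcase(w, pattern))

# Hypothetical sample text
text = ("The school board met with three schools and a preschool "
        "about scholarship funds.")
vocabulary = set(re.findall(r"[a-z0-9]+", text.lower()))

print(expand_wildcard("SCHOOL*", vocabulary))   # ['school', 'schools']
print(expand_wildcard("*SCHOOL*", vocabulary))  # ['preschool', 'school', 'schools']
print(expand_wildcard("SCH*L*", vocabulary))    # ['scholarship', 'school', 'schools']
```

The last pattern shows why wildcards need care during negotiation: a wildcard in the middle of a word also pulls in SCHOLARSHIP, which may or may not be intended.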
Recently, the Lehman Brothers Boolean queries were disclosed. Here is what they looked like:
(fund* or cash) w/10 (transfer* or mov* or sweep*)
(large or big* or signific*) w/10 (collateral w/10 pledg* or mov*)
(securit* or asset*) w/10 (transfer* or mov* or pledg*)
(repo* or repurchase*) w/10 (transfer* or mov* or pledg*)
*solven* w/20 (transfer* or mov* or pledg*)
(*adequate* or *suffici* or concern* or enough or short) w/10 liquid*
*valu* w/10 (*model* or mark* or book) w/20 (wrong or update or *correct* or hit or P&L or haircut)
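As a rough illustration of what the first of these queries asks of a search engine, the Python sketch below evaluates (fund* or cash) w/10 (transfer* or mov* or sweep*) against two invented sentences. The function names, the sample texts, and the interpretation of w/10 as "token positions at most 10 apart" are all assumptions. It scans the text directly rather than using an index.

```python
import re
from fnmatch import fnmatchcase

def match_positions(tokens, patterns):
    """Positions of tokens matching any wildcard pattern in the group (OR)."""
    return [i for i, t in enumerate(tokens)
            if any(fnmatchcase(t, p) for p in patterns)]

def group_within(text, left_patterns, right_patterns, n):
    """True if a match from each group occurs at most n positions apart."""
    tokens = re.findall(r"[a-z0-9&]+", text.lower())
    left = match_positions(tokens, left_patterns)
    right = match_positions(tokens, right_patterns)
    return any(0 < abs(i - j) <= n for i in left for j in right)

# (fund* or cash) w/10 (transfer* or mov* or sweep*)
left = ["fund*", "cash"]
right = ["transfer*", "mov*", "sweep*"]

# Hypothetical sample sentences
print(group_within("Treasury is moving the funding to London tonight.",
                   left, right, 10))  # True
print(group_within("Cash position unchanged.", left, right, 10))  # False
```

The first sentence matches through MOVING and FUNDING, neither of which appears literally in the query. That is exactly the work the wildcards are meant to do.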
As you can see, they make extensive use of proximity and wildcard searches. Many search engines cannot handle these types of constructions, or they become extremely slow on large data sets. The reason is that, in order to execute wildcard searches quickly and effectively, words that are alike according to some wildcard or fuzzy algorithm need to be organized in special data structures in your search index. This is something that needs to be taken into account at indexing time. If a search engine is not built to support wildcard, fuzzy and proximity searches with such special additional index structures (many search engines derived from web technology such as Google, Lucene, FAST and several others are not), then the only way to implement such wildcard searches is to try every possible occurrence of a particular word. A wildcard such as FUND* then needs to search for FUNDA, FUNDB, ... FUNDZ, FUNDAA, FUNDAB, ... FUNDAZ, etc. Some vendors use dictionaries of inflected words to limit the search scope, but that will not give you all spelling variations, and there is more and more case law where a judge orders a full wildcard search to be redone, often combined with penalties and sanctions.
If your search engine cannot handle wildcards properly, this combinatorial explosion will result in very slow search performance, and queries on terabyte collections can often take weeks or never finish at all. You can read more on the exact technical reasons why this is the case here: http://gcn.com/articles/2010/06/28/when-traditional-search-engines-fall-short.aspx or here: http://zylab.wordpress.com/2010/06/25/do-we-understand-the-benefits-and-limitations-of-traditional-web-search-engines-such-as-google-when-you-use-their-appliances-and-technology-in-house-for-mission-critical-applications/.
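To give a feel for what a special index structure built at indexing time can look like, here is a minimal Python sketch of one textbook technique: a sorted term dictionary searched with binary search, plus a second dictionary of reversed terms for leading wildcards. This is an assumption chosen for illustration. The class TermDictionary and its methods are hypothetical and do not describe how any named product works.

```python
from bisect import bisect_left

# Highest Unicode code point, used as an upper bound for prefix ranges
MAX_CHAR = chr(0x10FFFF)

class TermDictionary:
    """Sorted vocabulary built once, at indexing time (illustrative only)."""

    def __init__(self, terms):
        self.forward = sorted(set(terms))
        # Reversed terms turn a leading wildcard (*SOLVENT) into a prefix lookup.
        self.backward = sorted(t[::-1] for t in self.forward)

    @staticmethod
    def _prefix_range(sorted_terms, prefix):
        """All terms starting with `prefix`, found by two binary searches."""
        lo = bisect_left(sorted_terms, prefix)
        hi = bisect_left(sorted_terms, prefix + MAX_CHAR)
        return sorted_terms[lo:hi]

    def trailing(self, prefix):
        """Terms matching PREFIX*."""
        return self._prefix_range(self.forward, prefix)

    def leading(self, suffix):
        """Terms matching *SUFFIX."""
        found = self._prefix_range(self.backward, suffix[::-1])
        return sorted(t[::-1] for t in found)

# Hypothetical vocabulary extracted from a document collection
index = TermDictionary([
    "fund", "funds", "funding", "fundamental", "cash",
    "insolvent", "solvency", "insolvency", "solvent",
])

print(index.trailing("fund"))    # ['fund', 'fundamental', 'funding', 'funds']
print(index.leading("solvent"))  # ['insolvent', 'solvent']
```

The lookup only touches terms that actually occur in the collection, so there is no need to guess FUNDA, FUNDB and so on. A pattern with wildcards on both sides, such as *solven*, needs a further structure (k-gram indexes are one commonly described option), which this sketch does not cover.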
If your search engine does support these types of queries, you also have the option to full-text index the data locations with the highest legal risks (email, certain file shares and SharePoint) and then collect only the data that is retrieved by the negotiated Boolean queries from the start. This avoids high collection costs and the collection of irrelevant data. This is often called collection in the wild and is a huge money saver, especially if the collection is combined with Early Case Assessment and automatic (rolling) collections from certain data locations that result from the legal hold interviews.
So, if you plan to bring eDiscovery in-house, make sure that you use a search engine that supports fast wildcard, fuzzy and (directed) proximity search, and that can also handle large Boolean queries of hundreds of words, such as the ZyLAB search technology. Insist on a demo on a really large data set, and do not settle for a demo on only a few gigabytes. If your search engine cannot execute such complex queries, you will have to rely on (expensive) external vendors and service providers to execute the Boolean queries for you. Or even worse, you may be sent back by the court to redo your work, at the cost of sanctions, fines, penalties and a lot of other additional expenses.