Review: Emerging Technologies
Modern-day electronic discovery is a time-consuming and costly endeavor. Every additional hour reviewers must spend culling large data sets down to the 20% of ultimately responsive documents is additional cost to the responding party. A well-planned, thoughtful approach to the review process, careful and thorough vendor selection, and efficient management can yield overall time and cost savings. Utilizing emerging technologies that group like documents together can also enable a more targeted, accurate and efficient review--all of which can ultimately translate into significant cost savings to the responding party.
Document review platforms have undergone many technological enhancements over the past several years. New technologies to improve the efficiency and accuracy of the review process are appearing every day, from fuzzy searching to concept searching, near-duplicate detection, visualization and social network analysis. What are all of these tools, and how can they assist in the review process?
Sample Emerging Technologies
Keyword searching was the first significant enhancement to the efficiency of the electronic document review process, reducing data sets that can be terabytes in volume to far more manageable sizes. Keyword searching, however, is limited in what it can accomplish as a "first cut" at the data because it typically matches exact words only, not related concepts. Searching for exact words will naturally return a significant percentage of "false positives"--retrieved documents that do indeed contain a keyword, but in a context with no relevance to the discovery request. Advances in search technology will greatly enhance the review process.
"Concept" and "context" searching are technologies that offer users the ability to increase the efficiency and effectiveness of electronic discovery searching, organizing and review. Concept search technology may be based on neural networks, Bayesian methods, latent semantic indexing, or other high-level mathematical algorithms designed to learn the underlying associations among the words within the document collection. The index of the data generated by the specific technology basically maps the language use, word patterns, concepts and ideas of a document much like a human. This "black box" facilitates concept searching and allows the reviewer to search the documents for like ideas or similar concepts without having to match an exact keyword or phrase. These tools can also be used for categorization, clustering and duplicate (or near duplicate) detection. Concept-based tools may use customized t"hesauri and semantic networks, although these typically require human intervention and administration to build.
Context searching allows the user to define a search through keywords or phrases and then direct the system to find "similar" or "like" documents. There are many systems available that provide "Find more like this" functionality - even Yahoo offers Y!Q, a context-based search tool. This tool is particularly useful when a reviewer stumbles upon a key issue previously unidentified.
A concern with both technologies is the precision of the results. They are very useful in increasing recall (the proportion of relevant documents that are retrieved), but precision (the proportion of retrieved documents that are actually relevant) may suffer. Tools that rank the results by their overall relevancy to the concept submitted are most useful.
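The precision/recall trade-off described above can be stated as a short calculation. In this sketch, "retrieved" is whatever a search returned and "relevant" is the ground-truth set of responsive documents; the example numbers are hypothetical.

```python
# Minimal sketch of the precision/recall trade-off: a broader concept
# search retrieves more of the relevant set (higher recall) but at the
# cost of more false positives (lower precision).
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    true_hits = retrieved & relevant
    precision = len(true_hits) / len(retrieved) if retrieved else 0.0
    recall = len(true_hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical document IDs: 1-4 are the truly relevant documents.
keyword_p, keyword_r = precision_recall({1, 2}, {1, 2, 3, 4})
concept_p, concept_r = precision_recall({1, 2, 3, 5, 6}, {1, 2, 3, 4})
```

In this toy example the keyword search is perfectly precise (1.0) but recalls only half the relevant set, while the concept search recalls three-quarters of it at 0.6 precision--which is why relevance ranking of the broader result set matters.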
Auto-Coding or Clustering
Several electronic discovery vendors offer auto-coding or clustering capabilities. This process utilizes search technology to automatically identify "like" documents and form them into groups for review purposes. The underlying technology that performs this sorting typically utilizes some type of linguistic analysis, thesauri or concept searching.
Today, two basic approaches to grouping (sometimes referred to as "clustering") like documents exist: rules-based and example-based. In a rules-based model, the review team establishes criteria (rules) that help determine relevancy rates in the overall document collection. A rules-based approach is similar to keyword searching, but often provides higher recall than keyword searching, as the search engine may use proximity, word patterns, co-occurrence of key concepts and/or thesauri to determine search "hits" and relevancy.
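A rules-based "hit" can be illustrated with the proximity idea mentioned above. This is an assumed, highly simplified rule format--not any vendor's engine--where two terms count as a hit only when they appear within a given number of words of each other.

```python
# Illustrative rules-based hit test (a simplified assumption, not an
# actual vendor rule language): terms hit only when they appear within
# `window` words of each other, unlike a plain keyword AND.
def proximity_hit(text, term_a, term_b, window=5):
    words = [w.strip(".,").lower() for w in text.split()]
    positions_a = [i for i, w in enumerate(words) if w == term_a]
    positions_b = [i for i, w in enumerate(words) if w == term_b]
    return any(abs(i - j) <= window for i in positions_a for j in positions_b)

near = proximity_hit("The board approved the merger on Friday",
                     "board", "merger")
far = proximity_hit("The board met. Ten paragraphs later ... a merger",
                    "board", "merger", window=3)
```

A plain keyword search would return both documents; the proximity rule returns only the first, which is how rules-based grouping can improve on keyword hits alone.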
In an example-based approach, the system programmatically characterizes each document by the concepts identified within it. The system then groups contextually similar documents together for review. An additional benefit of most example-based approaches is that reviewers can employ discovered material or their knowledge of the matter to more narrowly group potentially relevant content. For example, when relevant documents are identified during review, an example-based system can regroup the collection of documents based on the reviewer-provided example (or examples) of the relevant document(s).
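The example-based regrouping step can be sketched under simple assumptions: represent each document as a bag of word counts and rank the collection by cosine similarity against a reviewer-tagged example, so contextually similar documents surface together. Production systems use far richer concept models; this only shows the mechanic.

```python
# Sketch of example-based grouping (simplified assumption: bag-of-words
# cosine similarity stands in for a real concept engine). Documents most
# similar to the reviewer's example are ranked first.
import math
from collections import Counter

def vectorize(text):
    return Counter(w.strip(".,").lower() for w in text.split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank_by_example(example, docs):
    """Return document indices ordered by similarity to the example."""
    ex = vectorize(example)
    scored = [(cosine(ex, vectorize(d)), i) for i, d in enumerate(docs)]
    return [i for s, i in sorted(scored, reverse=True)]

docs = [
    "quarterly earnings report for the board",
    "settlement negotiation notes and strategy",
    "draft settlement agreement and strategy memo",
]
order = rank_by_example("settlement strategy discussion", docs)
```

The two settlement-related documents rank ahead of the earnings report, mirroring how a reviewer-provided relevant document pulls its contextual neighbors to the front of the queue.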
Tools that allow for a visual map of the content of the dataset can be very useful in organizing key documents and prioritizing the review process. These technologies are typically based on word counts (nouns and noun-phrases) and allow the reviewer to navigate through a visual representation of these groups and their relationships. The groupings allow for bulk identification of key documents as well as isolation of non-responsive concepts.
Visualization tools can also be used to map social networks, or email conversations. Quite often who knew what, when they knew it and who communicated it to them are key considerations of a case. Social network technology allows the reviewer to trace a custodian's email conversations and easily see the history and direction of their email exchanges both within and outside of the organization.
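The underlying data for such a social network map is simple: a count of sender-to-recipient edges extracted from message metadata. The sketch below uses hypothetical addresses and an assumed minimal message format; real tools parse actual email headers.

```python
# Toy sketch of the social-network idea: count sender -> recipient
# edges from message metadata to see who communicated with whom, and
# how often. Addresses and message format are hypothetical.
from collections import Counter

def build_edges(messages):
    edges = Counter()
    for msg in messages:
        for recipient in msg["to"]:
            edges[(msg["from"], recipient)] += 1
    return edges

messages = [
    {"from": "alice@corp.com", "to": ["bob@corp.com"]},
    {"from": "alice@corp.com", "to": ["bob@corp.com", "eve@outside.org"]},
    {"from": "bob@corp.com", "to": ["alice@corp.com"]},
]
edges = build_edges(messages)
```

Plotting these weighted edges as a graph yields the "who communicated with whom" view, including the outside-the-organization contact that is often of particular interest.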
Benefits of These Technologies to the Review Process
This technology may not only group like documents based on similar context, but may also rank those documents by relevance to the searched-for keyword or concept, with the most relevant results listed first. From a time-efficiency standpoint, this allows reviewers to prioritize their review time and concentrate on the most likely relevant or responsive data first.
Once like documents are grouped together, batch categorizations can be undertaken which preserve consistency across the data set. Grouping and batch categorization can be made for privileged documents, for documents with similar issues of confidentiality, or for documents with similar issues of responsiveness. These technologies can also be used to group duplicates or near-duplicates together for batch categorization and consistency.
Categorization consistency within a case is obviously important, but there is also a need to be consistent in document categorization across cases. Modern-day litigation is often multi-jurisdictional, requiring different legal representation in different states and localities for cases involving similar issues and the same data set. Having a solution which can group like documents together based on context can significantly aid efforts to assure documents categorized as privileged in one jurisdictional case, for example, are similarly categorized in another. Failure to maintain such consistent categorization can raise, and has raised, significant legal questions regarding the waiver of privilege or confidentiality. It can also be useful in such cases to implement a litigation repository or archive to provide the review team with previously reviewed material that may be relevant to other legal matters.
These solutions can also prove useful in culling out or grouping data that is either known to be irrelevant and non-responsive, or which needs to be reviewed separately from the general electronic data population. Because these solutions can group contextually similar documents together, whole categories such as spam mail or foreign language documents can be separated from the general data population. A category like spam mail can potentially be ignored altogether, and a category like foreign language documents can be tasked to reviewers with foreign language expertise for analysis.
Review of duplicate or very similar electronic content is another time-consuming irritant of electronic data discovery. For example, if a corporation distributes a weekly sales report, an exact de-duplication culling process prior to review will not suppress the 52 very similar (but not identical) files generated over a year. Using solutions that group like documents or near-duplicates, a reviewer can decide whether all 52 are responsive or not. To the extent duplicate data can be grouped together at the outset of a search (as opposed to being stumbled upon randomly as the search proceeds), batch analysis and coding of documents can be facilitated, which again saves reviewer time and costs.
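One common way near-duplicate detection works (an illustrative choice here; actual vendor algorithms vary) is Jaccard similarity over word shingles: two documents that share most of their short word sequences score high even when individual values differ, exactly the weekly-report scenario above.

```python
# Sketch of near-duplicate detection via Jaccard similarity of k-word
# shingles (one common technique; not any specific vendor's algorithm).
# Exact de-duplication would treat the two reports as unrelated files.
def shingles(text, k=3):
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b):
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

week1 = "weekly sales report region east units sold 120 returns 3"
week2 = "weekly sales report region east units sold 135 returns 5"
unrelated = "cafeteria lunch menu monday soup and sandwiches"

sim_near = jaccard(week1, week2)     # well above zero: near-duplicates
sim_far = jaccard(week1, unrelated)  # zero: no shared shingles
```

Thresholding this score lets a system surface the 52 weekly reports as one group for a single batch responsiveness decision, rather than 52 separate ones.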
Getting a "wide perspective" of the electronic document universe is another highly beneficial aspect of emerging technologies like linguistic patterns and conceptual searching. Lack of structure is one of the biggest challenges when dealing with electronic documents and electronic discovery. When contextually similar documents are grouped together, certain facts about the document universe may be discerned. For instance, a grouping of like documents will often result in threads of e-mail messages sent, received, replied to and forwarded among a group of custodians. A reviewer can see the full context of the electronic exchange and make a more informed categorization decision regarding the group of e-mails as a whole, rather than categorizing each message as a stand-alone communication. Some technologies further accelerate review by enabling the attorney to review the last email in a unique thread and automatically and consistently mark all others in the thread accordingly.
Questions to Vendors
Some relevant questions to pose to vendors offering these technologies are:
- How do conceptual search systems determine the "concepts"? Does the user participate in the creation of a thesaurus or are the concepts automatically identified by the technology?
- Does the system rank the search results by relevance, and what criteria does it use in that ranking?
- Does the auto-categorization tool perform a first-cut categorization automatically, or does it require reviewers to submit criteria?
- Does the system allow the review team to further tailor the categories to its review requirements through the use of rules and/or examples? A key requirement of any solution is flexibility that does not impede the "rolling" nature of most large-scale review projects.
- Does the categorization tool place the same document in more than one folder or restrict it to only one?
- What tools and/or algorithms are used to identify email threads, duplicates or near-duplicates?