eDiscovery Analysis: Techniques and Tools

What are some of the most commonly available eDiscovery analysis tools and techniques?

Search

The most basic and commonly available analysis tool is search, which is used to find documents according to various selection criteria. In general, a good search tool should have most or all of these capabilities:

Multiple Fields

Represent each indexed document as a set of fields that can be queried separately, such as: Bates number or other unique identifier, subject or title, body content, created date, received date, last modified date, last accessed date, author or sender, "To" recipients for email, "Cc" recipients for email, "Bcc" recipients for email, and type of document (e.g., .doc, .ppt, .xls).
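
As a rough illustration of the fielded representation described above, the following Python sketch models one indexed item and a per-field term match. The class name IndexedDocument, the helper field_contains, and the exact choice of fields are assumptions made for this example; they are not taken from any particular search product.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class IndexedDocument:
    """One indexed item; each queryable field is stored separately."""
    doc_id: str                       # Bates number or other unique identifier
    subject: str = ""
    body: str = ""
    created: Optional[datetime] = None
    author: str = ""                  # author or sender
    to: list[str] = field(default_factory=list)
    cc: list[str] = field(default_factory=list)
    bcc: list[str] = field(default_factory=list)
    doc_type: str = ""                # e.g. "doc", "ppt", "xls"


def field_contains(doc: IndexedDocument, field_name: str, term: str) -> bool:
    """Case-insensitive whole-word match against a single named field."""
    value = getattr(doc, field_name)
    if value is None:
        return False
    text = " ".join(value) if isinstance(value, list) else str(value)
    return term.lower() in re.findall(r"\w+", text.lower())
```

Keeping the fields separate is what allows a query such as "author contains Skilling" to ignore an incidental mention of the same name in the body.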

Complete Set of Query Operators

  • Term queries: retrieve all documents that contain a specified term in any specified field(s).
  • Phrase queries: retrieve all documents that contain a specified phrase in any specified field(s).
  • Near queries: retrieve all documents that contain a specified set of terms within a specified proximity of one another.
  • Range queries: retrieve all documents whose dates or other values in specified field(s) are within a specified range.
  • Wildcard queries: retrieve all documents containing a term or phrase in any specified field(s) that matches a specified wildcard pattern (e.g., a query for "SX*" should find documents containing "SX1", "SX23", etc.).
  • Fuzzy queries: retrieve all documents with content similar to a specified term or passage in any specified field(s). For fuzzy terms, these queries can find content where the term is misspelled and can therefore help to ensure completeness of results. For fuzzy passages, these queries can find other documents similar to a document of interest.
  • Boolean queries: retrieve all documents that match any logical combination of the other query operators. Logical combining operators include AND, OR and AND-NOT. For example, body:"Cayman Islands" AND from:(Skilling OR Lay) AND-NOT (to:Fastow OR cc:Fastow) would find all emails that contain the phrase "Cayman Islands", are from a name containing the term Skilling or Lay, and are neither addressed nor copied to any name containing the term Fastow.
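
To make the operators above concrete, here is a minimal Python sketch that evaluates each kind of query against plain strings. It is an illustration only: a real search engine answers these queries from an inverted index rather than by scanning text, and the function names, the dictionary keys (body, from, to, cc), and the 0.8 fuzzy threshold are all assumptions chosen for this example.

```python
import re
from difflib import SequenceMatcher
from fnmatch import fnmatchcase


def tokens(text):
    """Lower-cased word tokens of a string."""
    return re.findall(r"\w+", text.lower())


def term(text, word):
    """Term query: the word occurs as a whole token."""
    return word.lower() in tokens(text)


def phrase(text, words):
    """Phrase query: the tokens of `words` occur consecutively."""
    toks, target = tokens(text), tokens(words)
    n = len(target)
    return n > 0 and any(toks[i:i + n] == target
                         for i in range(len(toks) - n + 1))


def near(text, a, b, distance):
    """Near query: a and b occur within `distance` tokens of each other."""
    toks = tokens(text)
    pos_a = [i for i, t in enumerate(toks) if t == a.lower()]
    pos_b = [i for i, t in enumerate(toks) if t == b.lower()]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)


def wildcard(text, pattern):
    """Wildcard query: some token matches a pattern such as 'SX*'."""
    return any(fnmatchcase(t, pattern.lower()) for t in tokens(text))


def fuzzy(text, word, threshold=0.8):
    """Fuzzy term query: some token is similar enough to `word`."""
    return any(SequenceMatcher(None, t, word.lower()).ratio() >= threshold
               for t in tokens(text))


def example_boolean_query(email):
    """body:"Cayman Islands" AND from:(Skilling OR Lay)
    AND-NOT (to:Fastow OR cc:Fastow)"""
    return (
        phrase(email["body"], "Cayman Islands")
        and (term(email["from"], "Skilling") or term(email["from"], "Lay"))
        and not (term(email["to"], "Fastow") or term(email["cc"], "Fastow"))
    )
```

Range queries are omitted here because they reduce to an ordinary comparison on a date or numeric field, for example start <= created <= end.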

Relevance Ranking

The order of results retrieved from queries should place those documents that most pertain to the subject of the query ahead of those whose relationship is more tangential. Search engines use a number of heuristics for relevance ranking. The most common, called TF-IDF (term frequency-inverse document frequency), weights the query terms so that terms occurring in fewer documents count as more important, and then ranks results higher when they contain more occurrences of the more important query terms. Ranking should also consider matches in the title or subject to be more important than those just in the body, and may take into account many additional factors.
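
The term-weighting heuristic described above can be sketched in a few lines of Python. This is one simple smoothed variant written for illustration, with a subject-line boost added to reflect the point about title matches; the function name rank, the subject_boost parameter and its default of 2.0 are assumptions, and production engines use more elaborate formulas.

```python
import math
import re
from collections import Counter


def tokens(text):
    return re.findall(r"\w+", text.lower())


def rank(query, docs, subject_boost=2.0):
    """Rank documents for a query, best match first.

    docs maps doc_id -> {"subject": str, "body": str}.
    Documents that match no query term are left out.
    """
    n = len(docs)
    weighted_counts = {}
    doc_freq = Counter()              # number of documents containing each term
    for doc_id, d in docs.items():
        counts = Counter(tokens(d["body"]))
        for t in tokens(d["subject"]):
            counts[t] += subject_boost   # subject matches count for more
        weighted_counts[doc_id] = counts
        doc_freq.update(counts.keys())

    scores = {}
    for doc_id, counts in weighted_counts.items():
        score = 0.0
        for t in set(tokens(query)):
            if counts[t]:
                # Rarer terms get a larger inverse-document-frequency weight.
                idf = math.log((1 + n) / (1 + doc_freq[t])) + 1.0
                score += counts[t] * idf
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by document ID for stable output.
    return sorted(scores, key=lambda k: (-scores[k], k))
```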

Exhaustive Queries

Some search engines retrieve only the most relevant documents and messages that match a query rather than the entire result set. For example, Google does not retrieve all results because nobody would ever read all of them anyway; a query for "baseball" will say something like "Results 1-10 of about 125,000,000". For electronic discovery in general, and especially for analysis techniques, this behavior is inappropriate. When seeking all materials containing certain terms, or associated with certain people, one must be sure to get everything. Otherwise important items could be omitted and lost.

Sorting

It should be possible to sort the results of any query either by relevance rank or by any desired field (e.g., date or unique ID). This sorting should apply to the entire result set of the query, not just the results visible on the first results page. Re-sorting generally reissues the query, since the entire result set must be consulted to determine the new top results.
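
The difference between sorting the whole result set and sorting only the visible page is easy to show in code. In this sketch, which assumes results are held as a list of dictionaries and uses a hypothetical helper named sorted_page, the ordering is applied to every result first and the page is cut afterwards.

```python
from operator import itemgetter


def sorted_page(results, sort_field, page=1, page_size=50, descending=False):
    """Sort the ENTIRE result set by one field, then return one page of it."""
    ordered = sorted(results, key=itemgetter(sort_field), reverse=descending)
    start = (page - 1) * page_size
    return ordered[start:start + page_size]
```

Cutting the page first and sorting afterwards would reorder only the items already on screen, which is the behavior the text warns against.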

Stemming or Lemmatization

The search engine should find documents containing morphological variations of query terms automatically. For example, a query for potatoes should find documents containing the singular form potato, and a query for "John goes" should find documents containing the past tense "John went". However, documents containing the same form of the term as entered in the query will frequently rank higher as they are deemed more relevant.
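
A deliberately naive Python sketch of this behavior follows. It reduces words to a base form with a tiny irregular-forms table plus three suffix rules, and scores an exact match above a match on the base form only. The table, the rules and the function names are assumptions invented for this example; real systems rely on established stemming algorithms or dictionary-based lemmatizers, and these toy rules will mishandle many English words.

```python
# Hypothetical, intentionally tiny lookup of irregular forms.
IRREGULAR_FORMS = {"goes": "go", "went": "go", "gone": "go", "going": "go"}

# Suffix rewrite rules, tried in order.
SUFFIX_RULES = (("ies", "y"), ("oes", "o"), ("s", ""))


def base_form(word):
    """Reduce a word to an approximate base form."""
    w = word.lower()
    if w in IRREGULAR_FORMS:
        return IRREGULAR_FORMS[w]
    if len(w) > 3 and not w.endswith("ss"):
        for suffix, replacement in SUFFIX_RULES:
            if w.endswith(suffix):
                return w[: -len(suffix)] + replacement
    return w


def match_strength(query_term, doc_term):
    """2 = same surface form, 1 = same base form only, 0 = no match."""
    if query_term.lower() == doc_term.lower():
        return 2
    if base_form(query_term) == base_form(doc_term):
        return 1
    return 0
```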

Performance

Analysis techniques frequently involve performing many successive searches. To do this efficiently, it is important that queries return quickly.

Derived Metadata

Some metadata is directly attached to or embedded within electronic documents and emails, such as author or sender, created date, recipients, title or subject, certain email threading relationships, etc. Intelligent software can analyze this direct metadata, along with document content, and automatically derive additional useful metadata. Examples of such derived metadata include key phrases, contextual groups, directory information, categories and clusters.

Key phrases summarize the content of a document by identifying those words and phrases within the document that best capture its overall topic and unique statements. Research articles are frequently classified by a set of key phrases. Key phrases are most useful if they are applied consistently, i.e., if a key phrase is associated with any document, then it is associated with every document to which it applies. When this is true, key phrases can be a powerful classifying structure that can help one to understand the topics, vocabulary and connections within the collection being analyzed. These same notions can be applied to results of any query; in this case, key phrases can be used to quickly determine a summary and to classify the entire set of content retrieved.

Documents, and especially emails, are generally authored within some context. The best example of this is a reply to an email message, or in general any email within a conversational thread. Context can similarly be determined for documents, for example when documents are attached to email messages. Organizing individual messages and documents into contextual groups, e.g., by assigning a contextual group metadata field to each item, has two significant benefits. The first is to more fully understand the intended meaning of the item in question. For example, a short email may appear in isolation to have an inappropriate meaning, while considered within the full context of its authorship it may be clear and benign. The second use of contextual groups is to find individual items that are not rich in content. For example, a truly inappropriate message might be a one-liner that contains a directive but not explicit content words. Finding and properly interpreting such a message can be difficult, but the message's context may provide the beacon. Searches for words and phrases characteristic of the topic will find other messages and documents within the one-liner's context, and exploring the contextual group will then quickly surface the important item.
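
One simple way to derive a contextual group for email is to follow reply links back to the first message of the thread. The Python sketch below assumes each message is a dictionary with an id and an optional in_reply_to key; those key names and both function names are assumptions for illustration, and real threading also has to cope with missing or inconsistent headers.

```python
def assign_context_groups(messages):
    """Map each message id to the id of the root message of its thread."""
    parent = {m["id"]: m.get("in_reply_to") for m in messages}

    def root(message_id):
        seen = set()
        # Walk up reply links while the parent is in the collection.
        while parent.get(message_id) in parent and message_id not in seen:
            seen.add(message_id)          # guard against malformed cycles
            message_id = parent[message_id]
        return message_id

    return {message_id: root(message_id) for message_id in parent}


def expand_to_context(hit_ids, groups):
    """Given search hits, return every message sharing a group with a hit."""
    hit_groups = {groups[h] for h in hit_ids}
    return sorted(m for m, g in groups.items() if g in hit_groups)
```

This mirrors the second benefit described above: a content-rich message found by keyword search pulls in the terse one-line reply that no keyword query would have matched.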

Enterprises generally maintain a directory of their employees, usually in Microsoft Active Directory, some other LDAP-based directory server, or within their Human Resources application. These directories contain a substantial amount of information that can help to interpret who had access to each item in the collection and how those people are interrelated. Directories generally identify personal names, department groupings, email addresses, mailing list memberships and login names, among other useful information. For example, such information is invaluable for determining who received an email that was sent to a mailing list. Directory information can be used to extend the values of certain explicit metadata fields (e.g., recipients of a message) or to create entirely new metadata (e.g., departments that created or received content).

One pitfall in using this directory data is that in large organizations the directory may be continually changing, as people come and go, transfer among departments, join or depart mailing list groups, etc. Some records retention systems will capture this data when documents or emails are created and stored. In other cases, it may be possible to get a sufficiently contemporaneous copy of the directory to be useful, but one must understand that some conclusions drawn from it may be incorrect. Even in this case, the information can be useful to organize and navigate the collection and determine what is probably true, subject to definitive verification by other means.
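
The following Python sketch shows how mailing list recipients might be expanded using directory data, taking the date of the message into account so that membership changes are respected. The membership records, the example.com addresses and the function name expand_recipients are all invented for this illustration; an actual directory export will have its own structure and may not record membership dates at all.

```python
from datetime import date

# Hypothetical membership history: (list, member, joined, left or None).
MEMBERSHIPS = [
    ("finance-team@example.com", "alice@example.com", date(2000, 1, 1), None),
    ("finance-team@example.com", "bob@example.com", date(2000, 1, 1), date(2001, 6, 1)),
    ("finance-team@example.com", "carol@example.com", date(2001, 7, 1), None),
]


def expand_recipients(recipients, sent_on, memberships):
    """Replace mailing list addresses with their members as of sent_on."""
    list_addresses = {m[0] for m in memberships}
    expanded = set()
    for recipient in recipients:
        if recipient not in list_addresses:
            expanded.add(recipient)       # an ordinary individual address
            continue
        for list_address, member, joined, left in memberships:
            if (list_address == recipient and joined <= sent_on
                    and (left is None or sent_on < left)):
                expanded.add(member)
    return sorted(expanded)
```

When only a single undated snapshot of the directory is available, the same expansion can still be run, but its output should be treated as probable rather than definitive.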

Documents and emails in an electronic discovery collection can be cataloged in many different ways. These categorical structures can be helpful in understanding and navigating both the collection as a whole and the large results lists generated by search queries. In general, the different metadata fields can be used as categories. For example, content can be categorized by who created it, who had access to it, when it was created, which key phrases it contains, which contextual group it belongs to, which topical clusters it belongs to, etc. See the section below on Guided Navigation to fully understand the value of these categorizations.

Clustering technology can be used to automatically group documents and emails together that pertain to the same topic. Such a group of same-topic items is referred to as a "cluster." Clusters are most useful if they are appropriately named, i.e., if they have a label that is meaningful to analysts and reviewers. Naming clusters effectively is frequently more difficult for software than creating the clusters in the first place. Clusters can also be hierarchical, where a set of, say, a dozen top-level clusters organizes all of the content into broad topical areas, with the items within each broad topic further clustered into subtopics, each of those subtopic clusters further refined into sub-subtopic clusters, etc. Such a hierarchical cluster structure can provide a useful topical map of the entire collection, or of the results of a particular search query. These hierarchical structures can also be helpful to organize review activities so that individual reviewers can focus on specific topics relevant to the case.

One pitfall in using clustering technology is that it is intrinsically heuristic. Just because an item is in a cluster does not mean that a person would associate that item with the topic indicated in the label of the cluster. Even if an item is not in a cluster, a person might deem the item to be highly relevant to the topic of the cluster. Labels used to name clusters might seem unintuitive or a strange combination of concepts to analysts and reviewers. Clusters should be used as a helpful first pass topical categorization, not as a set of definitive conclusions.
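
To show both the idea and its heuristic nature, here is a small Python sketch of single-pass clustering: each document joins the existing cluster whose vocabulary it overlaps most, provided the overlap passes a threshold, and otherwise starts a new cluster. The stopword list, the 0.25 threshold, the labeling rule and all function names are assumptions chosen for this example, and this is far simpler than the methods commercial tools use.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "to", "and", "in", "is", "for", "on"}


def terms(text):
    """Set of lower-cased content words in a text."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}


def jaccard(a, b):
    """Overlap of two term sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(docs, threshold=0.25):
    """Greedy single-pass clustering. docs maps doc_id -> text."""
    clusters = []                         # each: {"ids": [...], "terms": Counter}
    for doc_id, text in docs.items():
        doc_terms = terms(text)
        best, best_similarity = None, threshold
        for candidate in clusters:
            similarity = jaccard(doc_terms, set(candidate["terms"]))
            if similarity >= best_similarity:
                best, best_similarity = candidate, similarity
        if best is None:                  # nothing similar enough: new cluster
            best = {"ids": [], "terms": Counter()}
            clusters.append(best)
        best["ids"].append(doc_id)
        best["terms"].update(doc_terms)
    return clusters


def label(one_cluster, size=2):
    """Name a cluster after its most widely shared terms."""
    ranked = sorted(one_cluster["terms"].items(), key=lambda kv: (-kv[1], kv[0]))
    return ", ".join(term for term, _ in ranked[:size])
```

Changing the threshold or the order in which documents arrive can change the clusters, which illustrates why such output is best treated as a first-pass categorization.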

Guided Navigation

Guided Navigation is a powerful analysis tool for exploring a collection or query result set when the items have been categorized in multiple ways. Contrast the following:

  • Typical search system: You start a new case and have your first access to collected material. You are not certain of the important topics, key players, important dates, or specific vocabulary in use. You think you have an idea of what might be important and so try out a few simple queries to see what you find. For each query, you get back either zero results or thousands of results. When thousands are returned, you can see maybe 50 per page with short summaries and some relevant metadata. You scan the result lists and try to guess what might be helpful, maybe reading a few items. You have no idea what might lurk in the thousands of results you never see, or behind queries you don't know to formulate. You quickly give up and start an exhaustive manual review process.
  • Guided navigation: You start a new case and have your first access to the collected material. You see several different catalogs of all the content, identifying the most important people, time frames, key phrases, topics, departments involved, specific monetary figures mentioned, etc. You drill into any category of interest, say an important person, and you immediately see a more refined cataloging of all materials associated with this person. You drill in further to a key phrase or topic of interest and see how that person is connected to that issue. You quickly discover vocabulary in use, the range of topics in play, who the key people are, how topics and people are interconnected. You start mixing your exploration of the categories with freely expressed search queries that seek items you begin to realize are important. For every query you enter, you quickly understand the structure of its complete result set, again seeing and exploring the people most important among those result items, the date ranges they fall in, the key phrases and topics that pertain to them, etc. You quickly discover both the big picture of what is going on and find specific items of importance.

In general, guided navigation presents multiple categorical organizations of the entire collection being analyzed, of a particular sub-collection such as an individual reviewer's folder, or of the results of any search query. Categories can include people, authors, recipients, departments, date ranges, key phrases, topics, contextual groups, monetary figures, etc. Within each category, e.g., people, a relevance ranked list of matches is presented, e.g., the specific people most important to this particular set of documents and emails. The presentation of relevant categories and matches provides the "guided" part of "guided navigation." These help the user to understand the collection, sub-collection or result list as a whole so that he or she can decide whether or not it is important and what parts should be explored urgently. It is possible to then drill down into any category match, e.g., a particular topic or date range. That produces a smaller sub-collection with its category structure, in turn supporting further drill down. At any point, it is possible to start afresh from the whole collection or issue a new query against the whole collection, or one can selectively "back up" by undoing any prior drill-down. The ability to drill-down and back up provides the "navigation" part of guided navigation.
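
The mechanics described above (category counts, drill-down, and backing up) can be sketched compactly in Python. The Navigator class, its method names and the item layout are assumptions made for illustration; here matches within a category are ordered by a plain document count rather than a true relevance rank.

```python
from collections import Counter


def values_of(item, field):
    """Return a field's values as a list, whether stored singly or as a list."""
    value = item.get(field, [])
    return list(value) if isinstance(value, (list, tuple, set)) else [value]


class Navigator:
    """Faceted view over a collection with drill-down and back-up."""

    def __init__(self, items):
        self.items = items
        self.path = []                    # drill-downs applied so far

    def current(self):
        """Items that satisfy every drill-down on the path."""
        selected = self.items
        for field, value in self.path:
            selected = [i for i in selected if value in values_of(i, field)]
        return selected

    def facets(self, fields, top=5):
        """For each category, the most frequent values in the current set."""
        summary = {}
        for field in fields:
            counts = Counter()
            for item in self.current():
                counts.update(values_of(item, field))
            summary[field] = counts.most_common(top)
        return summary

    def drill(self, field, value):
        self.path.append((field, value))

    def back(self):
        """Undo the most recent drill-down, if any."""
        if self.path:
            self.path.pop()
```

Passing the results of a search query as items gives the same categorized view of a result set that the text describes for the whole collection.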

Visualization

Everyone knows a picture is worth at least a thousand words. Relationships between people, between items in the collection, between topics, or between any other items available for analysis can be invaluable to fully understand the case and are best presented visually.

The following are example useful visualizations that software tools can create:

  • Social network analysis. For the collection as a whole, or any sub-collection (e.g., the results of a search query), show the most important people and the communications links between these people. This information is useful directly to identify key people and their relationships, e.g., to decide whom to interview about what. It also provides a further useful categorization of the content, as it is possible to drill down into the different nodes (circles in the illustration) and relations (lines in the illustration) to see the specific documents and items that connect people to one another and to a topic of interest. This extends the notion of guided navigation into the realm of more complex and particularly useful relationships that are best presented and understood visually.
  • Context group structure. The documents and messages that collectively form an entire conversation or other contextual group typically have an interesting internal structure. For example, for a collection of emails, each message may be replying to or forwarded from a prior message. The participants involved in the conversation may change as authors add or delete recipients, e.g., to have a more private side conversation elaborating on some aspect of the discussion. The structure of replies, forwards, attached and distributed documents, and changing participants can best be presented and quickly understood visually.
  • Topical cluster analysis. Topical clusters of documents and messages have interrelationships and internal structure that are best presented visually. The subtopic relationship in this illustration is represented by a circle or dot within a larger circle or dot that represents the broader topic. Clusters at the same level are related by semantic similarity; circles representing clusters are drawn closer together if the topics they represent are more closely related. The lines in the figure through the circles represent the key phrases that are common to all the clusters on that line. This presentation of nested circles, dots, and lines provides a topical map of the content that can be quickly grasped and navigated. It is possible to then drill into any cluster to further explore, and further sub-cluster, the specific documents and items pertaining to that topic.
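
The data behind a social network view like the one in the first bullet can be derived directly from sender and recipient metadata. The Python sketch below counts messages exchanged between each pair of people and totals each person's links; the email dictionary layout and the function name communication_graph are assumptions, and drawing the resulting graph is left to whatever visualization tool is in use.

```python
from collections import Counter


def communication_graph(emails):
    """Build undirected, weighted links between correspondents.

    emails: list of dicts with "from", "to" and optionally "cc".
    Returns (edges, degree): edges maps a sorted name pair to the number
    of messages linking them; degree maps a person to their total links.
    """
    edges = Counter()
    for email in emails:
        sender = email["from"]
        for recipient in email["to"] + email.get("cc", []):
            if recipient != sender:       # ignore messages to oneself
                edges[tuple(sorted((sender, recipient)))] += 1

    degree = Counter()
    for (person_a, person_b), weight in edges.items():
        degree[person_a] += weight
        degree[person_b] += weight
    return edges, degree
```

People with the highest totals are natural candidates for the largest nodes, and keeping the message IDs alongside each count would support drilling from a line in the picture to the underlying items.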

Conclusion

Numerous analytical tools exist that can add substantial value, improve efficiency, and reduce overall costs in an electronic discovery investigation. These tools range from generic tools built without any thought of legal investigations to specialized forensic, concept-mapping, guided navigation, and visualization tools specifically designed to speed and facilitate electronic discovery. Frequently, people involved in analysis use more than one analytical tool. Separate analysis tools rarely integrate with one another and are almost never founded upon consistent paradigms and assumptions. It is important, then, to understand the assumptions you and your tools are making, to understand their limitations, and to proceed carefully with these factors in mind. Using an integrated analysis tool suite specifically designed for electronic discovery simplifies these issues.

Source: EDRM (edrm.net)