eDiscovery Processing: Scoping Electronic Discovery Projects

Pre-planning is an important component of any good project management methodology, and the same is true for the methodology of managing electronic discovery processing projects. Decisions made in the identification, preservation, and collection phases of the electronic discovery life cycle dictate some of the requirements and activities that will be undergone during the processing phase.

For example, the types of data preserved and collected, the amount of data contained in those collections, and the time frames for production have typically been decided before the electronic data is processed or a provider is engaged to process it. Other influences over the requirements for the processing phase are the needs of the review and production phases. Items such as the schedule for review and the technologies used to review the data influence the way the data must be processed. Each of the decisions made in these areas drives the cost of the electronic discovery processing project. By taking each of these elements into consideration prior to undertaking the processing phase, the consumer can make informed decisions in line with the legal and financial objectives of the project.

Determining Scope

Electronic documents are discoverable under a number of statutes, including the Federal Rules of Civil Procedure, states' rules of civil procedure, federal criminal rules, state criminal rules, and case law addressing electronic documents. They are an increasingly large factor in internal and governmental investigations. The amendments to the Federal Rules that went into effect in December 2006 clarify the discoverability of electronic data by adding (in modifications to Rule 34) the requirement to produce, in addition to "Documents," "Electronically Stored Information." Before this rule took effect, the requirement to produce electronic data had been the result of a consistent interpretation by the courts of the word "documents" as inclusive of electronic data compilations. The modification expanded the definition of what parties are able to request of each other to include "any designated electronically stored information - including writings, drawings, graphs, charts, photographs, sound recordings, images, and other data or data compilations in any medium..."

Typically, in a meet and confer session, the legal team (outside and in-house counsel) determines the list of custodians, the types of data, the form or forms in which it will be produced and timeframes that are required for the matter at hand. Each type of data available within the appropriate timeframe must be processed, reviewed and produced for each custodian. For example, for a particular matter, there may be 100 custodians who have data collected from their email, hard drive, and shared network location over a six-month period. Each piece of data must be accounted for in every stage of electronic discovery processing.

What is commonly known about the electronic discovery project by the time of processing? Typically, we know the list of custodians that have data relevant to the matter. We know the size of the dataset, at least with regard to quantity and type of media (e.g., number of tapes, CDs, DVDs, etc.). The size of the dataset can be a good determinant of cost. By employing tools, we can analyze the type and contents of that media. And by understanding the goals and requirements for the review, we can determine the processing specifications. These are the factors that determine how much data will be processed and when. Corporations and their law firms can structure their reviews based on this analysis. These pieces of the puzzle help consumers control and manage their electronic discovery projects.

The requirements for processing electronic discovery data can range from the simple to the extreme and from small datasets to large volumes. The schedules for completing the processing can also run from short (days) to long (months) durations. Because these variables range so widely, determining the scope of the task in terms of the types and size of the data is critical in identifying the resources needed to complete the processing.

Typically, the fixed variable is the schedule. Large jobs that must be processed in short periods have resulted in costly fines and awards when the resources to perform the processing were not available. The processing environment (whether it requires sequential or distributed parallel processing, and the degree to which it can scale across one or more processing jobs) can also be critical in determining the scope of processing to be performed as well as the schedule. In short, dataset sizes and types, timeframes for processing, and the complexity of the processing requirements drive the scope of the project. This scope will ultimately drive the cost. (See Cost Drivers.)

Dataset Sizes and Types

Establishing the custodian list is another important step in scoping an electronic discovery project. The number of custodians can (and typically will) change several times during the engagement. For purposes of authentication and chain of custody, mapping users to the data source is critical. Since each custodian has the potential to generate many gigabytes of responsive data, the number of custodians is a key variable in determining the overall magnitude of any electronic discovery task.

Coordination among the legal team, the collection team, the processing team and the review platform team is critical to success. It is often appropriate to have a single point of contact coordinate the custodian list, to ensure consistency in naming conventions and in updates to the list.

The main categories of data types (i.e., email, hard drive, and shared network drives in the above example) are contained on specific pieces of media. The physical location of the source data as well as the storage location of processed data files must be traceable. Examples of types of media are CD-ROMs, DVDs, magnetic media (e.g., DLT and LTO tapes), email and file servers, cell phones, thumb drives, etc. Within these categories of data types, there are a variety of file types present. Each of these will be a determining factor in the cost and time it takes to process the data. There are many different technologies to help consumers get a handle on what is contained on their electronic media. Which technologies are applied for a given matter will be driven by the processing delivery requirements.

Media Analysis

Media analysis involves expert consulting services as well as in-house and third-party tools designed to analyze media content before processing begins. Media analysis provides clients with industry-standard best practices for developing pre-processing requirements, based on media analysis reports that capture such information as total file counts, file types, numbers of files per file type, size per file extension, and total project size. It may also involve native file review, hard-copy blowback of endorsed TIFFs, or pre-processing de-duplication that often greatly reduces the size of a data set to be processed and produced.
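As a rough illustration of the kind of media analysis report described above, the following Python sketch walks a directory tree and tallies file counts and sizes per file extension. It assumes the media has already been mounted or copied to a local path; the function name summarize_media and the report layout are hypothetical and do not come from any particular media analysis tool.

```python
import os
from collections import defaultdict


def summarize_media(root):
    """Tally file counts and total bytes per extension under `root`.

    Hypothetical helper for illustration only; real media analysis tools
    identify file types by content, not just by extension.
    """
    per_ext = defaultdict(lambda: {"files": 0, "bytes": 0})
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Group by lower-cased extension; files without one go to "(none)".
            ext = os.path.splitext(name)[1].lower() or "(none)"
            per_ext[ext]["files"] += 1
            per_ext[ext]["bytes"] += os.path.getsize(path)
    return {
        "total_files": sum(v["files"] for v in per_ext.values()),
        "total_bytes": sum(v["bytes"] for v in per_ext.values()),
        "by_extension": dict(per_ext),
    }
```

Calling summarize_media on a collection folder returns the total file count, the total project size in bytes, and a per-extension breakdown, which are the raw figures a pre-processing report would typically start from.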

The size of a data set can be misleading. One might think that the larger a collection is, the longer it will take to process. However, the blend of file types significantly drives its processing requirements. Of two 10 GB data sets, one may contain 70% spreadsheet files and the other only 5%. The collection with the larger percentage of spreadsheet files will take much longer to process and will probably result in more pages for attorneys to review. Likewise, identifying large numbers of system files or commercial files would immediately affect the scope of any project.
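The effect of file mix on processing time can be sketched with a simple calculation. The throughput figures below are purely hypothetical placeholders (they are not industry benchmarks and do not come from the source); in practice they would be replaced with measured rates for the processing environment in use.

```python
# HYPOTHETICAL throughput in GB per hour, for illustration only.
# Replace with rates measured in the actual processing environment.
ASSUMED_GB_PER_HOUR = {"spreadsheet": 0.5, "other": 2.0}


def estimate_hours(gb_by_type, gb_per_hour=ASSUMED_GB_PER_HOUR):
    """Estimate processing hours for a data set given GB per file category."""
    return sum(gb / gb_per_hour[kind] for kind, gb in gb_by_type.items())


# Two 10 GB data sets: one is 70% spreadsheets, the other 5%.
set_a = {"spreadsheet": 7.0, "other": 3.0}
set_b = {"spreadsheet": 0.5, "other": 9.5}

hours_a = estimate_hours(set_a)
hours_b = estimate_hours(set_b)
```

Under these assumed rates the spreadsheet-heavy set takes several times longer to process than the other, even though both are the same size on disk.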

Media analysis can help to identify the structure and makeup of a data collection, thus determining the real scope of a project and establishing accurate time frames for its processing. Pre-processing media analysis may help to refine requirements for handling complex data types (such as documents and spreadsheets embedded in e-mails) or projects that involve several stages of review (such as an initial, native file review and subsequent TIFF image production). Issues like the scalability of the processing environment, including the need for and ability of tools to segregate redundant files and handle embedded or nested data types, must be clearly understood and closely matched with the delivery requirements for any matter.

Media analysis tools provide clients with reports on file information early in the electronic discovery process, allowing them to develop the best strategy for effective processing. Media analysis tools, from IT directory listings and analysis on file sets to third-party applications with advanced reporting capabilities, can also gather data samples so as to apply benchmarks for accurate projection of the data set that is finally produced.

The variety of media analysis tools, with relative success rates, variant metrics, and diverse and changing technologies, underscores the value of a media analysis team that properly applies standards and measurements. The determination of a correct media analysis process and tool set, based on the scope and requirements of the project, is critical to success. Selective media analysis with proper measurement, testing, and implementation is the key to withstanding any Daubert challenge if, or when, it arrives.

Processing Specifications

Processing specifications drive the scope and therefore the timeframe and cost of processing and reviewing electronic data. Agreement must be reached not only on what data is to be processed, but also on what the input and output format of that data will be. Chief among those specifications is the type of output that the review requires and what processing will be required to provide that output.

There are several options for output from files: the metadata associated with the file, the searchable text contained in the file, the native file itself, as well as a rendering or image of the file. Each of these components contains vital information about the file that is being reviewed, and depending on the review system and type of review the law firm or corporation is undertaking, any combination of output can be provided.

For example, in the case of an internal review by in-house counsel, a corporation can elect to review the metadata and text associated with a collection of files to determine if a more extensive review is warranted. In contrast, for a Hart-Scott-Rodino (HSR) Second Request, where review and production deadlines are mandated by regulation and time frames are tight, a review team may need every form of output for responsive files to be produced in the first pass. Review needs and matter specifics guide the law firm's and corporation's choices for output.

Since processing specifications can aid reviewers by providing fielded information that can be used as a first pass at tagging and coding data, legal teams have additional points to consider with respect to review needs. For example, it is possible to determine during the processing phase which documents were sent to or received by a specific email address. If email was sent to or received by counsel, those documents could be provided to the review team already marked as potentially privileged. By considering these options, the review team can save time, and therefore cost, in its review.
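A first-pass privilege flag of this kind might look like the following sketch. It assumes each message has already been reduced to a dictionary with "from", "to", and "cc" fields; that record layout, the function name, and the "potentially_privileged" field are hypothetical, and a flag like this only marks documents for attorney attention rather than deciding privilege.

```python
def flag_potentially_privileged(messages, counsel_addresses):
    """Return copies of `messages` with a 'potentially_privileged' flag.

    A message is flagged when any sender or recipient address matches a
    known counsel address (case-insensitive). Field names are assumptions.
    """
    counsel = {addr.strip().lower() for addr in counsel_addresses}
    flagged = []
    for msg in messages:
        participants = (
            [msg.get("from", "")]
            + list(msg.get("to", []))
            + list(msg.get("cc", []))
        )
        hit = any(p.strip().lower() in counsel for p in participants)
        # Copy the record so the input list is left unmodified.
        flagged.append({**msg, "potentially_privileged": hit})
    return flagged
```

Matching on exact addresses is deliberately conservative; a real specification might also match on law firm domains or on names in a counsel list supplied by the legal team.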

One of the benefits of processing data automatically is the ability to hone datasets to a more manageable and responsive set, based on a specific set of criteria. Data culling options are tools that shape the electronic discovery project significantly. The effectiveness of the culling strategies employed is one of the most significant contributors to overall project cost. Any data that can be culled from a collected data set - and therefore does not require attorney review - saves significant review costs. Many times the culling rates exceed 90%; in large data collections, incremental percentages of culling effectiveness can save millions of dollars in review costs.

There are several options that are widely accepted and used to manage increasingly large collections of electronic information. The goal of data culling is to leverage technology in a legally defensible manner and reduce the collected material to a more manageable set of potentially relevant documents for review. No culling methodology will be considered reasonable if it operates to exclude relevant documents before review.

Technologies have been developed to assist in this culling of data: suppression of duplicates (called de-duplication); filtering of documents based on their attributes (metadata), such as date ranges and file types; and searching for key words, phrases, and concepts. During this pre-planning stage, the choice of culling options to be employed will direct assumptions regarding how much of the data collection will be segregated prior to review, affecting subsequent processing and review costs. (See searching sections for a more in-depth review of options available.)
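The culling options above (metadata filters, keyword search, and de-duplication) can be combined in a short sketch such as the one below. The document record layout ("content", "date", "ext") and the function name cull are hypothetical. De-duplication here uses a SHA-256 hash of the raw file content, which is one common approach but an assumption of this example; email de-duplication in particular often hashes selected metadata fields instead.

```python
import hashlib
from datetime import date


def cull(documents, start, end, allowed_exts, keywords):
    """Return documents that pass date, file type, and keyword filters,
    keeping only the first copy of each distinct content hash."""
    seen_hashes = set()
    kept = []
    wanted = [k.lower() for k in keywords]
    for doc in documents:
        # Metadata filters: date range and file type.
        if not (start <= doc["date"] <= end):
            continue
        if doc["ext"].lower() not in allowed_exts:
            continue
        # Simple keyword search over the decoded content.
        text = doc["content"].decode("utf-8", errors="ignore").lower()
        if not any(k in text for k in wanted):
            continue
        # De-duplicate last, so a copy that fails a metadata filter
        # cannot suppress an identical copy that passes it.
        digest = hashlib.sha256(doc["content"]).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        kept.append(doc)
    return kept
```

The order of operations is a design choice that should be recorded in the processing specifications, since de-duplicating before or after filtering can change which copy of a document reaches review.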

Once decisions are made regarding what will be processed, there are myriad processing options available for the resulting files in the dataset. Some of these options refer to how a document is presented. For example, an email message can be presented in text format or it can be presented in a format consistent with the look of the file in its native email application.

Also, clients have an option to determine how time zone dates and times are represented based on where the data was collected. One method of handling time zones is to normalize the entire dataset to a standard time zone, such as GMT or the time zone where the review team or corporation is located. Technologies also exist in the market to customize the time zones for subsets of the data collection, such as by user, using a time zone the legal team feels represents the data most appropriately.
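Normalizing timestamps to a single standard zone can be sketched as follows. This example makes two simplifying assumptions: each custodian's collection location is represented by a fixed UTC offset (so daylight saving changes are ignored), and UTC is used as the standard zone in place of GMT. The custodian names and offsets are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# HYPOTHETICAL mapping of custodians to fixed UTC offsets (in hours)
# for the locations where their data was collected.
CUSTODIAN_UTC_OFFSET = {"custodian_a": -5, "custodian_b": 1}


def normalize_to_utc(local_naive, utc_offset_hours):
    """Interpret a naive local timestamp at a fixed offset and convert to UTC."""
    source_tz = timezone(timedelta(hours=utc_offset_hours))
    return local_naive.replace(tzinfo=source_tz).astimezone(timezone.utc)


def normalize_for_custodian(local_naive, custodian):
    """Normalize a timestamp using the custodian's assumed collection offset."""
    return normalize_to_utc(local_naive, CUSTODIAN_UTC_OFFSET[custodian])
```

A production workflow would more likely use named time zones so that daylight saving transitions are handled correctly; the fixed offsets here only keep the sketch self-contained.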

There is also a collection of processing options related to how documents can be presented in image form. Often, files (particularly email files) will contain attached or embedded files. These files must also be identified and processed while maintaining parent/child relationship hierarchies.
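One way to preserve parent/child hierarchies is to assign every extracted item its own document ID along with the ID of its immediate parent. The sketch below assumes the email and its attachments have already been expanded into a nested dictionary with "name" and "children" keys; that structure, the function name, and the DOC-000001 numbering format are hypothetical.

```python
from itertools import count


def flatten_family(item, parent_id=None, _ids=None):
    """Flatten a nested item into rows that record each parent/child link.

    `item` is a dict with a 'name' and an optional list of 'children'.
    IDs are assigned depth-first so a parent always precedes its children.
    """
    ids = _ids if _ids is not None else count(1)
    doc_id = f"DOC-{next(ids):06d}"
    rows = [{"doc_id": doc_id, "parent_id": parent_id, "name": item["name"]}]
    for child in item.get("children", []):
        rows.extend(flatten_family(child, doc_id, ids))
    return rows
```

For an email carrying a ZIP archive that in turn contains a spreadsheet, this produces three rows, with the archive pointing to the email and the spreadsheet pointing to the archive, so the whole family can be kept together through review and production.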

Determination must also be made for the handling of files containing algorithmic data or formulae, such as Excel spreadsheets and Access databases. The intent of these options is to represent the file in a way that is meaningful to a review team, while ensuring that all of the data is represented. For example, the review of a spreadsheet in its native format can be much more meaningful to a reviewer than would be a converted format (e.g. TIFF or PDF) because of the accessibility of hidden formulas within the fields.

Similar problems exist for processing animated files such as PowerPoint presentations. Care must also be used when presenting these to a review team. By providing the team with options in the scoping phase, review teams can make decisions that will affect the ease of document review and ensure they receive the data in the manner that they expect.

Examples of Processing Options by File Type:

  • All Files
    • Threshold files larger than specified size
    • Auto date function options
    • Header and Footer options
    • Provide in image or native format
  • Email Files
    • Text representation
    • Native representation
    • Display archive and path on image representation
  • Word Processing Files
    • Reveal comments
    • Show edits and tracked changes
    • Paper size and orientation options
  • Presentation Files
    • Print speaker notes or slides only
    • Objects auto fit to printed page
  • Spreadsheet Files
    • Eliminate blank pages
    • Print area options
    • Reveal comments
    • Format custom column titles


The goal of scoping the processing specifications is to inform the legal team of the available options and to ensure that the dataset is treated in a manner consistent with the needs of the matter and the review. It is optimal to iron out all issues prior to processing the data. However, the scope of litigation can change quickly. As part of the change management process, it is critical that the processing specifications be meticulously documented throughout the electronic discovery process. If changes are made midstream, the date, specific change and person authorizing the change should be documented. Good change management will ensure that the data is handled in a legally defensible manner.
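A minimal sketch of such a change record is shown below. It captures the three items named above (the date, the specific change, and the person authorizing it) in an append-only log. The class and field names are hypothetical, and a real project would keep this record in whatever matter management system the team already uses.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SpecChange:
    """One immutable entry in the processing specification change log."""
    changed_on: date
    description: str
    authorized_by: str


class ProcessingSpecLog:
    """Append-only log of changes to the processing specifications."""

    def __init__(self):
        self._changes = []

    def record(self, changed_on, description, authorized_by):
        # Refuse incomplete entries so every change stays attributable.
        if not description.strip() or not authorized_by.strip():
            raise ValueError("description and authorized_by are required")
        entry = SpecChange(changed_on, description, authorized_by)
        self._changes.append(entry)
        return entry

    def history(self):
        """Return all recorded changes in the order they were made."""
        return tuple(self._changes)
```

Making the entries immutable and exposing the history as a tuple is a small design choice that keeps earlier entries from being edited after the fact.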

Estimating Scope and Delivery Schedule

It can be difficult to estimate time and cost in the earliest phases of an electronic discovery project, since it is hard to know what you have until you actually collect and analyze it. Furthermore, cost estimates during the processing phase pose a challenge, since specifics of the matter, such as the list of custodians, may change as the data is collected and more is known about the case. Expect the time and cost estimate to change as you progress in the electronic discovery process.

However, by employing good planning techniques, understanding the data to be processed, and understanding the processing options and impacts of those choices on the delivery schedule, estimates can be made using industry benchmarks (or client specific benchmarks where available). Certainly as the scope changes, these estimates must be adjusted. Staying informed during the electronic processing project will keep everyone up to date as to the project's status and change management implications.

Litigation sometimes seems to move at the speed of light. Electronic discovery projects are often initiated under tight timeframes and intense scrutiny. These factors provide even more reason to gather as much information as you can at the beginning of the process. By collecting the information discussed above, corporations and their law firms can arm themselves with information about the data that can assist them in making cost effective and legally savvy strategic decisions.

By being proactive during the identification phase, corporations can be sure to get all relevant data as it pertains to the investigation (or any electronic discovery need, either internal or external), as well as eliminate irrelevant data. This will allow for better time and cost estimates for processing, while mitigating the risks of potential sanctions due to missed evidence, spoliation, or missing deadlines due to processing more data than is necessary.

Proactive preparation for potential electronic discovery requirements prior to the initiation of specific electronic discovery requests is also possible. Some corporations have begun to apply collection processes to electronic data in advance of a specific event, for the purposes of compliance. It is important to keep in mind that for organizational discovery (where a company manages its electronic assets with electronic discovery tools prior to a specific litigation event), the collection and processing of the data may be done initially without knowing what the specific requirements for a case might be.

In these cases, organizations must establish defined processing goals that are broad enough to meet both their internal requirements (archiving, indexing, and compliance) as well as external requirements to preserve the data should it be required for a case.

Source: EDRM (edrm.net)