eDiscovery Processing: Stages

Data can be received by an electronic discovery processing team in many forms--on various media or via electronic transmission of the data. Because electronic documents may be evidence, they must be authenticated before being admitted at trial, thereby making a meticulously documented chain of custody necessary. Upon receipt, electronic data must be tracked and managed on both a physical and a file level.

Once the electronic evidence is received, properly accounted for, and documented, it must be restored to a computer systems environment for processing. This process can be as simple as plugging a hard drive into a processing network or copying the contents of a CD or DVD to a central location, or it can be an involved process such as tape restoration.

An important aspect of dealing with electronic data is how the various processing options may affect the metadata of restored, moved, or copied data. Because electronic data can take many forms and may be contaminated with viruses, it is important to isolate the source data and keep it off any active network or processing platform until the viability of the data has been determined. Data is pulled off of tape exactly as it was stored on the source media before being backed up. Therefore, if a file was infected with a virus before being backed up to tape, the same virus will exist once the file is restored. Understanding how viruses are handled will inform the legal team as to the integrity of the metadata available and provided for review.

Tape Restoration

Electronic information can be stored in any number of places. As companies seek to find ways to manage critical corporate information, the challenges associated with the retention and destruction of electronic information have led to a variety of approaches for meeting document retention requirements. As business records and electronic communications outlive their day-to-day usefulness, that information theoretically is either destroyed or moved from the corporate servers to media types that may or may not be organized in a way that facilitates later search and retrieval. By and large, information stored on such media is not as easily searched and reviewed as information sitting on an end-user computer or a corporate server. However, electronically stored information responsive to a particular matter is just as likely to be located on one of these tapes as on the computer of a specific employee. Tape restoration, therefore, can be an important element of a case, particularly in cases that involve legacy data.

Data no longer in use might be found on backup tapes. Backup tapes are intended for disaster recovery, not data retention, so access to this data can often be difficult, which may affect the timeframes and costs of the project.

Much of the legal discussion on this topic relates to the extent to which electronically stored information is identified as "accessible" versus "inaccessible." This is an important criterion, as a determination that electronically stored information is not "reasonably accessible" may lead a court to consider transferring the cost of production from the responding party to the requesting one.

Due in large part to its portability, durability, and low cost per MB, magnetic tape is the most widely used form of media for the archiving of data, and has been for approximately four decades. Considering these and numerous other benefits, it is easy to fall into the trap of believing tape is a fail-safe technology, where one simply has to place a tape into a drive to retrieve the contents. Nothing could be further from the truth.

Nothing can be written to backup tape media without software controlling the process. As the volume of data being archived has grown tremendously over the past decade, developmental emphasis (by archiving software vendors) was placed on writing data to tape more rapidly. Since tape media was perceived as a means of aiding in disaster recovery, no special emphasis was placed on restoration of data from the tape, as restore times were perceived as acceptable and it was presumed that the system to which the data would be restored was the system from which it was archived.

Fast-forward to the 21st century with newly implemented SEC/NASD, Sarbanes-Oxley, HIPAA, and other regulatory requirements. Backup software manufacturers are now playing catch-up with their technologies in an attempt to help their customers respond to these regulatory requirements in a timely manner. Failure to be responsive can mean fines, economic sanctions, or even litigation. As a result, backup software customers are demanding improvements to the technology and are even exploring other means of archiving data, such as more costly, but more rapidly accessible, hard-disk-based systems.

Tapes are usually restored with one of two results in mind: to restore a complete system or to restore a subset of files. Each has its own method. The first option requires a complete (full) backup of the system and, depending on the way backups were performed, either the last differential tape set or all of the incremental backup tape sets. Depending on the application that was used, preliminary work may also have to be done on the target system before any data is recovered. Once the preliminary work is complete and the restore process can begin, the data from the full backup is restored first, followed by the differential set or all of the incremental sets in chronological order. Some software packages may need the media inserted in a different order to gather media information prior to restoring, but the restoration typically follows this order.
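The ordering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the BackupSet record, its labels, and the restore_plan function are hypothetical names, and the sketch assumes a site uses either differentials or incrementals after each full backup, not a mixture. Real backup software keeps its own catalog and may require a different media order.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BackupSet:
    label: str   # hypothetical tape-set label
    kind: str    # "full", "differential", or "incremental"
    taken: date  # date the backup job ran


def restore_plan(sets, target):
    """Return the tape sets to restore, in order, to rebuild a system as of `target`."""
    usable = sorted((s for s in sets if s.taken <= target), key=lambda s: s.taken)
    fulls = [s for s in usable if s.kind == "full"]
    if not fulls:
        raise ValueError("no full backup on or before the target date")
    base = fulls[-1]  # most recent full backup
    later = [s for s in usable if s.taken > base.taken]
    differentials = [s for s in later if s.kind == "differential"]
    if differentials:
        # A differential holds every change since the last full backup,
        # so only the most recent one is needed.
        return [base, differentials[-1]]
    # Each incremental holds only the changes since the previous job,
    # so all of them are needed, oldest first.
    return [base] + [s for s in later if s.kind == "incremental"]
```

Calling restore_plan with a full set followed by two incrementals returns all three sets in chronological order, while a full set followed by two differentials returns only the full set and the later differential.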

The method for the second option can be varied. This variety of restoration methods is based on the modification dates of the files in the restoration subset and the backup method used in archiving the data. If the modification and/or creation dates are within the same time period of a backup job, they should be part of the same backup set, whether it is a full backup, differential, or incremental set. If the files reside outside the range of one backup job, the backup sets for each necessary file will be needed. This could include a combination of tape sets much like the first method.
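As a rough illustration of the subset case, the sketch below maps each requested file to the first backup job that ran on or after the file's modification date. It assumes, as a simplification, that whichever job runs next (full, differential, or incremental) captures the change; the function name and inputs are hypothetical, and a real project would rely on the backup software's catalog instead.

```python
from bisect import bisect_left


def sets_needed(file_mod_dates, job_dates):
    """Map each file name to the date of the first backup job on or after its modification date.

    file_mod_dates: dict of file name -> modification date
    job_dates: iterable of dates on which backup jobs ran
    A value of None means no job in the list is late enough to hold the file.
    """
    jobs = sorted(job_dates)
    needed = {}
    for name, modified in file_mod_dates.items():
        i = bisect_left(jobs, modified)  # index of first job date >= modified
        needed[name] = jobs[i] if i < len(jobs) else None
    return needed
```

Files whose dates fall within the same backup window map to the same job, so a single tape set serves them all; files that map to different jobs require one tape set each.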

Challenges for the Attorney

Many of the problems created by backup tapes relate to tape organization and tracking, the lack of user-to-tape mapping, overlapping data on multiple tapes, and a failure to understand how information is stored on hard-to-access media. Information stored on backup media is typically not stored sequentially, nor does it facilitate easy searching or recovery, thereby increasing the cost associated with any restoration project. Litigation teams can employ a variety of tactics to gain control over this part of the discovery process, by taking the following steps:

Understand the Client's Data Retention Architecture

The term "data retention architecture," not unknown in the technical world, made its first appearance in Zubulake v. UBS Warburg (V), in the context of defining the attorney's obligation to understand the client's computer systems. It's safe to say that a client's documents, and specifically information leading to potentially discoverable documents, are likely to be in either easy- or hard-to-access locations, and often in both. Among other issues, investigate the following:

  • Document retention policies
  • Litigation hold orders: Parties may be subject to a hold or preservation order, in which case the normal recycling policies of tapes must be halted
  • Tape recycling practices: Determine normal routines for regular tape backups (e.g., daily, weekly, monthly). Understanding the tape practices employed by the IT team will reduce fruitless searches and permit a more targeted investigation at a later date.


Locate Low Hanging Fruit

A recommended first step is to take a hard look at the "low hanging fruit," that is, locations likely to yield responsive documents without requiring the expense of tape restoration and review. In the absence of information stored in a reasonably accessible location (such as a hard drive, laptop, or server), the attorney and/or legal team may need to embark, as a first step, on tape restoration and the review of the information contained therein.

Sample Specific Tapes

Although one or both parties may request whole collections of backup tapes, it may be appropriate to negotiate the sampling of a smaller and hopefully representative set of backup media as a first step to containing costs. In one instance, and with the consent of the parties, a court ordered the restoration and review of the last full backup tape, with a particular focus on 36 key individuals and several relevant terms.

Granular Searching

Additional narrowing of the document population can be accomplished through:

  • Server segmentation (e.g., searching an email server first)
  • Date range searching
  • User-designated tapes of those believed to be key custodians
  • Departmental segmentation


Representative Questions

Finally, the following are representative items to ask about when a client is in possession of potentially responsive backup tapes:

  • The type of tape
  • Operating system
  • Backup software
  • Version of backup software
  • User information
  • Description of content (if known)
  • Type of email application
  • Type of files


File/Document Extraction

Once the data has been restored from the media on which it exists, the next step to consider is the relationships between the files or documents contained on that media.

Electronic data processing becomes more complex when files have different relationships to one another. For example, an email message is a file, but it may also have attachments that contain embedded or linked data. The attachments can give the email message a different legal significance and context for processing. The email message is said to be the "parent" file, and the attachments are each a "child." The terminology is inherited from computer science, where the original concepts of directory structures emerged. All file systems have directory structures, and most of them support nested parent/child relationships among the files contained within them. This concept of parent/child relationships between messages and attachments is an attribute that is generally required to be captured, preserved, and available as metadata.

Files can also be packed inside of other files. Messages and attachments are contained within containers called archives (.PST for Outlook mail files, .NSF for Lotus Notes), and archives can be nested within other archives. Again, those relationships between files must be maintained and available as the files are reviewed and produced. The requirements of a case will drive the technologies that must be employed to extract and process this data as well as report the parent-child relationship of the data for use by the review team.
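One common way to preserve these relationships is to give every extracted item an identifier and a pointer to its parent. The sketch below is a minimal illustration of that idea, not a description of any particular processing tool; the Item record and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    item_id: int
    name: str
    parent_id: Optional[int] = None  # None marks a top-level container or loose file


def lineage(items, item_id):
    """Return the names from the outermost container down to the given item."""
    by_id = {i.item_id: i for i in items}
    chain = []
    current = by_id[item_id]
    while current is not None:
        chain.append(current.name)
        current = by_id.get(current.parent_id)  # None once the top level is reached
    return list(reversed(chain))


def children_of(items, parent_id):
    """Return the names of the items extracted directly from the given parent."""
    return [i.name for i in items if i.parent_id == parent_id]
```

For a document found inside a ZIP file attached to a message in a .PST archive, lineage returns the archive, the message, the ZIP file, and the document, in that order, which is the information a review team needs to see a document in context.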


Metadata

Metadata refers to the digital attributes of electronic documents that are appended to those documents during their creation or use in their native application. Metadata is created and exists in its natural state before the electronic discovery process is initiated. The existence of metadata is referenced in the comments to the proposed Federal Rules of Civil Procedure, where it is characterized as the historical, managerial, and tracking components of a document or file; these components can be lost when the document is printed to paper or quasi-paper. Metadata can be generated from two sources: the operating system and the software application itself. The types of metadata recorded by the operating system are name, dates (created, modified, and accessed), file type, and size for all files, and sent, received, and modified dates, subject, and recipient information for email files. Some of that same data, such as dates and file type, is also recorded by the file itself. Other important data is captured by the file itself: authorship, revision information, and comments. This information can be an important component of a legal strategy.
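To make the operating-system side concrete, the following Python sketch reads the file system attributes of a single file using the standard library. The function name and the dictionary keys are illustrative choices. Application-level metadata such as authorship or revision history is stored inside the file and requires format-specific tools, which this sketch does not attempt.

```python
from datetime import datetime, timezone
from pathlib import Path


def filesystem_metadata(path):
    """Collect the attributes the operating system records for one file."""
    p = Path(path)
    st = p.stat()  # reads attributes without opening the file's contents

    def to_utc(timestamp):
        return datetime.fromtimestamp(timestamp, tz=timezone.utc)

    return {
        "name": p.name,
        "extension": p.suffix.lower(),
        "size_bytes": st.st_size,
        "modified": to_utc(st.st_mtime),
        "accessed": to_utc(st.st_atime),
        # st_ctime is the creation time on Windows but the time of the last
        # attribute change on most Unix systems, so it is labeled neutrally.
        "changed_or_created": to_utc(st.st_ctime),
    }
```

Because copying or opening a file can update some of these values, they are normally captured before any other processing touches the source data.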

Metadata provides search criteria and contextual information that may not be in the body text. The ability to search for and correlate data during the review process is dependent on how well the metadata was processed. For example, altered metadata may compromise a search for files that were modified between two dates. If this happens, the search may return more, or fewer, files than it would if the metadata for all files were pristine. Whether a legal team decides to use the metadata or not, it is a commonly provided component in electronic discovery deliverables. Metadata is extracted and archived as part of processing the source data so that it is available during review. Although metadata may not be used during processing, it is still critical that it be maintained for purposes of electronic discovery. If not, the integrity and authenticity of the data can be brought into question.

While metadata can provide useful and evidentially admissible information, its unique character and often hidden locations can create a high risk of a waiver of privilege. As the volume of electronic information continues to explode, the ability of the legal team to review each document and its associated metadata is substantially impacted due to time, technical, and financial resources. A two-pronged strategy may be employed to reduce the risks of waiving the privilege inadvertently:

  • Negotiate an agreement with the opposing side to protect the production of privileged information
  • Employ sophisticated search and filter technologies to identify and segregate potentially privileged information


Chain of Custody at a File Level

Just as physical media containing electronic documents must be treated as evidence, the same rule holds true for each individual file. One of the benefits of an automatic, technical process is the ability to substantiate exactly what process a file went through prior to admission in a case. This benefit is even greater when the case consists of millions of files; through automation every file is subject to the same processes. Additionally, the ability to account for every file that was processed, even files that are segregated based on data culling criteria or exceptions, is critical to ensure that each piece of evidence was handled properly.
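A file-level chain of custody is often kept as an append-only log in which every processing step records the file's identifier, the action taken, and a hash of the content at that moment. The sketch below illustrates the idea; the log format, the function names, and the document identifiers are assumptions made for the example, not a prescribed standard.

```python
import hashlib
from datetime import datetime, timezone


def log_event(log, file_id, content, action):
    """Append one custody entry recording what happened to a file and its hash."""
    entry = {
        "file_id": file_id,
        "action": action,
        "sha256": hashlib.sha256(content).hexdigest(),
        "at": datetime.now(timezone.utc).isoformat(),
    }
    log.append(entry)
    return entry


def unchanged(log, file_id):
    """True if the file has entries and every entry carries the same hash."""
    hashes = {e["sha256"] for e in log if e["file_id"] == file_id}
    return len(hashes) == 1


def unaccounted(log, expected_ids):
    """Return the expected file ids that never appear in the log."""
    seen = {e["file_id"] for e in log}
    return [fid for fid in expected_ids if fid not in seen]
```

Files that are culled or set aside as exceptions would receive a log entry like any other, so that unaccounted returns an empty list when every file received has been tracked.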


Deduplication

The amount of electronic data in corporations has grown, particularly in those corporations that have had overly broad or nonexistent record retention policies. Out of this data growth, data culling has become a vital element of electronic discovery. A common data culling technique is deduplication. Hardware, software, and service companies have developed technical solutions to reduce duplicate data. Deduplication is the process of identifying and segregating those files that are exact duplicates of one another. The goal is to provide a deliverable that contains one copy of each original document, while maintaining the information associated with each instance of that document within the collection.

There are several ways duplicates are identified. A combination of metadata information can be compared to match files. An electronic fingerprint of each file can be taken and compared using a mathematical hashing algorithm such as MD5, SHA-1, or SHA-256. In some cases, a hashing algorithm is used in combination with metadata.
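As an illustration, the sketch below computes a file fingerprint with Python's standard hashlib module, reading the file in chunks so that large files do not have to fit in memory. The second function shows one hypothetical way of combining a hash with metadata for email: the choice of fields and the normalization applied to them are assumptions for the example, and actual tools differ in which fields they include.

```python
import hashlib


def fingerprint(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of a file's bytes using the named hash algorithm."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def email_fingerprint(sender, subject, sent, body):
    """Hash selected, normalized email fields (a hypothetical field selection)."""
    fields = [sender.strip().lower(), subject.strip().lower(), sent.strip(), body.strip()]
    # A separator character keeps adjacent fields from running together.
    return hashlib.md5("\x1f".join(fields).encode("utf-8")).hexdigest()
```

Two files with identical bytes produce the same digest regardless of their names or locations, while changing a single byte produces a different digest.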

In addition to the technical aspect of analyzing files for deduplication, it is important to determine to what dataset the files should be compared. For example, deduplication can occur at the custodian level or globally across all users and data sets. Different types of data may be subject to different types of deduplication; e.g., data from file servers may be deduplicated globally, while email may be deduplicated at the custodian level. In its broadest application, deduplication can be applied across an entire data collection (global deduplication). In this case, only one original file will be provided. For example, if a companywide memo is sent over email, only the first instance of that memo that is processed will be available for review. Through the use of metadata, it can be determined who received this email, but only one instance of the memo is made available for review. This technology has enormous significance from a cost savings standpoint, and since deduplication can occur at various stages of processing (tape restoration, keyword search and/or upload), the cost savings often apply to various processing stages.

While deduplication of exact duplicates is the most precise way to deduplicate a dataset, files that are materially similar are often not bit-level duplicates, so they survive exact deduplication and may each be reviewed and produced. In an attempt to reduce the occurrence of such near duplicates and to avoid the costs associated with reviewing them, a number of technologies are emerging that promise to manage this issue. If files that are similar to each other are grouped together by such technologies, it may also be possible to manage the review process such that the same reviewer is always reviewing similar documents (for example, various draft versions of a Word document, or threads of emails). Collectively, these technologies are being referred to in the market as "Near Deduplication" technologies. If near deduplication technologies mature to the point that it is possible to assure that no material evidence will be omitted and courts accept the elimination of near duplicates as an appropriate data culling strategy, the application of these technologies may grow in popularity.
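The grouping idea can be illustrated with a simple word-level similarity score from Python's standard difflib module. This is a toy sketch: commercial near-deduplication products use their own, generally more scalable, techniques, and the 0.8 threshold and the function names here are arbitrary assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0.0-1.0 score based on the word sequences two texts share."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def group_near_duplicates(docs, threshold=0.8):
    """Group document ids whose text is similar to the first member of a group.

    docs: dict of document id -> extracted text
    """
    groups = []  # each group is a list of ids; the first id acts as the pivot
    for doc_id, text in docs.items():
        for group in groups:
            if similarity(docs[group[0]], text) >= threshold:
                group.append(doc_id)
                break
        else:
            groups.append([doc_id])  # nothing similar enough, start a new group
    return groups
```

Two drafts that differ by a single word land in the same group and could be routed to the same reviewer, while an unrelated memo starts a group of its own. Nothing is removed from the collection; the sketch only groups.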

Deduplication can occur at the custodian level as well, and for the case above, where global deduplication would yield one instance of a file, the custodian level deduplication would yield one unique instance for each custodian. This use of deduplication can be important in cases where the fact that the memo was in a person's email collection is pertinent. For example, in a white collar criminal case, deletion activity can be important. Each custodian's collection would contain one original copy of every document that was found within their collection, giving the review team a complete view of each custodian's data collection.
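The difference between the two scopes comes down to what is used as the comparison key: the file hash alone for global deduplication, or the custodian together with the hash for custodian-level deduplication. The sketch below illustrates this under simple assumptions (files supplied in memory as tuples, SHA-1 as the fingerprint); the function name and data layout are hypothetical.

```python
import hashlib
from collections import defaultdict


def deduplicate(files, scope="global"):
    """Keep the first instance of each duplicate and record every instance.

    files: iterable of (custodian, path, content_bytes)
    scope: "global" compares across all custodians; "custodian" compares
           only within each custodian's own collection.
    Returns (kept, instances): the retained (custodian, path) pairs and a
    mapping from each dedup key to every (custodian, path) that shared it.
    """
    if scope not in ("global", "custodian"):
        raise ValueError("scope must be 'global' or 'custodian'")
    kept = {}
    instances = defaultdict(list)
    for custodian, path, content in files:
        digest = hashlib.sha1(content).hexdigest()
        key = digest if scope == "global" else (custodian, digest)
        kept.setdefault(key, (custodian, path))  # first instance processed wins
        instances[key].append((custodian, path))
    return list(kept.values()), dict(instances)
```

With a memo held by two custodians, the global scope keeps one copy in total and the custodian scope keeps one copy per custodian. In both cases the instances mapping preserves who held each copy, which is the information the metadata is meant to retain.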

Deduplication can also be applied to any subset of data. For example, it is possible to identify duplicate files within part of a custodian's collection, such as only their email, or within a specific time period. The specifics of the matter at hand determine the appropriate criteria, and the automatic means to execute the criteria ensure that each document is accounted for and tracked, while eliminating the need for redundant review. Applying deduplication criteria can significantly reduce the amount of data to be reviewed.

Data Culling

Data culling is the umbrella term used to describe the technical tactics or processes employed to reduce a large document population to a much smaller set. Deduplication, key word searching, and file extension filtering are a few examples of common data culling strategies. Each strategy can reduce the number of non-responsive files and emails substantially, a critical step in controlling the explosive growth of electronic information.

Filtering can narrow a dataset by selecting responsive files based on file-level criteria such as metadata. The custodian list, file type, and timeframe associated with the matter are standard criteria. When using time as a culling criterion, it is important to remember that there are several time stamps that can be associated with any one file (e.g., for application files, created, modified, and accessed dates may be available for use, and for email files, time sent, received, created, and last modified may be usable). Being specific about which of these time stamps is to be used as the filter parameter will help to ensure the resultant dataset meets expectations.
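The sketch below shows a filter of this kind and makes the choice of time stamp an explicit parameter. The FileRecord fields and the cull function are hypothetical names for illustration; a real processing platform would expose its own field names.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class FileRecord:
    custodian: str
    extension: str
    created: datetime
    modified: datetime


def cull(records, custodians, extensions, start, end, date_field="modified"):
    """Keep records matching the custodian list, file types, and date window.

    date_field names which time stamp ("created" or "modified") is tested.
    """
    keep = []
    for r in records:
        stamp = getattr(r, date_field)
        if (r.custodian in custodians
                and r.extension.lower() in extensions
                and start <= stamp <= end):
            keep.append(r)
    return keep
```

Running the same filter once with date_field="modified" and once with date_field="created" can return different files from the same collection, which is why the parameter should be agreed on before culling begins.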

Other methods of data culling can also yield significant results. System files can be culled from a responsive data collection. Known system files are recognized by their MD5 hash values, which provide a virtual digital fingerprint that can be used to weed out files not of interest. Such a mechanism allows large system files to be set aside as demonstrably non-responsive, so that the size of the document collection is correspondingly reduced--a significant cost-reducing strategy. Segregating the operating system files from a source hard drive early in the processing phase can greatly cut down the data set and help minimize processing time when no intrinsic value could be gained from these common types of file. Other file types, such as internet cookies (files stored to a person's hard drive as the person uses the internet), can also be programmatically segregated. These options help hone the dataset to the most relevant documents.
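A minimal sketch of hash-based system file culling follows. It assumes a reference set of MD5 values for known system files is already available (the known_md5 argument is a placeholder for such a list) and that file contents are supplied in memory; both are simplifications for illustration.

```python
import hashlib


def split_known_files(files, known_md5):
    """Separate files whose MD5 matches a known system file from the rest.

    files: dict of path -> content bytes
    known_md5: set of hex MD5 digests for known system files (placeholder list)
    Returns (retained, set_aside) as lists of paths.
    """
    retained, set_aside = [], []
    for path, content in files.items():
        digest = hashlib.md5(content).hexdigest()
        if digest in known_md5:
            set_aside.append(path)  # segregated, but still tracked
        else:
            retained.append(path)
    return retained, set_aside
```

The files that are set aside are returned, not discarded, so they can still be accounted for in processing reports.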

One of the most common culling strategies is the use of search terms, such as the names of key employees implicated by a lawsuit or investigation and specific words likely to be at play in the litigation, whether contained within the text of a document or within the metadata. These methodologies are helpful in wading through vast quantities of file types and sources, and should be considered as one of the first tactics employed by the legal team. The ability to use technical tools does not, of course, replace the legal analysis required of every case; it does, however, create an environment whereby the electronic information may become more manageable, from both a cost and strategic perspective.
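A bare-bones version of term searching across both document text and metadata might look like the sketch below. It performs case-insensitive whole-word matching only; production search tools add indexing, stemming, proximity operators, and similar features. The function name and the sample inputs are hypothetical.

```python
import re


def matching_terms(text, metadata, terms):
    """Return the search terms found in a document's text or metadata values."""
    haystack = " ".join([text] + [str(v) for v in metadata.values()])
    hits = []
    for term in terms:
        # \b anchors the match to word boundaries so that a term does not
        # match inside a longer word.
        if re.search(r"\b" + re.escape(term) + r"\b", haystack, re.IGNORECASE):
            hits.append(term)
    return hits
```

A document would be kept for review when matching_terms returns at least one hit. Recording which terms hit each document also gives the legal team a basis for refining the term list.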

Source: EDRM (edrm.net)