eDiscovery Collection: Processes

Scope of Collection

The relevant data must be collected from each organization or custodian to be deemed part of an appropriate and defensible collection. Costs and benefits must be weighed to determine what is appropriate and reasonable for the litigation. Courts are increasingly being asked to rule on what must be produced when the parties have difficulty agreeing on the appropriate balance. As technology advances, the amount of data that can reasonably be reviewed is increasing; so too is the amount of data that must be collected.

The collector must be careful to employ all specifications from the identification team in order to collect all the targeted data and only the targeted data. Identifiers often target data by:

  1. User/owner or location;
  2. Date;
  3. Type of systems or files; and
  4. Keywords contained in files or systems.

Regardless of the identification method, however, nearly all data must be collected from a storage medium or a network.
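The four targeting criteria listed above can be pictured as a single filter applied to each candidate file. The following is a minimal sketch, not a prescribed tool: the `FileRecord` and `CollectionSpec` types, their field names, and the decision to require all four criteria at once are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class FileRecord:
    """Hypothetical description of one candidate file."""
    path: str
    owner: str
    modified: date
    text: str  # text already extracted from the file


@dataclass
class CollectionSpec:
    """Hypothetical specification handed over by the identification team."""
    custodians: set[str]
    start: date
    end: date
    extensions: set[str]
    keywords: set[str]


def is_targeted(rec: FileRecord, spec: CollectionSpec) -> bool:
    """Return True only when the file meets every criterion (an assumption)."""
    ext = rec.path.rsplit(".", 1)[-1].lower() if "." in rec.path else ""
    body = rec.text.lower()
    return (
        rec.owner in spec.custodians                        # 1. user/owner
        and spec.start <= rec.modified <= spec.end          # 2. date
        and ext in spec.extensions                          # 3. file type
        and any(k.lower() in body for k in spec.keywords)   # 4. keywords
    )


spec = CollectionSpec(
    custodians={"jdoe"},
    start=date(2020, 1, 1),
    end=date(2020, 12, 31),
    extensions={"docx", "pdf"},
    keywords={"merger"},
)
records = [
    FileRecord("shared/plan.docx", "jdoe", date(2020, 3, 5), "Draft merger timeline"),
    FileRecord("shared/menu.pdf", "jdoe", date(2020, 3, 5), "Lunch menu"),
    FileRecord("shared/old.docx", "asmith", date(2019, 3, 5), "merger notes"),
]
targeted = [r.path for r in records if is_targeted(r, spec)]
```

In practice the criteria may be combined differently (for example, keywords applied only within a date range), so the conjunction used here should be read as one possible policy.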

Sources for Collection

There are three primary categories of data sources: fixed storage, portable storage, and third-party hosted storage environments not under the direct control of the data owner.

Fixed Storage

Hard drives are a primary source of fixed electronic storage. Hard drives exist in a number of computing environments. These include:

  1. Network servers;
  2. Backup systems such as RAID devices;
  3. Computer workstations;
  4. Desktop computers; and
  5. Home computers.

Portable Storage

Portable storage covers a broad category of electronic equipment using a wide variety of storage media. Traditionally, most portable data was contained on some type of disk, such as a CD, DVD, floppy disk, or Zip disk. Backup tapes are another traditional portable storage medium.

More recently, specialized portable storage devices have been created for a variety of uses. The number of portable storage devices has grown tremendously with the advent of PDAs (Personal Digital Assistants) and smartphones. Cell phones and iPods are portable electronic devices that can contain relevant data. Newer devices incorporate the functions of both PDAs and cell phones. These devices typically store data on a USB drive or other flash memory device. Other types of portable storage devices function much like a portable hard drive, using a variety of storage technologies.

Because portable storage disks and devices are by definition portable and because they are typically not connected to a fixed data storage facility, the collection of data from these devices is laborious and often requires the original device from the custodian.

Collection Methods

One of the biggest challenges in a collection and preservation exercise is determining what collection methods are required or advisable. Each type of data storage requires a different strategy and approach.

Fixed Storage

Manual Copy

This is sometimes referred to as the "drag and drop" or Windows method for copying data. A manual copy is made by copying selected files and/or directories, then pasting them to a network folder and/or CD. The copying is done either by the computer user or by the corporate IT staff.

Directory-level metadata is unavoidably modified during the "drag and drop" process. This is the metadata maintained by the operating system (e.g., Microsoft Windows), such as "Create Date," "Last Modification Date," and "Last Access Date." These are external metadata fields; changing them does not alter the internal metadata unique to the file type (unless the file is actually opened).
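The effect on operating-system metadata can be observed with a small experiment. The sketch below uses Python's standard `shutil` module as a stand-in for a manual copy; the file names are hypothetical, and the modification time is used only as a convenient example, since which timestamps change in a real "drag and drop" depends on the operating system and the tool involved.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create a source file and backdate it to 2000-01-01 00:00:00 UTC.
    src = os.path.join(tmp, "memo.txt")
    with open(src, "w") as f:
        f.write("quarterly figures")
    backdated = 946684800
    os.utime(src, (backdated, backdated))

    # shutil.copy copies content and permissions, but not timestamps.
    plain = shutil.copy(src, os.path.join(tmp, "plain_copy.txt"))
    # shutil.copy2 additionally attempts to preserve file metadata.
    kept = shutil.copy2(src, os.path.join(tmp, "kept_copy.txt"))

    src_mtime = os.stat(src).st_mtime
    plain_mtime = os.stat(plain).st_mtime   # reflects the time of the copy
    kept_mtime = os.stat(kept).st_mtime     # matches the source

    # The file content itself is identical in both cases.
    with open(src, "rb") as a, open(plain, "rb") as b:
        same_content = a.read() == b.read()
```

The content survives the plain copy intact, while the file system's record of when the file was last modified does not, which is the distinction between internal content and external metadata drawn above.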

The primary concerns with the manual copy method are that the quality of the data is completely dependent on the operator performing the function and the ability to verify/authenticate the process is limited. Opposing counsel may attack the collection process and the integrity of the data as a litigation strategy. Improper collection methods may alter the content of electronic discovery collected and, in the worst case, jeopardize the admissibility or reliability of the evidence collected.

Active Data Copy

This process is designed to capture all of the "active data" on a storage medium. Active data is the information readily available as normally seen by an operating system, including "hidden" operating system files. Deleted files are not active data, as they are not seen by the operating system. Traditionally, tools such as Norton Ghost are used to make active data copies when transferring files from one computer to another. This is a generally accepted collection methodology when the primary concern is the content of the data and not the activity of the user.

The active data copy process alters the directory level metadata while maintaining the internal file metadata.

A significant limitation of this collection method is that it fails to capture so-called "inactive data," generally data that has been deleted or modified. If there are any allegations in a case regarding a data custodian's deliberate or inadvertent destruction of data, the active data copy process will not provide evidence to confirm or refute such allegations.

Using non-forensic tools such as Zcopy or Ghost through a network can be acceptable. The important issue is to define the methodology for the collection and use it consistently throughout the collection, with the appropriate checks and balances.

Forensic Image Process

The "forensic image process" is the process of creating a "mirror image" copy of a storage medium so that both active and inactive data sets are maintained. This collection methodology is generally required when the activity of a user is as important as the content itself, or when concerns regarding the destruction of data may be raised.

Court-accepted forensic tools are used to capture every bit of electronic information stored on a hard drive. The captured information is stored in a "forensic image," which is generally encrypted and can be password protected. Authentication of the data can be verified using a "hash value," otherwise known as a digital thumbprint, to ensure that no alteration has taken place from the time of the acquisition. Hard copy "chain of custody" documentation captures important logistical information related to the acquisition of the data and is typically used to corroborate the time, place, and personnel involved in the data collection.
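Hash-based authentication can be illustrated in a few lines. The following is a minimal sketch, assuming SHA-256 as the algorithm and a small temporary file standing in for a forensic image; the `file_hash` helper and the file name are hypothetical and do not correspond to any particular forensic tool.

```python
import hashlib
import os
import tempfile


def file_hash(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large images need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for an acquired image.
    image_path = os.path.join(tmp, "drive.img")
    with open(image_path, "wb") as f:
        f.write(b"\x00" * 4096 + b"evidence")

    # Recorded at acquisition time, e.g. on the chain of custody form.
    acquisition_hash = file_hash(image_path)

    # Later verification: an unchanged image yields the same value.
    verified = file_hash(image_path) == acquisition_hash

    # Changing even a single byte yields a different value.
    with open(image_path, "r+b") as f:
        f.seek(10)
        f.write(b"\x01")
    tampered = file_hash(image_path) != acquisition_hash
```

Because any alteration changes the hash, a match between the value recorded at acquisition and the value computed later supports the claim that the image has not changed in the interim.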

The forensic image process is highly detailed and generally requires that a trained forensic specialist perform this function. Substantial documentation is created and maintained so that the party performing this task may be prepared to testify concerning the adequacy and reliability of the process. As a result, it is generally recommended as a best practice to have the forensic image process performed by an objective third party to avoid any process attacks or spoliation concerns.

Organizations generally build their server architecture around redundancy and disaster recovery, so they employ technologies such as software and/or hardware RAID (shorthand for Redundant Array of Independent Disks) to prevent data loss. RAID systems use multiple hard drives and write data to those disks so that in the event of a hard drive crash, the bad drive can be replaced with no data loss to the whole system. The primary concern with creating forensic images from RAID systems is that the original RAID configuration can be very complex, and re-building it can be very complicated, especially if there is a combination hardware/software RAID.
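How a RAID array can lose a drive without losing data can be shown with parity. The sketch below assumes a simplified parity scheme in the style of RAID 5, with three data blocks and one parity block; the text above does not name a RAID level, and real arrays also stripe and rotate parity across drives, which is part of why rebuilding them from images is complicated.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


# Three data "drives" holding one block each (illustrative content).
data_disks = [b"ACME", b"CORP", b"2004"]

# The parity block is the XOR of all data blocks.
parity = xor_blocks(data_disks)

# Suppose the second drive fails. XOR-ing the surviving blocks with
# the parity block reproduces the missing block.
survivors = [data_disks[0], data_disks[2], parity]
rebuilt = xor_blocks(survivors)
```

The same property means that an examiner who images the member drives individually must know the block size, drive order, and parity layout before the images can be reassembled into a readable volume.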

Using a network forensic package may be a good methodology for retrieving data from a network. This process can capture the unallocated and slack space on networked drives. These software packages are generally expensive and difficult to implement due to network firewalls and other pre-existing security protocols. They are very beneficial for a hybrid collection methodology, since a forensic tool can then be used to selectively identify targeted types of information from workstations and network volumes, such as email files, Microsoft Office documents, and Adobe Acrobat files.

Supervised Tape Archive Process ("STAP")

Collection and preservation from network servers can be problematic due to the complex hardware architecture of an organization.

When dealing with complex server systems, a best practice process can be implemented to manage the time and cost of collecting and preserving information from them.

STAP entails agreement between plaintiff and defense counsel on the selection of a mutually acceptable expert or experts who observe the full archival of a computer system to a backup tape. The expert does not implement this process, but only looks over the shoulder of the organization's IT personnel responsible for the normal backup procedures. When the process is completed, the expert takes immediate custody of the backup tapes with the appropriate chain of custody documentation.
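The chain of custody documentation accompanying such a handover can be modeled as an append-only log of transfer events. This is a minimal sketch under stated assumptions: the `CustodyEvent` and `CustodyLog` types, their fields, and the sample names and places are hypothetical, chosen only to reflect the time, place, and personnel that such documentation typically records.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class CustodyEvent:
    """One transfer of an evidence item (hypothetical fields)."""
    when: datetime
    place: str
    released_by: str
    received_by: str
    item: str
    note: str = ""


@dataclass
class CustodyLog:
    """Append-only list of transfers for one or more items."""
    events: list[CustodyEvent] = field(default_factory=list)

    def record(self, event: CustodyEvent) -> None:
        self.events.append(event)

    def current_holder(self, item: str) -> Optional[str]:
        """Return whoever received the item most recently, if anyone."""
        for event in reversed(self.events):
            if event.item == item:
                return event.received_by
        return None


log = CustodyLog()
log.record(CustodyEvent(
    when=datetime(2024, 5, 1, 17, 30, tzinfo=timezone.utc),
    place="Data center, server room 2",
    released_by="IT backup operator",
    received_by="Neutral expert",
    item="TAPE-0001",
    note="Full archive observed from start to finish",
))
log.record(CustodyEvent(
    when=datetime(2024, 5, 2, 9, 0, tzinfo=timezone.utc),
    place="Expert's evidence locker",
    released_by="Neutral expert",
    received_by="Evidence custodian",
    item="TAPE-0001",
))
holder = log.current_holder("TAPE-0001")
```

Events are immutable once recorded, mirroring the practice of never amending an earlier entry on a paper form.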

A limitation to this process is that deleted data resident on the hard drives of the network servers will not be captured. This limitation generally does not affect the recovery of email system servers, file servers, and database servers, but certain information, such as partially overwritten data, may not be preserved. To overcome this limitation, it may be possible, depending on the hardware and software configuration of the servers, to create forensic images of the hard drives used in the network.

Portable Storage

Backup Tapes

Backup tapes have historically been used for disaster recovery purposes and not as a primary repository of electronic evidence. Some notable cases assumed that backup tapes were beyond the scope of normal discovery because of their traditional use and because of the cost and difficulty in retrieving specific information from backup tapes. However, as technology has reduced the cost and difficulty of accessing data from backup tapes, they have increasingly been subject to discovery. If important questions exist involving historical data that is no longer available on live systems, or the tapes fall under the definition of "ordinary course of business," they are now more likely to be discoverable as evidence.

When discoverable electronic information is contained on backup tapes, understanding the backup system and its operating protocol is essential to collect all relevant evidence and to help control costs and time. When balancing the relevance of the potential information contained on backup tapes against the costs of processing those tapes, the type of backup software and tape type need to be identified to understand the complexity of the exercise.

"Simple backup" types are backup sessions that are linear in nature and are most commonly associated with one tape drive per computer. Examples of simple backup types are ARCserve, NT Native, and Backup Exec. Simple Backup sessions start archiving the selected data in sequential order beginning with the primary root directory (e.g., C:). This information is then written to the tape in that fashion.

"Complex backup" types are backup sessions using enterprise level backup software, such as Legato NetWorker, CommVault Galaxy, Tivoli Storage Manager (TSM), Omniback, and Netbackup. These systems employ a server-based backup system connected to a tape library. The backup server pulls packets (called "threads" or "streams") of information from the other servers connected to the backup system. The threads are numerically associated to the original server and stored onto the tape media in the tape library. Readable data is not stored on the tapes in these complex backup types, as the data is contained within the threads on the tapes.

There are two methods for restoring data from a tape backup system: native environment restoration and non-native environment restoration.

The most common way internal IT departments extract data from backup tapes is using the native environment restoration ("NE restoration") process, which requires the original software (including patches), passwords, hardware and network configuration. The NE restoration process is most appropriate when dealing with a relatively small restoration and when the restoration will not affect the current active data. However, the NE restoration process can be costly and time-consuming if a large data environment needs to be recreated, especially if the timeline for the restoration reaches far into the past.

Non-native environment extraction ("NNE extraction") is the most common collection method employed by third-party vendors specializing in backup tape processing. The NNE extraction process may be used, to varying degrees, to extract data from both simple and complex backup tape types.

The primary benefit of the NNE extraction process is that it is relatively quick and less expensive to perform than the NE restoration process.
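The reason data on a complex backup tape is not directly readable, and the kind of reassembly an extraction has to perform, can be pictured with interleaved streams. The sketch below is purely illustrative: the round-robin interleaving, the fixed chunk size, and the function names are assumptions, and commercial backup products use their own proprietary tape formats.

```python
from itertools import zip_longest


def chunk(data: bytes, size: int) -> list[bytes]:
    """Split data into fixed-size pieces (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def write_multiplexed(sources: dict[int, bytes], size: int) -> list[tuple[int, bytes]]:
    """Interleave chunks from several servers, tagging each with its stream id."""
    tagged = [[(sid, c) for c in chunk(data, size)] for sid, data in sources.items()]
    tape = []
    for round_of_chunks in zip_longest(*tagged):
        tape.extend(pair for pair in round_of_chunks if pair is not None)
    return tape


def extract_stream(tape: list[tuple[int, bytes]], stream_id: int) -> bytes:
    """Reassemble one server's data by following its stream id."""
    return b"".join(payload for sid, payload in tape if sid == stream_id)


# Two servers backed up at the same time (stream ids are hypothetical).
sources = {1: b"mail server data", 2: b"file server"}
tape = write_multiplexed(sources, size=4)

# Reading the tape end to end does not show either server's data contiguously.
raw = b"".join(payload for _, payload in tape)
contiguous_on_tape = sources[1] in raw

# Following the stream identifiers recovers the original data.
recovered = extract_stream(tape, 1)
```

An extraction tool that understands how the streams are tagged can recover the data without recreating the original backup environment.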

Source: EDRM (edrm.net)