eDiscovery Records Management: Archiving

Based on the current e-discovery landscape, corporations implementing e-discovery software solutions typically find themselves examining their broader records management policies, which include the information needs of each line of business and company administration. E-mail databases are growing at an estimated 20% per annum, users spend 25% of their workday managing their email inboxes, and on average they send and receive 100 emails per day.[1] Implementing an archive that supports e-mail but can also be used for other enterprise content is becoming a necessity.

Archives v. Backups

With most archive solutions available on the market, as email and instant messages are sent and received, each message is bifurcated and copied into the archive before it reaches the email server, or whatever instant messaging infrastructure is in use. Message integrity needs to be maintained to avoid spoliation concerns, and all messages end up in the archive. An archive has a broad impact on the overall corporate IT infrastructure.

By contrast, daily backups do not capture every message, because users delete messages between backup runs. Backups represent a point-in-time snapshot of email data, and are not intended in any way to provide an archive or compliance solution. Companies relying on backup media for records management purposes end up spending substantial amounts on consulting engagements with forensics experts, or on in-house resources using recovery-from-backup solutions, to recover email from backups.
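The difference between capturing messages in transit and taking a point-in-time backup can be sketched in a few lines. This is only an illustration: the `deliver` and `nightly_backup` functions and the in-memory `journal` and `mailbox` structures are hypothetical and do not correspond to any particular product.

```python
def deliver(message, journal, mailbox):
    """Copy the message to the journal in transit, then deliver it."""
    journal.append(dict(message))      # archive copy, captured before delivery
    mailbox[message["id"]] = message   # production mailbox copy


def nightly_backup(mailbox):
    """Point-in-time snapshot of whatever is in the mailbox right now."""
    return dict(mailbox)


journal, mailbox = [], {}
deliver({"id": "m1", "subject": "Draft terms"}, journal, mailbox)
deliver({"id": "m2", "subject": "Lunch"}, journal, mailbox)

del mailbox["m1"]                      # user deletes a message during the day
backup = nightly_backup(mailbox)

journaled_ids = [m["id"] for m in journal]
```

In this sketch the backup holds only `m2`, while the journal still holds both messages, which is the gap the archive is meant to close.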

Storage Optimization

Additionally, most archive solutions extract mail items from production mailboxes and replace them on the email server with stub messages. This reduces the storage burden on production mail servers and moves most content to a separate storage area, where messages and attachments are compressed and stored securely in the archive. Users can continue to perform all of their typical email functions.
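A minimal sketch of stubbing follows, assuming a mailbox is a list of dictionaries and the archive is a dictionary keyed by message ID. The `stub_mailbox` and `open_message` names, the `min_size` threshold and the stub text are all assumptions made for illustration; real products decide what to stub by policy (age, size, mailbox quota) and store content very differently.

```python
import zlib


def stub_mailbox(mailbox, archive, min_size=1024):
    """Move large message bodies into the archive and leave a stub behind."""
    for msg in mailbox:
        body = msg["body"]
        if msg.get("stub") or len(body) < min_size:
            continue  # already stubbed, or too small to be worth moving
        archive[msg["id"]] = zlib.compress(body.encode("utf-8"))
        msg["body"] = f"[archived: retrieve item {msg['id']}]"
        msg["stub"] = True


def open_message(msg, archive):
    """Return the full body, fetching it from the archive for a stub."""
    if msg.get("stub"):
        return zlib.decompress(archive[msg["id"]]).decode("utf-8")
    return msg["body"]


mailbox = [
    {"id": "m1", "body": "x" * 5000},
    {"id": "m2", "body": "short note"},
]
archive = {}
stub_mailbox(mailbox, archive)
restored = open_message(mailbox[0], archive)
```

The user-facing call, `open_message`, behaves the same for stubbed and unstubbed items, which is the sense in which users keep their typical email functions.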

PST Elimination or Controls

Since archives typically afford every end user a larger long-term storage quota, corporate IT departments can take advantage of archive deployments to restrict or eliminate the use of PST files. From an e-discovery perspective, this eliminates the need to collect email from desktops and network drives, reducing costs and providing more complete assurance that requested files are actually provided.

Traditionally, in the backup-and-restore world, e-discovery review has involved a snapshot of the user's folder structure, with native files in labeled folders. This is still possible when mail is simply stubbed to the archive, because IT is "offloading" data from the servers at a point in time, much as it would to create backup tapes. However, if the files were journaled before reaching the email or file server, the user's folder structure is not captured. As discussed, journaling goes beyond a point in time and captures all files in transit, before each file reaches its destination. Specifically, the corporation is protected against intra-day deletion. Without journaling, a user who routinely deletes messages during the course of business each day can create records that are sent to others but leave no visible trace on his own system, which in effect creates an underground archive. In many cases e-mail archives also use hash values to ensure that only a single instance of each file is stored.

Single-Instance Storage and De-Duplication

In order both to reduce the network storage management burden and to simplify the review phase of the e-discovery cycle, an archive should ideally create and maintain a single instance store (SIS) of each message, or ensure global de-duplication some other way. Typically, hash values of some sort are used to ensure uniqueness, since a hash value represents a digital fingerprint for the message body and metadata of each email item. Each user file is then ideally de-duplicated by cross-checking its hash value against those of all files in the archive. Archives that implement SIS or de-duping typically reduce the amount of data stored in the archive by 60-90%, thus reducing the cost of storage, increasing search speeds and reducing the amount of time costly legal resources are tied up reviewing the corpus of data of interest.
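Hash-based single-instance storage can be sketched as follows. This is an illustration under stated assumptions, not how any particular archive works: SHA-256 is an arbitrary choice of hash, the set of header fields included in the fingerprint is an assumption, and `message_fingerprint` and `SingleInstanceStore` are hypothetical names.

```python
import hashlib


def message_fingerprint(headers, body):
    """Hash the body plus a fixed set of header fields (illustrative choice)."""
    h = hashlib.sha256()
    for field in ("from", "to", "subject", "date"):
        h.update(field.encode("utf-8"))
        h.update(b"\x00")  # separator so adjacent fields cannot run together
        h.update(headers.get(field, "").encode("utf-8"))
        h.update(b"\x00")
    h.update(body.encode("utf-8"))
    return h.hexdigest()


class SingleInstanceStore:
    def __init__(self):
        self.items = {}       # fingerprint -> (headers, body), stored once
        self.references = {}  # fingerprint -> set of mailbox owners

    def ingest(self, owner, headers, body):
        """Store the content only if unseen; always record who holds a copy."""
        fp = message_fingerprint(headers, body)
        is_new = fp not in self.items
        if is_new:
            self.items[fp] = (headers, body)
        self.references.setdefault(fp, set()).add(owner)
        return is_new


store = SingleInstanceStore()
msg = {"from": "a@example.com", "to": "b@example.com",
       "subject": "Q3", "date": "2005-04-21"}
first = store.ingest("alice", msg, "Numbers attached.")
second = store.ingest("bob", msg, "Numbers attached.")
```

Here the second ingest adds no new content but still records that a second custodian held the message, so reviewers see one item while the record of who had it is preserved.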

Litigation Holds

In the case of a litigation hold, an archive should ideally be able to preserve evidence for each specific legal investigation. Based on the scope of the preservation order (i.e., to include such items as all files for a certain number of users for a certain period), an archive should be able to tag these files in the archive so that they are locked down. (Of course other technologies and procedures need to be implemented in order to avoid spoliation of data outside the archive, since an archive never contains all the data in an environment.) Litigation holds need to override any purge policies and must be tamper-proof and auditable so that users or administrators cannot delete affected files. Users ideally should not even know which records are under a litigation hold.

If the corporation is involved in class action litigation, multi-jurisdictional or pattern litigation matter(s), and has multiple outside legal counsel conducting discovery review and production on its behalf, then the collection, preparation and litigation hold management can typically be done by corporate IT and legal in the archive.
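The interaction between a litigation hold and a purge policy can be sketched as below. The `Archive` and `ArchivedItem` classes, the matter ID and the audit log format are hypothetical. The sketch shows only the override logic and the audit trail; it does not make anything tamper-proof, which a real archive would have to enforce at the storage and access-control layers.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ArchivedItem:
    item_id: str
    custodian: str
    sent: date
    holds: set = field(default_factory=set)  # matter IDs holding this item


class Archive:
    def __init__(self):
        self.items = {}
        self.audit_log = []

    def add(self, item):
        self.items[item.item_id] = item

    def apply_hold(self, matter, custodians, start, end, actor):
        """Tag every item in scope of the preservation order."""
        tagged = 0
        for item in self.items.values():
            if item.custodian in custodians and start <= item.sent <= end:
                item.holds.add(matter)
                tagged += 1
        self.audit_log.append(("apply_hold", actor, matter, tagged))
        return tagged

    def purge_older_than(self, cutoff, actor):
        """Retention purge that skips anything under at least one hold."""
        purged = []
        for item_id, item in list(self.items.items()):
            if item.sent < cutoff and not item.holds:
                del self.items[item_id]
                purged.append(item_id)
        self.audit_log.append(("purge", actor, cutoff.isoformat(), len(purged)))
        return purged


archive = Archive()
archive.add(ArchivedItem("m1", "alice", date(2004, 3, 1)))
archive.add(ArchivedItem("m2", "bob", date(2004, 3, 1)))
archive.apply_hold("matter-001", {"alice"},
                   date(2004, 1, 1), date(2004, 12, 31), actor="legal")
purged = archive.purge_older_than(date(2005, 1, 1), actor="retention-job")
```

Both items are past the retention cutoff, but only `m2` is purged because `m1` is tagged for a matter. Keeping holds as a set of matter IDs lets one item be held for several matters at once.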

Searchable Metadata

Any archive should be able to track metadata on an item-by-item basis, with granularity down to the message and attachment levels. Metadata should include fields such as blind carbon copy recipients or distribution list membership at the time a message was sent, in order to ensure that a comprehensive view of Chain of Custody (CoC) can be pieced together. For companies running Exchange, this of course requires that envelope journaling be used, requiring Exchange 2000 Server SP3 or later.

When native file extraction occurs upon ingestion into the archive, it is critical to extract the file's properties at an in-depth level. For example, extracting from a Microsoft Excel spreadsheet the dates on which the file was created, modified, accessed, printed and edited allows the investigator to search and sort on these fields, to investigate relevance to other user files, and to understand the user's activity at a particular time.

Archive Review

Native File Review

Typically the archive is responsible for the full-text index, the metadata, and the associated native file. Indexing occurs in the normal course of business each day, and can also be done reactively after legacy data is imported into an archive in response to a discovery request.

An archive absolutely needs to maintain parent-child relationships between messages and their attachments, and ideally also includes support for conversational threading and other fuzzy search capabilities.
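One reason parent-child relationships matter is that a search hit on an attachment should bring back the whole message "family". The sketch below assumes a single level of nesting (attachments have no attachments of their own) and a simple substring search. `Item`, `family_of` and `search_with_families` are hypothetical names, and conversational threading is not covered.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    item_id: str
    text: str
    parent_id: Optional[str] = None  # None for a top-level message


def family_of(items, hit_id):
    """Return the parent message and all of its attachments, in stored order."""
    by_id = {i.item_id: i for i in items}
    hit = by_id[hit_id]
    root_id = hit.parent_id or hit.item_id
    return [i.item_id for i in items
            if i.item_id == root_id or i.parent_id == root_id]


def search_with_families(items, term):
    """Substring search that expands each hit to its full family, once."""
    seen, result = set(), []
    for item in items:
        if term.lower() in item.text.lower():
            for member in family_of(items, item.item_id):
                if member not in seen:
                    seen.add(member)
                    result.append(member)
    return result


items = [
    Item("msg-1", "See attached forecast"),
    Item("att-1", "Revenue forecast Q3", parent_id="msg-1"),
    Item("att-2", "Org chart", parent_id="msg-1"),
    Item("msg-2", "Lunch?"),
]
hits = search_with_families(items, "revenue")
```

Only `att-1` contains the search term, but the result also includes its parent `msg-1` and sibling `att-2`, so a reviewer sees the attachment in context.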

File Tags and Annotations

When it is necessary to utilize disparate teams of attorneys and their staff to review files from the archive, capturing metadata including file tags and annotations from different reviewers can be critical for accuracy and efficiency. This necessitates an archive solution that allows review and tagging of all collected and prepared files for relevancy, privilege, confidentiality, etc. for each specific matter. This can facilitate shared review of tagging decisions across similar issue matters, the elimination or carve-out of review of duplicates, and a concentrated review of privileged documents.
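Per-matter tagging with reviewer attribution might be modeled as in the following sketch. The `ReviewTags` class, the tag names and the matter IDs are assumptions for illustration only; a real review platform would also persist and audit these decisions.

```python
class ReviewTags:
    def __init__(self):
        # (matter, item_id) -> list of (reviewer, tag, note)
        self._decisions = {}

    def tag(self, matter, item_id, reviewer, tag, note=""):
        """Record one reviewer's tagging decision for one item in one matter."""
        key = (matter, item_id)
        self._decisions.setdefault(key, []).append((reviewer, tag, note))

    def tags_for(self, matter, item_id):
        """All tags applied to an item within a single matter."""
        return {t for _, t, _ in self._decisions.get((matter, item_id), [])}

    def items_tagged(self, matter, tag):
        """Items carrying a given tag in a matter, e.g. for privilege review."""
        return sorted(i for (m, i) in self._decisions
                      if m == matter and tag in self.tags_for(m, i))

    def prior_decisions(self, item_id):
        """Tags on the same item across all matters, for shared review."""
        return {m: self.tags_for(m, i)
                for (m, i) in self._decisions if i == item_id}


tags = ReviewTags()
tags.tag("matter-A", "msg-1", "reviewer1", "privileged", "attorney-client")
tags.tag("matter-A", "msg-2", "reviewer2", "relevant")
tags.tag("matter-B", "msg-1", "reviewer3", "privileged")

privileged_in_a = tags.items_tagged("matter-A", "privileged")
history = tags.prior_decisions("msg-1")
```

Keying decisions by matter keeps each matter's calls separate, while `prior_decisions` lets a reviewer on a similar matter see how the same item was tagged elsewhere.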

Hosted Review

Additionally, for large discovery operations, some archives provide hosted review capabilities in which instances of matter-specific workspaces can be made available. Hosted review provides for a quickly scalable, highly time-sensitive process, with all exception preparation, legal review and production decisions ideally tracked through built-in project management functionality in the archive.


Reporting

Any archival system needs some sort of reporting mechanism. This should include reporting on audit logs, search criteria, privilege logs, metadata and document information for files not produced to opposing counsel or government investigators/regulators, and so on. The following are some examples of reports and report functionality:

  • The ability to report on message headers and metadata
  • The ability to automatically generate reports on specific searches, displaying all search parameters along with results from the search
  • The ability to re-run saved searches once additional data has been added to the archive
  • The ability to create reports based on any archive workflow, as applicable to the archive in use
  • The ability to report on review performance, including number of files reviewed per hour and overall review progress (percentage complete)
  • Reports on specific messages and the level of review they have received, i.e., "1st review," "2nd review," "3rd review," etc.
  • Reports detailing which reviewing tasks were performed by which reviewers on which items
  • Reports containing cross-reference files for Bates numbers and file IDs
  • Reports displaying tagged, annotated, highlighted, or redacted documents
  • Reports should be exportable to tab-delimited, comma-delimited, Word, Excel, XML and ideally litigation support review database formats
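Two of the items above, review performance reporting and delimited export, can be sketched with Python's standard `csv` module. The event format (reviewer, item ID, hours spent), the function names and the column names are assumptions made for this example.

```python
import csv
import io


def review_progress(events, total_items):
    """Summarize (reviewer, item_id, hours_spent) events from a review log."""
    reviewed = {item_id for _, item_id, _ in events}
    hours, counts = {}, {}
    for reviewer, _, spent in events:
        hours[reviewer] = hours.get(reviewer, 0.0) + spent
        counts[reviewer] = counts.get(reviewer, 0) + 1
    rows = [
        {
            "reviewer": r,
            "items": counts[r],
            # Guard against a zero-hour total rather than divide by zero.
            "items_per_hour": round(counts[r] / hours[r], 1) if hours[r] else 0.0,
        }
        for r in sorted(counts)
    ]
    percent_complete = round(100.0 * len(reviewed) / total_items, 1)
    return rows, percent_complete


def export_rows(rows, delimiter=","):
    """Render report rows as delimited text (comma by default, tab on request)."""
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]),
                            delimiter=delimiter, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


events = [("ann", "m1", 0.5), ("ann", "m2", 0.5), ("raj", "m3", 0.25)]
rows, pct = review_progress(events, total_items=4)
tab_report = export_rows(rows, delimiter="\t")
```

With these sample events, three of four items have been reviewed, so `pct` is 75.0. Passing a different `delimiter` to `export_rows` switches between comma-delimited and tab-delimited output without changing the report logic.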


  1. Julie Marobella, Sr. Research Analyst, Records Management and Compliance Infrastructure, IDC, 4/21/2005.

Source: EDRM (edrm.net)