eDiscovery Collection: Cost Drivers

The cost of processing documents for electronic discovery can be tremendous. Determining which party should shoulder this cost has been a critical driver of several court decisions in which the cost of producing the electronic evidence was shifted from the producing party to the demanding party. Data that resides in antiquated systems, or that was created and stored in multiple media formats and file structures, is costly to recover and to protect for discovery. The issues considered in these situations have been driven by the sheer volume of this data and the technology required to recover it. It is therefore important to understand the elements that drive the cost of processing this data and any current electronic data.

The volume and composition of the dataset are driving elements in the cost of electronic discovery. As complex as identifying and collecting the data may be, the complexity of processing electronic data imposes unique requirements on how the data must be handled. Converting and indexing the data into a common, searchable, usable format can be a labor-intensive, specialized activity.

Supporting the conversion of the data are the software tools and service organizations that must be used to convert the data from numerous disparate formats to a common output that can be used for review. These tools run the full range from simple TIFF printers to enterprise-wide integrated products capable of processing vast collections of source files in short periods of time. The cost of these tools and services depends on the volume and rate of conversions to be performed and the deadline by which the data must be delivered.

Software tools and services also require infrastructure. The infrastructure may be as simple as a desktop PC, or it might include networked racks of servers and roomfuls of storage arrays. Support software, such as SQL databases and native software applications, also contributes to the cost of infrastructure. Therefore, the capital investment in either software tools or the infrastructure on which services are deployed drives cost.

This is clearly a situation where one size does not fit all, and the need to select the correct tools, services and the correct support environment to meet the size of the discovery becomes critical. Service providers may be able to minimize costs by spreading the cost of the above elements across multiple cases.

Specialized tools and environments require a qualified and trained staff. The nature of electronic discovery requires a higher skill level than does paper document capture. Scan operators need to be replaced by IT personnel who understand and know how to operate the tools and support the environments mandated by the technology involved.

Because of the wide variability in the capability of service providers, multiple pricing schemes are in use, and there is no consistent method of pricing. Some vendors still charge by the page, some by file count, and others by volume in gigabytes. For budgeting purposes, volume pricing is the most reliable, since you typically know how much data you are dealing with for your case. However, you should be aware that in some instances per-page or per-file pricing can be cheaper.
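The comparison between pricing models can be sketched with simple arithmetic. The Python example below is a minimal illustration only: the dataset size and all three rates are hypothetical values chosen for the example, not figures from any vendor, and amounts are kept in whole cents to avoid rounding surprises.

```python
def estimate_costs(pages, files, gigabytes, cents_per_page, cents_per_file, cents_per_gb):
    """Return the estimated processing cost, in cents, under each pricing model."""
    return {
        "per_page": pages * cents_per_page,
        "per_file": files * cents_per_file,
        "per_gb": gigabytes * cents_per_gb,
    }

# Hypothetical dataset and rates, for illustration only.
costs = estimate_costs(
    pages=500_000, files=100_000, gigabytes=50,
    cents_per_page=2, cents_per_file=12, cents_per_gb=25_000,
)
cheapest = min(costs, key=costs.get)

for model, cents in costs.items():
    print(f"{model}: ${cents / 100:,.2f}")
print("cheapest model for this dataset:", cheapest)
```

With these assumed numbers the per-page model happens to be cheapest, even though the per-gigabyte figure is the only one that can be budgeted from the data volume alone. Different page-per-gigabyte ratios would change the result.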

It is important to make sure the team understands how decisions made in the preservation, identification, and collection phases of the electronic discovery life-cycle affect the downstream services required, such as processing and reviewing the documents. Since data volume drives the cost, it is a good strategy, when possible, to cull and filter your data before it gets to the extraction and conversion stage. If the way the data is collected and filtered could impact the integrity of the documents, make sure that method is negotiated and agreed upon. Some vendors charge a media processing fee and then charge by the GB to keyword search the data.

Location of Data

The location of the data is an important factor in determining the strategy and costs affecting a collection methodology. Items that are not in daily use, such as CD/DVDs, backup tapes, and removable hard drives, can often be sent to the law firm or the vendor for processing. However, for security, legal and other reasons there are some things that simply cannot be sent. In many cases computer forensic vendors are requested to go into an organization and create forensic images of the target computer's hard drive so that the users either do not know that their computer was captured or so that their business day is not affected by the capture process. Backup tapes create special issues, because they can be placed anywhere, including offsite locations. Accurate information about the organization's disaster recovery plan, and whether any deviations or exceptions to this plan have been made, is required. In a large organization where a single disaster recovery system can take hundreds of backup tapes, it may be difficult to locate and process all of the backup tapes.

Other portable storage devices create a similar need to quickly identify and locate the media and determine the best approach to restore relevant data. Additionally, offsite data housed by third parties or at the home of a custodian presents unique challenges. Therefore, identifying the location and type of information to be collected is one of the most important steps in controlling costs in the electronic discovery process.

Keyword Searching

Expert Assistance

Keyword searching is truly an art form if done properly. It is recommended that an electronic evidence expert (with court experience) work very closely with the legal team throughout this process. The process is designed to help the client design the most efficient search methodology in order to minimize the producible records (email, attachments, and user files) in the native format.

Minimizing the documents significantly reduces the cost of generating a review database (Concordance, Summation, online repository), which is generally priced on the gigabyte input as well as the per-page output. Also, by focusing on case-specific search requirements, documents not relevant to the discovery request will not be included in the privilege review process, reducing the review time and minimizing the documentation of nonresponsive records.

The use of an expert in this process allows the producing party to have a third-party resource able to create affidavits and protocols justifying the methodology.

Statistical Sampling

Working with an expert also supports the use of statistical sampling. An example would be where one of the requested keywords is "cat", but "cat" results in 1,000,000 documents that upon closer inspection appear to have no relevancy. The legal team can now try different variations of keyword strings, such as (cat W/10 dog), which would narrow the results contextually to a more reasonable 10,000 documents.
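One way to put a number on "appear to have no relevancy" is to review a random sample of the hits rather than all of them. The sketch below is an illustration under stated assumptions: the sample size of 400, the count of 8 relevant documents, and the function names are all hypothetical, and the margin of error uses the simple normal approximation, which is only rough when the relevant fraction is very small.

```python
import math
import random

def draw_sample(doc_ids, sample_size, seed=0):
    """Pick a simple random sample of document IDs for manual review."""
    rng = random.Random(seed)
    return rng.sample(doc_ids, sample_size)

def estimate_relevance(relevant_in_sample, sample_size, z=1.96):
    """Estimate the relevant fraction and its margin of error (normal approximation)."""
    p = relevant_in_sample / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, margin

hits = list(range(1_000_000))    # stand-in IDs for the documents that hit on "cat"
sample = draw_sample(hits, 400)  # reviewers read only these 400 documents

# Suppose reviewers mark 8 of the 400 sampled documents as relevant (hypothetical).
p, margin = estimate_relevance(8, 400)
print(f"estimated relevant fraction: {p:.1%} +/- {margin:.1%}")
```

A low estimate of this kind gives the legal team a documented basis for proposing a narrower term, such as the proximity search in the example above.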

Vetting Keyword Lists

The general best practices process to vet a keyword list is:

  1. Client provides first iteration of keyword requests ("Request List") in Microsoft Word format;
  2. Expert returns red-line version of Request List and discusses the logic requirements with the Client;
  3. Client returns comments in red-line format;
  4. Expert runs test and provides statistics on results;
  5. Client signs off (via email) on final Request List;
  6. Processing begins.

This process can include the use of contextual searches as well, in order to reduce the reviewable population.

Boolean Searching

There are many different types of search engines, and they are used to varying degrees. Boolean searching is the most common way of querying data. Boolean logic refers to the logical relationship among search terms, and is named for the mathematician George Boole.

The most typical Boolean search operators are:

  1. AND -- cat AND dog -- the document must have both "cat" and "dog"
  2. OR -- cat OR dog -- the document can have either word or both words
  3. " " -- "catalog" -- the exact word "catalog" must be in the document. Another example is "cat a log" -- the exact term "cat a log" must be in the document, including the spacing.
  4. NOT -- cat NOT dog -- the document must have "cat", but if the word "dog" is in the document, it will not be a responsive document
  5. ( ) -- (cat AND dog) NOT (bird OR mouse) -- grouping of words -- the document must have both words "cat" and "dog", but if the words "bird" or "mouse" are in the document, it will not be responsive.
  6. Wildcard -- * -- cat* -- the document must have one or more words that start with "cat", but can have any ending -- e.g., "cat", "catalog", "cats", "catastrophe", etc.
  7. W/ -- proximity search -- the combination of words need to be within a specific number of words of each other -- e.g., (cat W/5 dog) -- cat needs to be within 5 words of dog, either before or after -- "the cat ate the dog" or "the dog crept up on the cat".
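
The operators above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine works: the helper names (`tokenize`, `has`, `within`) are invented for the example, exact-phrase matching is left out, and "within n words" is assumed to mean a position difference of at most n, which real engines may count differently.

```python
import re
from fnmatch import fnmatchcase

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def has(tokens, term):
    """True if any token matches the term; '*' acts as a wildcard."""
    return any(fnmatchcase(tok, term) for tok in tokens)

def within(tokens, a, b, n):
    """Proximity (a W/n b): some a and some b are at most n positions apart, in either order."""
    pos_a = [i for i, tok in enumerate(tokens) if fnmatchcase(tok, a)]
    pos_b = [i for i, tok in enumerate(tokens) if fnmatchcase(tok, b)]
    return any(abs(i - j) <= n for i in pos_a for j in pos_b)

doc = tokenize("The dog crept up on the cat while the catalog sat unread.")

print(has(doc, "cat") and has(doc, "dog"))        # cat AND dog
print(has(doc, "bird") or has(doc, "dog"))        # bird OR dog
print(has(doc, "cat") and not has(doc, "bird"))   # cat NOT bird
print(has(doc, "cat*"))                           # wildcard: cat, catalog, ...
print(within(doc, "cat", "dog", 5))               # cat W/5 dog
```

Grouping with parentheses maps directly onto Python's own parentheses, for example `(has(doc, "cat") and has(doc, "dog")) and not (has(doc, "bird") or has(doc, "mouse"))`.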

Other Search Options

Other search options include:


Stemming compares the root forms of search words and returns documents that contain words that derive from a common stem. For example, if a STEM search was applied to the word "instructional", documents containing the words "instruct", "instructs", and "instruction" would be returned. Stemming is significantly different from using a wildcard (*) search, since the wildcard search is based on a group of characters, whereas a STEM search relies on linguistic analysis.
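
The difference between a stem search and a wildcard search can be seen in a toy example. The suffix list and the `naive_stem` function below are deliberate simplifications invented for illustration; production engines apply far more careful linguistic rules.

```python
# Toy suffix list, checked in order; an assumption made for this illustration only.
SUFFIXES = ("ional", "ions", "ion", "ing", "ed", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping a stem of at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_search(term, words):
    """Return the words that share a stem with the search term."""
    target = naive_stem(term)
    return [w for w in words if naive_stem(w) == target]

def wildcard_search(prefix, words):
    """Return the words that start with the given characters (prefix*)."""
    return [w for w in words if w.lower().startswith(prefix.lower())]

words = ["instruct", "instructs", "instruction", "cat", "cats", "catalog", "catastrophe"]

print(stem_search("instructional", words))  # words derived from the stem "instruct"
print(stem_search("cats", words))           # "cat" and "cats" only
print(wildcard_search("cat", words))        # also picks up "catalog" and "catastrophe"
```

The wildcard search sweeps in "catalog" and "catastrophe" because it only compares characters, while the stem search keeps to words that share a root with the search term.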


Fuzzy searching allows users to find documents even if the word being searched is misspelled. A fuzzy search is done by means of fuzzy matching software that returns a list of results based on likely relevance, even though the search string doesn't exactly match. Fuzzy searches generally can be fine-tuned and ranked depending on the search engine being used. Fuzzy searches will be familiar to those who have worked with Optical Character Recognition ("OCR") text from paper documents.
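
A rough sketch of fuzzy matching, using the similarity ratio from Python's standard `difflib` module. The 0.8 threshold, the word list, and the `fuzzy_search` name are assumptions chosen for the example; commercial engines use their own matching and ranking methods.

```python
from difflib import SequenceMatcher

def fuzzy_search(term, words, threshold=0.8):
    """Return (word, score) pairs whose similarity to term meets the threshold, best first."""
    scored = [(w, SequenceMatcher(None, term.lower(), w.lower()).ratio()) for w in words]
    matches = [(w, score) for w, score in scored if score >= threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)

# "cata1og" mimics a typical OCR error, with the digit 1 read in place of the letter l.
ocr_words = ["catalog", "cata1og", "catalogue", "category", "dog"]

for word, score in fuzzy_search("catalog", ocr_words):
    print(f"{word}: {score:.2f}")
```

Raising the threshold returns fewer, closer matches, while lowering it catches more OCR errors at the price of more false hits. This is the kind of fine-tuning described above.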

Noise Words

Depending on the search engine being used, Noise Words may become an issue. Noise Words are certain common words that are ignored by indexing or search engines. The most typical Noise Words are "the", "and", "of", "his", "my", "when", "there", "is", "are", "or", and "it". The search strategy must take the Noise Words into account before the term list is finalized, especially if opposing counsel is involved in the process. It may be very difficult to renegotiate keywords based on a perceived technological deficiency. Each vendor and software package has different methodologies for dealing with Noise Words. It is important to have the discussion about Noise Words before creating an index of the documents' words.
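
A proposed term list can be checked against an engine's Noise Words before it is finalized. The sketch below uses the typical Noise Words listed above. The proposed terms and the `flag_noise_terms` helper are hypothetical, and a real engine's list should be obtained from the vendor.

```python
# The typical Noise Words named in the text; a given engine's list may differ.
NOISE_WORDS = {"the", "and", "of", "his", "my", "when", "there", "is", "are", "or", "it"}

def flag_noise_terms(search_terms, noise_words=NOISE_WORDS):
    """For each proposed term, list the words in it that the engine would ignore."""
    flagged = {}
    for term in search_terms:
        ignored = [w for w in term.lower().split() if w in noise_words]
        if ignored:
            flagged[term] = ignored
    return flagged

# Hypothetical terms from a draft Request List.
proposed = ["terms of service", "it is", "merger agreement"]
print(flag_noise_terms(proposed))
```

A term made up entirely of Noise Words, such as "it is" here, cannot be searched at all on an engine that drops those words, which is worth knowing before the list is agreed with the other side.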

Asian Character Sets

Asian (or Double-Byte) character searches are another challenging piece of the keyword search puzzle.

In general, it is imperative that the environment be set up properly to maximize the searching of international Unicode information. Asian and foreign language projects are generally handled by specialists, due to the complexity of the process. Most vendors can index MS Outlook mail as well as the typical user file types (e.g., Office, HTML, PDF). Once the data has been indexed by the appropriate search engine, the provided keywords can be applied.

The most common litigation support output method for Asian language cases is to create a load file with the associated TIFF/PDF and meta-data. Depending on the data set, vendors may be able to provide the OCR text of the English characters and maybe the Asian characters.

If necessary, Summary Translation services can be provided. This is the process whereby native speakers review the document and code information, including a summary of the document. This is generally a fairly expensive process, but works well when the language can't be indexed or searched via the normal lit support review tools.

As for keywords, the search terms should be provided, in a Microsoft Word document, in exactly the form in which they are to be searched. (Word supports Unicode characters that we can export to the search tool.) One of the challenges in dealing with Asian languages is that spaces are often not used between words, so words run together. That makes complex searches such as "exact phrase" or "cat AND dog" very difficult. It is a general recommendation, depending on the search criteria, to create single keywords that we can put wildcards around, such as "*johnny*", so that iterations like "johnnysmith" or "johnnysaidhelikestosearchdata" will be found. Search experts will work with the legal team and assist in refining the words once the data is indexed.
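
Wrapping a keyword in wildcards amounts to a substring match, which works whether or not the text contains spaces. The sketch below is illustrative only: the sample documents and keywords are invented, and the Unicode normalization step (NFKC, which folds full-width and half-width character variants together) is an added assumption about how an environment might be set up, not a description of any vendor's tool.

```python
import unicodedata

def normalize(text):
    """Fold width variants (NFKC) and letter case so equivalent strings compare equal."""
    return unicodedata.normalize("NFKC", text).casefold()

def contains_keyword(text, keyword):
    """Substring match: the effect of wrapping a keyword in wildcards (*keyword*)."""
    return normalize(keyword) in normalize(text)

# Invented sample documents; doc2 is Japanese text written without spaces.
docs = {
    "doc1": "johnnysaidhelikestosearchdata",
    "doc2": "田中さんは契約書を送りました",
    "doc3": "nothing relevant here",
}
keywords = ["johnny", "契約書"]

hits = {kw: [doc_id for doc_id, text in docs.items() if contains_keyword(text, kw)]
        for kw in keywords}
print(hits)
```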

Old Technology/Legacy Systems and Databases

Another critical factor affecting the cost of a collection effort is the currency of the technology involved. Many cases involve legacy databases that use outdated or obsolete technology, including outdated operating systems and hardware. When this is the case, unique, customized solutions are often required to collect potentially relevant data. This normally requires extensive reliance on the company's IT staff that would typically have access to the legacy operating systems and hardware.

Even when legacy data can be read and used, a database by definition contains a large quantity of data that is typically unformatted and becomes useful only when it is put into a report. Databases contain entries and complex table structures that appear nonsensical if viewed as raw data. Usually what's relevant are the reports and queries that are populated by the information in the databases. Many database systems do not permit the creation of customized reports containing the information in the form that is deemed potentially relevant in the litigation. Therefore, a third party is often needed to write customized software to extract data from various locations in the database and create a formatted document that can be reviewed.
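
The kind of custom extraction described above can be sketched with Python's built-in `sqlite3` module. Everything in the example is hypothetical: the two tables, their columns, and the sample rows stand in for a real legacy schema, which would normally be far larger and reached through whatever interface the legacy system exposes.

```python
import sqlite3

# Build a tiny stand-in database in memory (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         placed TEXT, total_cents INTEGER);
    INSERT INTO customers VALUES (1, 'Acme Corp'), (2, 'Globex');
    INSERT INTO orders VALUES (10, 1, '2003-01-15', 125000),
                              (11, 2, '2003-02-01', 98050);
""")

# Raw rows hold only numeric keys; the join pulls related values together.
rows = conn.execute("""
    SELECT o.id, c.name, o.placed, o.total_cents
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    ORDER BY o.id
""").fetchall()
conn.close()

# Format each joined row as one reviewable line of a report.
report = [f"Order {oid} | {name} | {placed} | ${cents / 100:,.2f}"
          for oid, name, placed, cents in rows]
print("\n".join(report))
```

The raw `orders` table shows only a customer number. The join and formatting steps are what turn the entries into something a reviewer can read.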

Because of the costs involved in a customized approach to databases, an organization may be tempted to make portions of a database available to the opposing party so that it will be required to pay the costs of securing the data that it feels may be potentially relevant in the litigation. Great caution should be taken in agreeing to turn over unreviewed databases to opposing counsel. Whatever cost may be saved with this approach could be overwhelmed by the ultimate outcome of the case.

When dealing with older technology, legacy systems or databases, it is important to identify the complexity of the collection and production of that information as early as possible. It may be necessary to secure outside experts who can provide testimony regarding the complexity, timeliness or cost of securing legacy information. If the parties to the case cannot agree on a reasonable approach to this problem, a court may need to render judgment on the appropriateness and limits of collecting and producing legacy data. The courts are split over granting access to the producing party's databases for the requesting party to run searches. See, e.g., In re Honeywell Int'l, Inc. Securities Litigation, 2003 WL 22722961 (S.D.N.Y.) (for providing access); In re Ford Motor Company, 345 F.3d 1315 (11th Cir. 2003) (against providing access).

Source: EDRM (edrm.net)