eDiscovery Processing: Metrics

One of the biggest challenges in dealing with electronic data is estimating the size of a project when all that is known is the total number of gigabytes (GB) to process. Because the overall volume has a significant impact on the project as a whole, it is important to understand the factors that drive that estimate.

Means of Measuring

Pages

In many cases, the overall review time and cost for a project can be determined from the total number of pages that will be reviewed and eventually produced. The more you know about the collection, the better this estimate becomes. If you can break the total volume down into email data, application data, and non-printable data, you will get a more accurate estimate than you would from volume alone.
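The breakdown described above can be sketched as a small calculation. The constants below are the median values from the Industry Benchmark Survey table in this article; the function name `estimate_pages` and the `"email"` / `"app"` keys are hypothetical names chosen for illustration, and the approach (files per GB multiplied by images per file) is one reasonable way to combine the benchmarks, not a method the survey itself prescribes.

```python
# Median benchmarks copied from the Industry Benchmark Survey table.
MEDIAN_FILES_PER_GB = {"email": 22_572, "app": 15_791}
MEDIAN_IMAGES_PER_FILE = {"email": 4, "app": 10}
MEDIAN_IMAGES_PER_GB = 47_213  # used for the volume-only comparison


def estimate_pages(gb_by_type):
    """Return (files, pages) estimated from GB of printable data per type.

    gb_by_type maps "email" / "app" to gigabytes. Non-printable data is
    assumed to have been excluded already, since it is not imaged.
    """
    files = 0
    pages = 0
    for data_type, gb in gb_by_type.items():
        type_files = gb * MEDIAN_FILES_PER_GB[data_type]
        files += type_files
        pages += type_files * MEDIAN_IMAGES_PER_FILE[data_type]
    return round(files), round(pages)


def estimate_pages_from_volume(total_gb):
    """Volume-only estimate: total GB times the median images per GB."""
    return round(total_gb * MEDIAN_IMAGES_PER_GB)


files, pages = estimate_pages({"email": 6, "app": 4})
volume_only = estimate_pages_from_volume(10)
print(f"{files:,} files, {pages:,} pages (by type)")
print(f"{volume_only:,} pages (volume only)")
```

For a hypothetical 10 GB collection split into 6 GB of email and 4 GB of application files, this gives 198,596 files and 1,173,368 pages by type, against 472,130 pages from volume alone. The gap is a reminder that medians of separate ratios do not have to multiply out consistently, so the two figures are better read as a range than as competing answers.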

Number of Documents

Another important driver of document review effort is the number of documents to be reviewed, so estimating it is valuable. Although there are quick ways to count the documents in a collection, it is harder to quickly identify how many documents will be removed by the culling process.

Culling Rate

The amount of deduplication can vary greatly based on the nature of the data (backups, live data, or a combination), the scope of the deduplication (within or across custodians), and the custodians' retention habits.

Searching and filtering is another important factor to consider when estimating the overall volume that will be delivered for review. Depending on the number of search terms and the nature of the documents, the results can vary greatly.

Nonprintable Files

Nonprintable files are documents that, in general, will not be delivered or reviewed. It is therefore important to exclude them from the document, GB, and page estimates in order to yield more accurate results.
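To see how these reductions combine, the following is a minimal sketch that applies the median culling rates from the Industry Benchmark Survey table one after another. It rests on two assumptions that the source does not state: that each percentage is the share of documents removed at that stage, and that the stages act independently and in the order shown. The names `MEDIAN_CULL_RATES` and `estimate_review_set` are hypothetical.

```python
# Median culling rates from the Industry Benchmark Survey table.
# Assumption: each rate is the share of documents removed at that stage.
MEDIAN_CULL_RATES = (
    ("non-printable files", 0.05),
    ("deduplication", 0.21),
    ("searching/filtering", 0.61),
)


def estimate_review_set(total_documents, cull_rates=MEDIAN_CULL_RATES):
    """Apply each culling rate in turn.

    Returns (remaining, removed_by_stage), where removed_by_stage maps
    each stage name to the estimated number of documents it removes.
    """
    remaining = float(total_documents)
    removed_by_stage = {}
    for stage, rate in cull_rates:
        removed = remaining * rate
        removed_by_stage[stage] = removed
        remaining -= removed
    return remaining, removed_by_stage


remaining, removed = estimate_review_set(100_000)
print(f"Estimated documents for review: {remaining:,.0f}")
for stage, count in removed.items():
    print(f"  removed by {stage}: {count:,.0f}")
```

Under these assumptions, a collection of 100,000 documents shrinks to roughly 29,000 for review. Swapping in the high or low rates from the table gives a quick sense of how wide the plausible range is for a given matter.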

Industry Benchmark Survey

The table below lists industry benchmark figures that can be used as guidance when estimating a document collection:

| Benchmark | High | Median | Low |
|---|---|---|---|
| Images[1] per GB | 78,671 | 47,213 | 18,534 |
| Images per file email | 11 | 4 | 2 |
| Images per file app files | 63 | 10 | 3 |
| Files per GB email | 36,530 | 22,572 | 9,934 |
| Files per GB app files | 20,305 | 15,791 | 7,553 |
| GB per custodian email | 5 | 2 | 1 |
| GB per custodian app files | 4 | 1 | 0 |
| Culling Rate Percentages | | | |
| Deduplication | 51% | 21% | 6% |
| Searching/Filtering | 64% | 61% | 23% |
| Non-printable files | 22% | 5% | 2% |
| Processing Speeds | | | |
| Process time per GB native | 117 | 33 | 11 |
| Process time per GB image | 35 | 32 | 23 |
| Process time to first deliverable | 53 | 35 | 21 |
| Process time by file type | 4 | 3 | 2 |
| Process time by file type | 6 | 4 | 3 |
| Process time by file type | 2 | 3 | 2 |
| Quality | | | |
| First pass quality yield %[2] | 57% | 78% | 73% |

Footnotes

  1. Images are counted one per page, so that a 4-page multi-page TIFF would count as 4 images.
  2. The percentage of data that runs through without intervention or exception handling.

 

Paper-to-Electronic Estimate Conversion Table

| Boxes of Documents | Approximate Total Pages | Approximate Size (MB, GB, TB) |
|---|---|---|
| 1 | 2,500 | 50 MB |
| 10 | 25,000 | 500 MB |
| 20 | 50,000 | 1 GB |
| 100 | 250,000 | 5 GB |
| 200 | 500,000 | 10 GB |
| 300 | 750,000 | 15 GB |
| 400 | 1,000,000 | 20 GB |
| 500 | 1,250,000 | 25 GB |
| 1,000 | 2,500,000 | 50 GB |
| 2,000 | 5,000,000 | 100 GB |
| 5,000 | 12,500,000 | 250 GB |
| 10,000 | 25,000,000 | 500 GB |
| 20,000 | 50,000,000 | 1 TB |
| 40,000 | 100,000,000 | 2 TB |
| 60,000 | 150,000,000 | 3 TB |
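Every row of the conversion table follows the same two ratios: about 2,500 pages and 50 megabytes per box, with 1,000 megabytes to the gigabyte and 1,000 gigabytes to the terabyte. A minimal sketch of that rule, for box counts that fall between the listed rows, might look like the following. The function names `boxes_to_estimate` and `format_size` are hypothetical, and the ratios are simply read off the table rather than taken from any separate specification.

```python
# Ratios implied by the conversion table: 1 box is about 2,500 pages and 50 MB.
PAGES_PER_BOX = 2_500
MB_PER_BOX = 50


def format_size(megabytes):
    """Format a size in MB using the table's decimal units (1 GB = 1,000 MB)."""
    if megabytes >= 1_000_000:
        return f"{megabytes / 1_000_000:g} TB"
    if megabytes >= 1_000:
        return f"{megabytes / 1_000:g} GB"
    return f"{megabytes:g} MB"


def boxes_to_estimate(boxes):
    """Return (approximate pages, approximate electronic size) for a box count."""
    pages = boxes * PAGES_PER_BOX
    megabytes = boxes * MB_PER_BOX
    return pages, format_size(megabytes)


for boxes in (1, 20, 750, 60_000):
    pages, size = boxes_to_estimate(boxes)
    print(f"{boxes:,} boxes: about {pages:,} pages, about {size}")
```

The loop includes 750 boxes, which is not a row in the table, to show the interpolation: about 1,875,000 pages and 37.5 GB under the same ratios.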

Source: EDRM (edrm.net)