What Does "Big Data" Mean in Terms of eDiscovery?

"Big data" is a term that is thrown around frequently, but what does it really mean? Moreover, how does it impact e-discovery?

Big data currently lacks a standard definition, but Oracle describes it as an aggregation of data from traditional enterprise sources (structured data), machine-generated and sensor sources, and social media (social data). Much of today's data is consumer-generated and logged by major corporations. Every time you browse a website, make a purchase or post on a social media site, your information is being tracked. Large organizations use big data applications to generate intelligence from this voluminous and non-intuitive information to better serve their customers and market their goods.

Consider the data that Target collects. It amasses data on your browsing history, age, gender, number of children and more. Through trend analysis and other business intelligence, Target uses that data to predict future purchases and perform targeted marketing (e.g., sending you timely, relevant coupons for the products you are likely to buy).

When Litigation Looms

Major corporations are obtaining as much data as possible to better target consumers, but when litigation hits, that same data can have disastrous effects. Once litigation is reasonably foreseeable, for example, a litigation hold must be put in place. With mountains of data accumulating daily, it becomes very difficult to segregate data that may be subject to a litigation hold, and therefore cannot be deleted, from data that can remain part of the routine deletion process. Great care must be taken to establish proper records retention policies and to build an IT infrastructure that can easily isolate and preserve segments of data.
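The hold-versus-routine-deletion decision described above can be sketched in code. This is a minimal, hypothetical illustration: the record fields, custodian list and three-year retention period are all assumptions, not a real retention policy.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical record metadata; real systems track far more fields.
@dataclass
class Record:
    path: str
    custodian: str
    last_used: date

# Custodians named in a (hypothetical) litigation hold notice.
HOLD_CUSTODIANS = {"jsmith", "mdoe"}
RETENTION_DAYS = 365 * 3  # assumed routine retention period

def disposition(rec: Record, today: date) -> str:
    """Classify a record as 'hold', 'retain' or 'delete'."""
    if rec.custodian in HOLD_CUSTODIANS:
        return "hold"  # subject to litigation hold: must not be deleted
    age = (today - rec.last_used).days
    return "delete" if age > RETENTION_DAYS else "retain"
```

The point of the sketch is the order of the checks: the hold test runs before any retention math, so routine deletion can never reach held data.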


The next major hurdle when housing so much data is actually collecting the data for a case. The variety of data sources keeps growing, making the data difficult to collect and normalize into the formats typically handled by e-discovery tools and practices. In addition to the numerous formats, big data comes from many sources and resides in many locations across the network, making collection time-consuming and expensive. Furthermore, with so much data, relevant information can easily be corrupted or missed. It is imperative to keep collection logs documenting where all the data was collected, when, and by whom.


When you get to the processing stage, you are burdened with potentially processing hundreds of gigabytes of data or more. Again, with so many file formats, it is crucial to have technology that can handle all the file types and scale to the volume. Searching large databases can be extremely slow if the indexing engine cannot keep up with the data volume. Once the data is processed, it must be hosted in a tool that can not only handle the volume but also run complex searches across a large collection of varying file types. Relativity from kCura is a well-known application that can host masses of structured and unstructured data and offers the scalability to handle large data sets and run complex searches through the database. Whatever tool you choose, budget extra time to run searches over volumes this large.
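The reason indexing matters at these volumes can be shown with a toy example. A minimal inverted index, sketched below, maps each term to the documents containing it, so a search inspects only the matching entries instead of scanning every document's text. This is an illustration of the concept only; production engines like those inside hosted review platforms are vastly more sophisticated.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, *terms):
    """Return the documents containing all query terms (AND search)."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()
```

With the index built once up front, each query is a handful of set lookups and intersections; without it, every search is a full scan of the entire collection.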


Finally, review is extremely time-consuming, especially if a human eye needs to review each document. Technology similar to that used to analyze big data can also drastically reduce the time and money needed for review. This is where content analytics come into play. Analytics are becoming an essential tool, particularly for overcoming the challenges surrounding unstructured and multimedia data. In essence, computers are now necessary to battle the data explosion caused by other computers. Such analytics include near-duplicate grouping, concept searching and assisted review. Each technique groups data together so it can be reviewed more easily, and can even eliminate chunks of non-responsive data from human review entirely.
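To make the grouping idea concrete, here is a minimal near-duplicate grouping sketch using word shingles and Jaccard similarity, one common textbook approach. The 0.7 threshold and greedy grouping strategy are illustrative assumptions; commercial analytics engines use far more refined methods.

```python
def shingles(text, k=3):
    """Break text into overlapping k-word sequences."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b):
    """Overlap between two shingle sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a | b else 1.0

def group_near_duplicates(docs, threshold=0.7):
    """Greedy grouping: each document joins the first group whose
    representative it resembles above the threshold, else starts a new group."""
    groups = []  # list of (representative_shingles, [doc_ids])
    for doc_id, text in docs.items():
        sh = shingles(text)
        for rep, members in groups:
            if jaccard(sh, rep) >= threshold:
                members.append(doc_id)
                break
        else:
            groups.append((sh, [doc_id]))
    return [members for _, members in groups]
```

Once near-duplicates are clustered, a reviewer can read one representative per group and apply the coding decision across the cluster, which is where the review-time savings come from.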


While collecting large volumes of data helps large corporations better target their customers, it brings real pitfalls when litigation arises. Big data can mean big expense if measures are not taken to write proper policies and procedures around this data and to store it in locations from which it is easily retrievable. While inexpensive storage makes it easy to simply keep all this data for prolonged periods, deleting as much data as possible (once it serves no business purpose and has no legal implications) will certainly save headaches in the long run. If litigation ensues and the data does have to be collected and processed, properly logging all collection efforts will be extremely helpful if data is accidentally corrupted or is not in a format recognized by e-discovery tools. Finally, analytical tools can drastically reduce the volume of data that must be reviewed and streamline the review process by clustering like documents by concept. The best bet for saving costs is to avoid being sued or investigated, but since that is virtually impossible to guarantee, make sure your "big data" does not cause your next "big nightmare."