Data processing architecture for e-discovery

Major commercial e-discovery vendors routinely process terabytes of data every year. Most of these vendors run automated data processing systems that churn out millions of TIFF images and gigabytes of load files on a routine basis. This article details the technical design of a typical data processing system for e-discovery and the business processes associated with a large e-discovery project.

Business process overview

The input to such a data processing system is evidence collected on site, which is for the most part computer files on various media. This data is collected by forensics field personnel and shipped to the vendor's location on media such as hard disks, CDs, and DVDs. E-discovery project managers employed by the vendor are responsible for due diligence on the data, maintaining chain of custody, setting client expectations, and delivering the requested e-discovery productions on time.

Typical demographics of e-discovery data

In most civil proceedings, the relevant material in the context of e-discovery is any form of user-generated content. Examples are email communications, documents developed using Microsoft Office or similar products, technical designs, programming code, text files, log files, images, and videos. For all practical purposes email is the most important piece of evidence, but instant messages and text messages are becoming an increasing area of interest. There are various popular email servers, of which Microsoft Exchange Server and IBM Lotus Domino are the most widespread. Users, however, are adapting to the fact that these servers can be harvested by a centralized process, and individuals are increasingly turning to web-based email providers as a way to avoid corporate systems. Consequently, the majority of e-discovery data which vendors process is in either .pst (Microsoft Outlook) or .nsf (Lotus Notes) files.
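
As a rough illustration of reading such a container, the sketch below walks the folder tree of a .pst file and prints each message's subject. It assumes the open-source libpff Python bindings (pypff); the exact attribute names can differ between versions, so treat this as a sketch rather than a reference for that library.

    import pypff  # Python bindings for libpff; assumed available

    def walk_folder(folder, path=""):
        """Recursively visit every message in a PST folder tree."""
        for i in range(folder.number_of_sub_messages):
            message = folder.get_sub_message(i)
            print(path + "/" + (message.subject or "(no subject)"))
        for i in range(folder.number_of_sub_folders):
            sub = folder.get_sub_folder(i)
            walk_folder(sub, path + "/" + (sub.name or ""))

    pst = pypff.file()
    pst.open("custodian_smith.pst")  # hypothetical input file
    walk_folder(pst.get_root_folder())
    pst.close()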

A close look at these email storage formats reveals that they are effectively proprietary file systems in their own right, with emails acting as folders and attachments acting as documents or folders, depending on whether they are archives. Emails in turn are stored in different folders such as the inbox and outbox. Consequently, two or more identical files can be present in different email folders. The process of eliminating references to such duplicate files is called "deduplication". Deduplication is typically requested by clients to avoid the time wasted by needlessly reviewing the same documents multiple times.

Deduplication, however, should only be performed per custodian, as the physical location of an email within a user's folder structure might itself be relevant. For example, if one person in a conspiracy sent an email to four others with the subject "Meet At Safe House", and one individual placed that message in a folder called "Stuff Needed For Heist", the location of that message is as much a piece of evidence as the email itself.
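
A minimal sketch of per-custodian deduplication follows: each message is keyed by a content hash together with its custodian, so duplicates are suppressed within one custodian's data but never across custodians. The field names and normalization are illustrative assumptions, not any vendor's actual scheme.

    import hashlib

    def dedup_key(custodian, sender, recipients, subject, body):
        """Per-custodian deduplication key: identical messages collapse
        only when they belong to the same custodian."""
        canonical = "\x1f".join([sender, ",".join(sorted(recipients)), subject, body])
        digest = hashlib.sha1(canonical.encode("utf-8")).hexdigest()
        return (custodian, digest)

    seen = set()

    def is_duplicate(message):
        """Return True if this custodian's data already contains this message."""
        key = dedup_key(message["custodian"], message["sender"],
                        message["recipients"], message["subject"], message["body"])
        if key in seen:
            return True
        seen.add(key)
        return False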

Data processing

The data received by the vendor is copied to the local file system of the data center. Most of the time, data is received on different dates, and each wave is treated as a 'batch'. Data can also be grouped by 'custodian', a legal term for the parties from whom the material was collected. Once the initial bookkeeping for the received data is complete, it is the job of the data processing system to extract useful text and metadata from all the documents contained in the document set as fast as possible.
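
A toy sketch of this intake step might look like the following: it walks one batch directory and records basic file metadata plus a content hash, leaving real text extraction (which varies by file type) out of scope. The directory layout and record fields are assumptions made for illustration.

    import hashlib
    import os
    import time

    def process_batch(batch_dir, custodian):
        """Walk one batch of collected files and emit a metadata record per file."""
        records = []
        for root, _dirs, files in os.walk(batch_dir):
            for name in files:
                path = os.path.join(root, name)
                stat = os.stat(path)
                with open(path, "rb") as f:
                    digest = hashlib.sha1(f.read()).hexdigest()
                records.append({
                    "custodian": custodian,
                    "path": path,
                    "size_bytes": stat.st_size,
                    "modified": time.ctime(stat.st_mtime),
                    "sha1": digest,  # also useful later for deduplication
                })
        return records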

For this purpose, some vendors have an in-house developed or externally licensed distributed computing system that follows the fundamental principles described in the MapReduce paper published by Google [http://labs.google.com/papers/mapreduce.html].
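
The sketch below shows the map/reduce pattern in miniature on a single machine: the map step emits a (custodian, extracted text length) pair per document, and the reduce step aggregates the pairs by key. A real system distributes both phases across many workers; all names here are illustrative.

    from collections import defaultdict

    def map_phase(documents):
        """Map: emit a (key, value) pair for each document."""
        for doc in documents:
            yield doc["custodian"], len(doc["text"])

    def reduce_phase(pairs):
        """Reduce: aggregate all values that share a key."""
        totals = defaultdict(int)
        for custodian, text_length in pairs:
            totals[custodian] += text_length
        return dict(totals)

    docs = [
        {"custodian": "smith", "text": "meet at safe house"},
        {"custodian": "jones", "text": "quarterly report attached"},
        {"custodian": "smith", "text": "re: meet at safe house"},
    ]
    print(reduce_phase(map_phase(docs)))  # {'smith': 40, 'jones': 25}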

One giant filesystem

The end result of an e-discovery data processing system looks similar to one giant file system in which data is categorized by when it was received and which custodian it belongs to.
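
One plausible layout of such a processed store, purely as an illustration, is a hierarchy keyed first by batch and then by custodian:

    processed/
        batch_2010-03-01/
            custodian_smith/
                msg_000001.txt      extracted text
                msg_000001.meta     metadata (sender, date, hash, ...)
            custodian_jones/
                ...
        batch_2010-03-15/
            ...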

Presentation of data

Even though the processed data conceptually looks like a combination of different file systems, this is where the creativity of different vendors kicks in. As described in the original e-discovery article, the main purpose of this data processing is to let litigators review the relevant material in the format they prefer. For large projects this means searching and culling the data based on keywords and phrases. This "search space" is a very lucrative business, fought over fiercely by various enterprise search companies who specifically serve this market.
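
A keyword cull can be sketched with a toy inverted index: each document's extracted text is tokenized, and a review set is built from the documents matching the requested terms. Real products add stemming, phrase and proximity search, and distributed scale; the code below is only a single-machine illustration.

    from collections import defaultdict

    def build_index(documents):
        """Toy inverted index: word -> set of document ids."""
        index = defaultdict(set)
        for doc_id, text in documents.items():
            for word in text.lower().split():
                index[word].add(doc_id)
        return index

    def cull(index, keywords):
        """Return ids of documents containing any of the keywords."""
        hits = set()
        for word in keywords:
            hits |= index.get(word.lower(), set())
        return hits

    docs = {1: "Meet at safe house", 2: "Quarterly report", 3: "Safe combination inside"}
    index = build_index(docs)
    print(cull(index, ["safe"]))  # {1, 3}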

