Data processing architecture for e-discovery
Major commercial e-discovery vendors routinely process terabytes of data every year. Most of these vendors run automated data processing systems that churn out millions of TIFF documents and gigabytes of load files on a routine basis. This article describes the technical design of a typical data processing system for e-discovery and the business processes associated with a large e-discovery project.
Business process overview:
The input to such a data processing system is evidence collected on site, which for the most part consists of computer files on various media. This data is collected by forensics field personnel and shipped to the vendor's location on media such as hard disks, CDs and DVDs. E-discovery project managers employed by the vendor are responsible for due diligence on the data, maintaining chain of custody, setting client expectations and delivering the requested e-discovery productions on time.
Typical demographics of e-discovery data:
In most civil procedures, the relevant material in the context of e-discovery is any form of user-generated content. Examples are email communications, documents developed using Microsoft Office or similar products, technical designs, programming code, text files, log files, images and videos. For all practical purposes email is the most important piece of evidence, but instant messages and text messages are an increasing area of interest. Of the various email servers, Microsoft Exchange Server and IBM Lotus Domino are the most popular. Because of the popularity of these two servers, the majority of the e-discovery data that vendors process is in either .pst (Microsoft Outlook) or .nsf (Lotus Notes) files. Users are, however, adapting to the fact that these servers can be harvested by a centralized process, and individuals increasingly use web-based email providers as a way to avoid corporate systems.
A close look at these email storage formats reveals that they are essentially proprietary file systems in their own right, with emails acting as folders and attachments acting as documents or folders depending on whether they are archives. Emails in turn are stored in folders such as Inbox and Outbox. Consequently, two or more files can be identical and yet be present in different email folders. The process of eliminating references to such files is called "deduplication". Clients typically request deduplication to avoid wasting time reviewing the same documents multiple times. Deduplication should, however, only be performed per custodian, as the physical location of an email within a user's folder structure might be relevant. For example, if one person in a conspiracy sent an email with the subject "Meet At Safe House" to four others, and one recipient placed that message in a folder called "Stuff Needed For Heist", the location of that message is as much a piece of evidence as the email itself.
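The per-custodian deduplication described above can be illustrated with a minimal sketch. It assumes documents arrive as (custodian, folder path, raw bytes) tuples and that a SHA-256 hash of the raw bytes is an adequate fingerprint; both are assumptions made for illustration, and the function names are hypothetical rather than taken from any vendor's system. Every folder location is retained, so the evidence carried by the folder name is not lost.

```python
import hashlib
from collections import defaultdict


def content_hash(data: bytes) -> str:
    """Fingerprint a document; identical bytes give identical hashes.

    Hashing raw bytes with SHA-256 is an assumption of this sketch.
    """
    return hashlib.sha256(data).hexdigest()


def dedupe_by_custodian(records):
    """Group identical documents within each custodian.

    `records` is an iterable of (custodian, folder_path, data) tuples.
    Returns {custodian: {hash: [folder_path, ...]}}, so each unique
    document is listed once per custodian together with every folder
    in which a copy was found.
    """
    seen = defaultdict(dict)
    for custodian, folder_path, data in records:
        digest = content_hash(data)
        # Copies held by different custodians are never merged.
        seen[custodian].setdefault(digest, []).append(folder_path)
    return {custodian: dict(by_hash) for custodian, by_hash in seen.items()}


records = [
    ("alice", "Inbox", b"Meet At Safe House"),
    ("alice", "Stuff Needed For Heist", b"Meet At Safe House"),
    ("bob", "Inbox", b"Meet At Safe House"),
]
result = dedupe_by_custodian(records)
```

Here the two copies held by "alice" collapse into a single entry that still records both folders, while the copy held by "bob" is kept separately.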
Data processing
The data received by the vendor is copied to the local file system of the data center. Data is usually received on different dates, and each wave is treated as a 'batch'. Data can also be received per 'custodian', a legal term for the material parties involved. Once the initial bookkeeping for the received data is done, it is the job of the data processing system to extract useful text and metadata from all the documents in the document set as fast as possible.
For this purpose, some vendors have an in-house or externally licensed distributed computing system that follows the fundamental principles described in a popular technical paper on MapReduce published by Google Labs [http://labs.google.com/papers/mapreduce.html].
One giant filesystem
The end result of an e-discovery data processing system looks similar to a giant file system in which data is categorized by when it was received or by which custodian it belongs to.
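As a rough illustration of how such a categorized view might be produced, the following single-process sketch applies a map, group and reduce pattern in the spirit of the MapReduce paper. The document fields (`custodian`, `batch`, `name`, `raw`) and the function names are hypothetical, and a real system would distribute these steps across many machines rather than run them in one process.

```python
from collections import defaultdict


def map_document(doc):
    """Map step: emit a ((custodian, batch), record) pair for one document."""
    key = (doc["custodian"], doc["batch"])
    record = {
        "name": doc["name"],
        # Stand-in for real text extraction: decode the raw bytes.
        "text": doc["raw"].decode("utf-8", errors="replace"),
    }
    return key, record


def group_by_key(pairs):
    """Shuffle step: collect all records that share a key."""
    groups = defaultdict(list)
    for key, record in pairs:
        groups[key].append(record)
    return groups


def reduce_group(records):
    """Reduce step: summarise one (custodian, batch) group."""
    return {
        "document_count": len(records),
        "names": sorted(r["name"] for r in records),
    }


def process(docs):
    """Run map, group and reduce over an iterable of documents."""
    grouped = group_by_key(map_document(d) for d in docs)
    return {key: reduce_group(records) for key, records in grouped.items()}
```

Calling `process` on a list of documents returns one summary per (custodian, batch) pair, which corresponds to one "directory" in the giant file system.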
Presentation of data
Conceptually the processed data looks like a combination of different file systems, but its presentation is where the creativity of different vendors kicks in. As described in the original e-discovery article, the main purpose of this data processing is to let litigators review the relevant material in the format they prefer. For a large project this means searching and culling the data based on keywords and phrases. The "search space" is a very lucrative business and is fiercely contested by various enterprise search companies that specifically serve this market.
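Keyword and phrase culling of this kind can be sketched with a tiny inverted index. This is only a toy illustration under stated assumptions: documents are plain strings keyed by a hypothetical document id, tokens are lower-cased runs of ASCII letters and digits, and keywords are single words combined with AND. Commercial search products are far more elaborate.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lower-case alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Build an inverted index: token -> set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def cull(index, keywords):
    """Return ids of documents that contain every keyword (AND semantics)."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*sets) if sets else set()


def find_phrase(documents, phrase):
    """Return ids of documents that contain the phrase as consecutive tokens."""
    wanted = tokenize(phrase)
    n = len(wanted)
    hits = set()
    for doc_id, text in documents.items():
        tokens = tokenize(text)
        if n and any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1)):
            hits.add(doc_id)
    return hits
```

The index answers keyword queries without rescanning every document, whereas the phrase search in this sketch still scans the text; a positional index would avoid that at the cost of more storage.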
Wikimedia Foundation. 2010.