Automated ESI Processing

This technical note describes the automated file conversion and processing that takes place when you upload electronically stored information (ESI) to a case in Lexbe eDiscovery Platform (LEP), and what you should expect from these services.

Automated File Processing

Automated file processing refers to changing the state of ESI without the application of direct human error-checking or proofing. We use automated file processing to keep the costs of our eDiscovery and Platform services reasonable. Many eDiscovery projects now involve hundreds of thousands of files, millions of page equivalents and billions of individual processes. If manual human review were required for all, or even a small portion, of the files, page equivalents or processes, the cost of many jobs would be prohibitive.

For example, suppose you uploaded 10,000 emails to your case, which resulted in a total of 20,466 documents. The count grows because file expansion extracts and converts the email families (body and attachments) and other container files (e.g., zip) into searchable documents, including a normalized PDF version of each native file. In addition to the expansion, during upload the platform builds a search index to make the documents in the case searchable, creates OCRed versions of files so that they include text, and applies DeNIST and file extension repair procedures.

Will every native file convert during the Automated Process?

No. Automated processing means that not every file will be converted as a human might convert it when handling each file manually. In large jobs, processing errors are not only possible but expected.

We identify and attempt to convert (to TIFF or PDF) a wide variety of file formats, such as doc, eml, emlx, html, ics, ppt, rar and xlsx.
Please note that failure to convert a file does not mean it does not contain probative evidence, only that we did not convert it with our automated procedures. These files should be reviewed, and further steps to make them reviewable should be considered when appropriate. Please review our list of supported and not supported file types.

What happens when a file type does not convert properly?

If we attempt to convert a standard file type and cannot, our procedure is to create a placeholder file and indicate in the database record that the file 'Failed to Convert'. The standard file types may fail to convert for a variety of reasons, including file corruption, file type mis-identification, print or data extraction issues, and password protection.

Some non-converted standard file types can be converted with manual technical support services (hourly or per GB charges depending on file type and issues involved).

During native conversion, some documents may not be converted properly to PDF. In those cases, a placeholder is generated in place of the PDF or TIFF file to indicate one of the following:

Files 'Failed to Convert': standard file types that could not be converted for a variety of reasons, including file corruption, file type mis-identification, print or data extraction issues, and password protection.

Files 'Not Converted': non-standard files, which might include Media Files, some Container Files, some Email Files, Database Files, and other file types.

The automated ESI Processing during Document Upload includes

1-Archive/Container decompression. During upload and processing, we expand container files. We define container files as certain compound file types that we can separate into constituent parts or reassemble without inherently losing data as part of the conversion process.

We expand the following container files automatically as part of selected automated conversion services.

Automatically expanded file types

 Extension   Type
 PST         Outlook message store
 RAR         Archival container
 ZIP         Archival container
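
The idea of recursive container expansion can be illustrated with a minimal Python sketch. It handles only ZIP archives, using the standard library zipfile module, and works in memory. The function names expand_container and make_zip are hypothetical, and nothing here describes how the platform itself is implemented.

```python
import io
import zipfile


def expand_container(name: str, data: bytes) -> list[tuple[str, bytes]]:
    """Recursively expand ZIP data into a flat list of (path, bytes) leaf files."""
    # Note: ZIP-based office formats (docx, xlsx) would also match here;
    # a real pipeline would need to decide which ZIP files count as containers.
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return [(name, data)]  # not a container: keep the file as-is
    leaves = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            child = archive.read(info)
            # Nested archives are expanded in turn.
            leaves.extend(expand_container(f"{name}/{info.filename}", child))
    return leaves


def make_zip(files: dict[str, bytes]) -> bytes:
    """Helper for the example: build a ZIP archive in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for filename, content in files.items():
            archive.writestr(filename, content)
    return buffer.getvalue()


inner = make_zip({"memo.txt": b"memo"})
outer = make_zip({"letter.txt": b"letter", "inner.zip": inner})
expanded = expand_container("outer.zip", outer)
for path, content in expanded:
    print(path, len(content))
```

In this sketch, one uploaded archive yields two leaf documents, which is the same effect that makes the document count in a case larger than the number of files uploaded.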

2-Metadata extraction. We extract the email body and attachments from Outlook MSG files and associate them with the container MSG for load file preparation. Supported Outlook MSG and Outlook Express EML files are processed to recursively extract attachments.

3-MD5 hash code generation. MD5 is a commonly used cryptographic hash function that produces a 128-bit (16-byte) hash value, expressed as a hexadecimal number 32 digits long. MD5 is used to check data integrity. The value changes if there is any change in the contents of a file, but it is independent of the file name, because the name is stored by the file system, not in the file. We deduplicate loose native files using an MD5 hash of each file.
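
The two properties described above (the hash follows the contents, not the file name) can be checked with a short Python sketch using the standard library hashlib module. The helper name md5_of_file and the sample file names are hypothetical and chosen only for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 digest of a file's contents as 32 hexadecimal digits."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as folder:
    original = Path(folder) / "contract.doc"
    renamed = Path(folder) / "copy of contract.doc"
    edited = Path(folder) / "contract_v2.doc"
    original.write_bytes(b"same content")
    renamed.write_bytes(b"same content")   # identical bytes, different name
    edited.write_bytes(b"same content!")   # one byte added
    hashes = {p.name: md5_of_file(p) for p in (original, renamed, edited)}

for name, value in hashes.items():
    print(value, name)
```

The first two files print the same hash despite their different names, while the edited file prints a different one.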

4-File extension repair and encoding. As part of eDiscovery processing, we parse files and attempt to automatically identify file types for proper recognition and conversion services. While file extensions are used in this note to refer to file formats, we do not rely on extensions to detect file formats. For example, a Word document named "sample.mp3" would still be identified as a Word document.
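
One common way to identify a file type without trusting its extension is to inspect the leading "magic" bytes of the content. The Python sketch below assumes that approach and covers only a handful of well-known signatures. The table SIGNATURES and the function identify are hypothetical names, and the note does not state which detection method the platform uses.

```python
# Leading "magic" bytes for a few formats (a small, illustrative subset).
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP-based file (zip, docx, xlsx, ...)"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (legacy doc, xls, ppt, msg)"),
    (b"Rar!\x1a\x07", "RAR archive"),
]


def identify(data: bytes) -> str:
    """Guess a file type from its content, ignoring the file name entirely."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


# A legacy Word document saved under a misleading name still starts with
# the OLE2 signature, so the name plays no part in the result.
file_name = "sample.mp3"
file_content = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1" + b"\x00" * 16
print(file_name, "->", identify(file_content))
```

A signature match only narrows the type to a family (for instance, OLE2 covers several legacy Office formats), so deciding between Word and Excel would require parsing further into the file.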

5-DeNIST. As part of conversion processing, certain files that are unlikely to result in evidence are removed from the data set and ignored for further processing.
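
DeNIST filtering is commonly done by comparing each file's hash against a reference list of known operating system and application files, such as the list published by NIST's National Software Reference Library. The Python sketch below assumes that hash-matching approach. The function denist, the sample files and the one-entry known_hashes set are hypothetical stand-ins, not real reference data.

```python
import hashlib


def denist(files, known_hashes):
    """Drop files whose content hash appears in the known-file hash set."""
    kept = []
    for name, data in files:
        if hashlib.md5(data).hexdigest() in known_hashes:
            continue  # known system/application file: ignore for further processing
        kept.append((name, data))
    return kept


# Hypothetical stand-in for a reference list of known-file hashes.
system_file = b"bytes of a standard operating system file"
known_hashes = {hashlib.md5(system_file).hexdigest()}

files = [
    ("kernel32.dll", system_file),
    ("budget.xlsx", b"user-created spreadsheet bytes"),
]
remaining = denist(files, known_hashes)
print([name for name, _ in remaining])
```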

6-Email attachment extraction and parent email association.
Emails and their attachments are kept in order in Lexbe eDiscovery Platform. Email families are uploaded or processed as follows:
-The first document is the email body, and the subsequent documents are the attachments.
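
The family ordering described above (parent email first, attachments after it) can be sketched in Python with the standard library email package. This example handles EML-style messages only; Outlook MSG files need a separate parser. The function name email_family and the sample message are assumptions made for illustration.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def email_family(raw: bytes) -> list[tuple[str, str]]:
    """Return (role, label) pairs: the parent email first, then each attachment."""
    message = message_from_bytes(raw, policy=policy.default)
    family = [("parent", str(message.get("Subject", "(no subject)")))]
    for part in message.iter_attachments():
        family.append(("attachment", part.get_filename() or "(unnamed)"))
    return family


# Build a small sample email with two attachments for the example.
sample = EmailMessage()
sample["Subject"] = "Q3 report"
sample["From"] = "john@example.com"
sample["To"] = "mary@example.com"
sample.set_content("Please see attached.")
sample.add_attachment(b"%PDF-1.4 sample", maintype="application",
                      subtype="pdf", filename="report.pdf")
sample.add_attachment(b"\x00\x01\x02", maintype="application",
                      subtype="octet-stream", filename="data.bin")

family = email_family(sample.as_bytes())
for role, label in family:
    print(role, label)
```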

7-Native text extraction. Our native extraction process takes the raw native file versions, flattens any files attached within container files (such as an .msg file), and extracts file metadata. In the case of Outlook .ost and .pst files, that means taking each .msg file, opening any attachments, and extracting data from them.

8-Optical character recognition (OCR) of image files. OCR, or 'optical character recognition', is the process of taking image files (e.g., scanned from documents) and electronically converting them into searchable text.

9-Full-text indexing. We index both extracted text in native files and the OCRed text from paginated versions of the same files, all in a comprehensive, combined index for fast and easy searching.
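
A combined index of this kind can be pictured as a single inverted index fed by both text sources. The Python sketch below is a toy version under that assumption. The names tokenize, build_index and search and the two sample documents are hypothetical, and a production search engine would add ranking, stemming and phrase handling.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents: dict[str, tuple[str, str]]) -> dict[str, set[str]]:
    """Map each term to the IDs of documents whose native OR OCR text contains it."""
    index = defaultdict(set)
    for doc_id, (native_text, ocr_text) in documents.items():
        for token in tokenize(native_text) + tokenize(ocr_text):
            index[token].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return IDs of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


documents = {
    # Native text lacks the stamp that OCR picks up from the paginated version.
    "DOC-1": ("Quarterly revenue summary", "Quarterly revenue summary CONFIDENTIAL"),
    # A scanned image has no native text at all; only OCR text is available.
    "DOC-2": ("", "Signed lease agreement"),
}
index = build_index(documents)
print(search(index, "confidential"))
print(search(index, "lease agreement"))
```

Indexing both sources means a term is found whether it appears only in the native text, only in the OCRed text, or in both.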

10-PDF creation. All documents uploaded into our eDiscovery platform automatically have OCR applied and a PDF version created and associated with the original document.

11-Custodian/Case Participant assignment. Custodians/Case Participants can be assigned to documents in Lexbe at the time of the upload, or later through Multi-Doc Edit or in the Document Viewer. This allows you to track Custodians in Lexbe eDiscovery Platform using "Case Participants", who are the litigants, deponents, witnesses, and other individuals and organizations in the case that provide factual information. Case Participants are often Custodians of ESI in a case, and Custodians are usually Case Participants as well.

Additional Features

Lexbe eDiscovery Platform also offers the following features:

-Deduplication. Lexbe eDiscovery Platform helps you narrow a document collection by giving you the choice of identifying duplicate files and/or removing them from your case. This optimizes the review process, reduces review inconsistencies and errors, and lowers review costs.

Deduplication within each custodian is called 'vertical deduplication' and between custodians is called 'horizontal deduplication'.  Generally, horizontal deduplication is not considered a best practice as it loses the association of other custodians to the deduped files.  If needed, however, it can be done in Lexbe eDiscovery Platform if custodian assignments have not been made.

When Lexbe looks for duplicates, it runs vertically through the custodian. For example, if custodian-sender 'John' sent an email with multiple attachments to several recipients, Lexbe will look vertically in John's documents for exact duplicates. Our system then takes both the email and its attachments (the entire family) and marks them together as near duplicate, so when you delete John's email you also delete the attachments associated with it.
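
The difference between vertical and horizontal deduplication comes down to whether the custodian is part of the comparison key. The Python sketch below illustrates this with MD5 content hashes. The function deduplicate, its scope parameter and the sample documents are hypothetical, and the sketch works on single files rather than whole email families.

```python
import hashlib


def deduplicate(documents, scope="vertical"):
    """Keep the first copy of each file, within a custodian or across all custodians.

    documents: list of (custodian, file_name, content_bytes)
    scope: "vertical" compares within each custodian; "horizontal" compares across them.
    """
    seen = set()
    kept = []
    for custodian, name, data in documents:
        digest = hashlib.md5(data).hexdigest()
        key = (custodian, digest) if scope == "vertical" else digest
        if key in seen:
            continue  # duplicate within the chosen scope
        seen.add(key)
        kept.append((custodian, name))
    return kept


documents = [
    ("John", "plan.doc", b"project plan"),
    ("John", "plan (copy).doc", b"project plan"),  # duplicate inside John's set
    ("Mary", "plan.doc", b"project plan"),         # same file held by Mary
]
vertical = deduplicate(documents, scope="vertical")
horizontal = deduplicate(documents, scope="horizontal")
print(vertical)
print(horizontal)
```

The horizontal result keeps only John's copy, which shows how the association between Mary and the file is lost.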

-Near-Duplication. Our near-duplicate detection identifies files and emails that are nearly identical (at least 50% similar in text content). Examples include multiple versions of a Microsoft Word document that were slightly modified over a period of time, or chains of email threads that are frequent and repetitive.
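
A similarity threshold of this kind can be illustrated with a small Python sketch. It assumes a word-level comparison using the standard library difflib.SequenceMatcher; the note does not say how the platform measures similarity, so the metric, the function names similarity and is_near_duplicate, and the sample texts are all assumptions.

```python
from difflib import SequenceMatcher


def similarity(text_a: str, text_b: str) -> float:
    """Return a 0.0-1.0 similarity score based on matching runs of words."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    return SequenceMatcher(None, words_a, words_b).ratio()


def is_near_duplicate(text_a: str, text_b: str, threshold: float = 0.5) -> bool:
    """Treat two texts as near duplicates when they are at least 50% similar."""
    return similarity(text_a, text_b) >= threshold


draft_1 = "The supplier shall deliver the goods within thirty days of the order date"
draft_2 = "The supplier shall deliver the goods within sixty days of the order date"
unrelated = "Lunch menu for Friday includes soup and salad"

print(is_near_duplicate(draft_1, draft_2))
print(is_near_duplicate(draft_1, unrelated))
```

The two contract drafts differ by one word and fall above the threshold, while the unrelated text shares no words with them and falls below it.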

-Email Thread Identification. Our email threading groups only similar emails (at least 50% the same) so that items with overlapping content can be reviewed together. This helps you identify related email families by content in order to find all the emails in a group, detect missing emails, and keep only the relevant final email messages that need to be reviewed.

Further Assistance

We also offer Project Management and Technical Services which, if engaged, support your efforts by helping to execute specific requests for document upload. Please contact your sales rep or our Support Center if needed.