Deduplication

Duplicates are extra copies of Outlook email (PST and MSG), Outlook Express email (EML) and various loose native files, such as Excel spreadsheets and Word files, that are often found in duplicate form in a case.

Lexbe eDiscovery Platform helps you narrow a document collection by giving you the choice of identifying duplicate files and/or removing them from your case. This optimizes the review process, reduces review inconsistencies and errors, and lowers review costs.

Duplication v. Near Duplication

Please note that Deduplication can be used independently of Near Duplication, and vice versa.
         



Types of Files, with Description and Advantages

Identical
>Description: Two or more documents that have exactly the same extension, subject, number of words and content.
>Advantages: Identifying them helps speed up indexing and reduces storage costs; you can use reduced-redundancy storage for noncritical data.

Exact Duplicates
>Description: Two or more documents are exact duplicates when the textual content of the documents is the same. A Microsoft Word file and the PDF version of that file are duplicates.
>Advantages: This process compares electronically stored information (ESI) based on its characteristics and identifies duplicate files in a set of data. It helps you eliminate redundant documents and ensure that only one unique file is kept in your case. It also makes review less time consuming and lowers the risk of inconsistent coding of documents.

Near-Duplicates
>Description: Significantly similar versions of documents that differ by, for example, a few sentences, words or paragraphs.
>Advantages: Identifying them eliminates redundancy in the document review process and significantly reduces the number of documents to be reviewed.

         


Why Duplicates Exist

Duplicates arise because individual custodians of data may have duplicate copies of files or documents.  A common example is email.  When collecting from a custodian, multiple email stores may be collected so that nothing is missed, but as part of that process multiple copies of the same email may be collected.  An example would be collecting a PST file from a laptop while also collecting a Google Gmail account from the cloud.

Another example is duplicates between custodians.  If John Smith sends an email to Bill Jones, the email will appear twice in a collection, once from each custodian.  In some companies this can lead to dozens or more copies of the same files.


A third example is attachments.  One Word file might be attached to multiple emails, resulting in duplication within and between custodians.



A fourth example is similar, but not identical, emails that are part of email chains.  While similar, these copies are distinct, and minor differences may be important (e.g., 'Yes, I agree' added to a 1,000 word email).

A fifth example is the same paper file scanned at different times or saved in different file formats. It may be the same underlying document, but each scan will be a distinct electronic file.  If scanned with different software, the OCR text may be different.  Or the paper versions might be slightly different: one has handwriting on it (again, e.g., 'Yes, I agree') while the other does not.  Removing 'duplicates' is dangerous unless you can be sure they are 'exact' duplicates.



Types Of Duplicates

There are different types of duplicates that are important to understand to manage the process of electronic deduplication. 

Exact (Hash) duplicates. Loose native files (not email attachments) are identified as duplicates using an MD5 hash of the entire electronic file, which means that they must be exact copies. While some files might have the same content, that doesn't necessarily mean they are exact (hash) duplicates.  For example, you may see the same PDF twice in a case where the first file is an original scanned PDF without OCR and the second is the same file with OCR applied (the addition of a text layer beneath the image), resulting in two files that are not exact electronic copies.
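
As an illustration of how whole-file hashing separates exact copies from files that merely look alike, here is a minimal Python sketch. It is not Lexbe's implementation; the function names `md5_of_file` and `group_hash_duplicates` are hypothetical, and only the standard `hashlib` module is assumed.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file's complete byte content."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large native files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_hash_duplicates(paths):
    """Group files by MD5 digest and keep only groups with two or more files."""
    groups = defaultdict(list)
    for path in paths:
        groups[md5_of_file(path)].append(Path(path))
    return {digest: files for digest, files in groups.items() if len(files) > 1}
```

Because the hash covers every byte, a scanned PDF and the same PDF with an OCR text layer produce different digests and would not be grouped together by this sketch.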

Email Dups Based On Metadata. Emails are identified as duplicates using metadata from the email.  We use the following Outlook and Outlook Express email metadata fields for deduplication:

Sender
Recipients (including Cc and Bcc)
Email Subject Matter
Email Date & Time Sent

Corrupted emails sometimes have blank entries for one or more of the above fields; therefore, we require that all four fields be present in order to dedup an email.  Exact duplicate identification still retains documents that are from different custodians (if custodians are identified), to preserve custodian tracking and identity.  Also, separate copies of attachments to different email families are retained for email family integrity.  We recommend using NearDup and Multi-Doc editing to mass tag NearDup documents if desired.
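
The rule above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the platform's code: the dictionary keys (`sender`, `recipients`, `subject`, `sent`, `custodian`, `id`) and both function names are hypothetical, and ignoring the order of recipients is an assumed normalization.

```python
REQUIRED_FIELDS = ("sender", "recipients", "subject", "sent")


def email_dedup_key(email):
    """Return a comparison key built from the four fields, or None if any is blank."""
    parts = []
    for field in REQUIRED_FIELDS:
        raw = email.get(field)
        if field == "recipients":
            # Assumption: recipient order does not matter, so addresses are sorted.
            value = tuple(sorted(a.strip() for a in (raw or []) if a.strip()))
        else:
            value = str(raw or "").strip()
        if not value:
            return None  # a blank field means the email is never deduplicated
        parts.append(value)
    return tuple(parts)


def find_email_duplicates(emails):
    """Return ids of emails that repeat an earlier email from the same custodian."""
    first_seen = {}
    duplicate_ids = []
    for email in emails:
        key = email_dedup_key(email)
        if key is None:
            continue
        # Scoping by custodian keeps copies held by different custodians.
        scoped_key = (email.get("custodian"), key)
        if scoped_key in first_seen:
            duplicate_ids.append(email["id"])
        else:
            first_seen[scoped_key] = email["id"]
    return duplicate_ids
```

In this sketch an email with a blank subject is skipped entirely, and the same message held by two custodians is kept for both.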

Near Duplicates. Near duplicates are files that are very similar but do not qualify as Exact (hash) duplicates or Email duplicates.  These are the most difficult to identify and deal with in automated processes, as it is difficult or impossible to determine automatically that the files are close enough to be treated as duplicates.
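
To show why near duplicates need a similarity measure rather than a hash, here is a small Python sketch using Jaccard similarity over word shingles. This is a generic, commonly used technique chosen for illustration; it is not a description of how Lexbe's NearDup feature works, and the function names and the 0.9 threshold are assumptions.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences of length `size`."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(text_a, text_b, size=3):
    """Share of shingles common to both texts, from 0.0 to 1.0."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, threshold=0.9):
    """True when the texts differ but their similarity reaches the assumed threshold."""
    return text_a != text_b and jaccard_similarity(text_a, text_b) >= threshold
```

A long email and the same email with 'Yes, I agree' added at the top would score very close to 1.0 here, which is exactly why such pairs still call for human review before anything is removed.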

Deduplication Within or Between Custodian Collections

Deduplication within each custodian is called 'vertical deduplication' and between custodians is called 'horizontal deduplication'.  Generally, horizontal deduplication is not considered a best practice as it loses the association of other custodians to the deduped files.  If needed, however, it can be done in Lexbe eDiscovery Platform if custodian assignments have not been made.

When Lexbe looks for duplicates, it runs vertically through each custodian. For example, if custodian-sender 'John' sent an email with multiple attachments to several recipients, Lexbe will look vertically in John's documents for exact duplicates. Our system then takes both the email and its attachments (the entire family) and marks them together as near duplicates, so when you delete John's email you also delete the attachments associated with it.
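
The difference between vertical and horizontal deduplication can be sketched in Python as follows. The record layout (`id`, `custodian`, `content_hash`) and the function name `mark_duplicates` are hypothetical and used only for illustration.

```python
def mark_duplicates(documents, horizontal=False):
    """Return ids of documents that repeat an earlier content hash.

    Vertical (default): compare only within the same custodian.
    Horizontal: compare across all custodians.
    """
    seen = set()
    duplicate_ids = []
    for doc in documents:
        # In horizontal mode the custodian is ignored, so all documents share one scope.
        scope = None if horizontal else doc.get("custodian")
        key = (scope, doc["content_hash"])
        if key in seen:
            duplicate_ids.append(doc["id"])
        else:
            seen.add(key)
    return duplicate_ids
```

With vertical deduplication, a copy held by a second custodian is not marked, which preserves that custodian's association with the file. When no custodians are assigned, every document falls into the same scope and both modes give the same result.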

Deduplication Process In Lexbe eDiscovery Platform

For native files uploaded to Lexbe eDiscovery Platform, or processed as part of Lexbe eDiscovery Services that support deduplication, we identify duplicates within a job (for eDiscovery Services) or within a case (for Lexbe eDiscovery Platform). If you wish for duplicates to be deleted in a Lexbe eDiscovery job, you need to request this specifically, and it needs to be indicated on the Job Acceptance email. This flexibility is often needed as custodians in a case can change.

Deduplication is done within identified Custodians only (using the Case Participant field in Lexbe eDiscovery Platform) and not between Custodians. If no Custodians are identified in a Lexbe eDiscovery Platform case (no Case Participants assigned), or no Case Participants are assigned as part of an eDiscovery job, then deduplication is performed across all documents.

Deduplication is available from the 'Case>Add Case Documents>Deduplication' page by clicking on the 'Dedup Case Docs' button. This feature marks duplicates; it does not delete them. When started, all existing duplicate identifications are redone across the entire case. Take care when re-running deduplication, as different files may be identified as duplicates in subsequent runs. Only Account Admin Users can apply dedup (although all users can view and use the results).

How To Identify Duplicates

Lexbe eDiscovery Platform flags duplicate files in the 'IsDuplicate' column, which can be accessed from the Browse and Search pages (Sort, Show Columns or Select Filters).


You can also access duplicate files from the Document Viewer, where they are listed under the 'Exact Duplicate' section. From this page you can find similar key documents, accelerate review by mass tagging similar documents, check the consistency of responsiveness and privilege groupings between similar documents, follow email threads, etc.


Filter For Duplicates 

Once the deduplication is completed, go to the Browse or Search pages and apply one of the following filters:
>Duplicates>'Show Duplicates Only'. This displays only the group of files marked as duplicates.
>Duplicates>'Exclude Duplicates'. This hides all the files marked as 'IsDuplicate' in the case.

Filters do not change the file count within a case (there is no automatic deletion), so for Lexbe eDiscovery Platform Storage Calculation purposes we recommend that you delete any unneeded files from the case.

When you apply filters, Lexbe eDiscovery Platform automatically saves the records under the Filter Quick Links section and creates filter hyperlinks that open a specific set of documents. You can rename them by clicking on the 'Edit' hyperlink.




Deleting Duplicates

To remove the files, first apply the filter 'Show Duplicates Only' = 'checked', select 'All XX Documents in Case' (if more than 25), and then click on the 'Delete Selected Docs' button. As deleted duplicates cannot be recovered, consider filtering them out of view instead of deleting ('Select Filter>Duplicates>Exclude Duplicates'), or downloading them before deleting so that recovery is possible. We recommend that you create a download briefcase as a backup and save it to your local desktop, since there is no 100% guaranteed way to detect duplicates.

Please be careful with what you delete. While some files might have the same values, that doesn't necessarily mean they are exact duplicates. For example, you may see the same PDF twice in a case where the first file is an original PDF and the second is a scanned image version with the addition of a text layer beneath the image.

Risks Of Deleting Email Attachments

We recommend that you do not delete duplicates marked as 'IsEmailAttachment' in Lexbe eDiscovery Platform, so that users reviewing an email collection will be able to establish how the content was distributed and who may have been sharing information. Often, emails (MSGs) created by different custodians contain the same attachments.

NOTE: If an email attachment and a loose file are duplicates, neither file will be marked as 'IsDuplicate' in Lexbe eDiscovery Platform. Attachments generally should not be deleted as duplicates unless the parent email is deleted as well. Otherwise, the email family will be broken and the Document Viewer will not associate them. Our system will not dedup email attachments from different email families.

Still Finding Duplicates In The Case?

That is possible, since a file can be introduced into a document collection in multiple ways. For example, different custodians might attach the same document to an email and send it to different recipients, creating separate versions of the same file. During a document collection, those separate emails can be collected and introduced into a document collection and the same attachments are also captured along with each email. Those emails are kept in different families, and deduplication will not eliminate copies in different email families.

Manually Identifying & Reviewing Duplicates

If needed, and subject to your own manual review and quality control, you may group and review similar documents and determine that some are duplicates for batch coding, even though they do not qualify as exact duplicates or hash duplicates.  Without a thorough manual review you cannot be sure they are not really different documents.  Files for this purpose can be displayed and sorted by title, extension, size or number of words, looking for similarities that might indicate duplicates.
Here are the steps to try this approach:
>From the Browse or Search pages, show Column Section>Built-in Doc Fields
>Select the Original Title, Ext, Pages, Words, Size, and IsEmailAttachment columns. This will help you to consider files with similar names, exact number of pages, words, etc.
>Sort on the various column headings and look for similarities that suggest duplicates.
>Do not delete a file unless you are sure it really is a duplicate.
>Do not delete files as duplicates if they are attachments to emails.
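
If the column values are exported for offline checking, the same comparison can be sketched in Python. This assumes a hypothetical list of records with `title`, `ext`, `pages`, `words`, `size` and `is_email_attachment` keys; it only suggests candidates for manual review and does not confirm that anything is a duplicate.

```python
from collections import defaultdict


def candidate_duplicate_groups(records):
    """Group loose files that share extension, page count, word count and size."""
    groups = defaultdict(list)
    for record in records:
        if record.get("is_email_attachment"):
            continue  # attachments stay with their email family and are not candidates
        key = (record["ext"].lower(), record["pages"], record["words"], record["size"])
        groups[key].append(record["title"])
    # Only groups with two or more titles are worth a manual look.
    return [titles for titles in groups.values() if len(titles) > 1]
```

Each returned group is a list of titles to open side by side and compare by hand before any coding or deletion decision.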