Near Duplication

Our NearDup service identifies files and emails that are nearly identical (50% similar in text content). Examples include multiple versions of a Microsoft Word document that were slightly modified over a period of time, or a chain of email threads that are frequent and repetitive. The identified documents and emails are grouped based on a similarity percentage, and users can then perform various types of near-duplicate document analysis using the capabilities in Lexbe eDiscovery Platform. Because we compare text content after text extraction and OCR, Lexbe will identify as near-duplicates some documents that look like exact duplicates but are not. Examples of documents that might seem like exact duplicates, but differ in some way from the computer analysis perspective, are:
>Documents that were scanned at different times.
>Email threads.
>Documents that were saved to PDF at different times.
>Documents with small unobservable editing changes.
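
As a rough illustration of how a text-similarity percentage can be computed, the sketch below compares the overlapping three-word runs ("shingles") of two extracted texts and reports their Jaccard overlap. This is only an assumption for illustration: the page does not describe the algorithm Lexbe uses, and the function names and sample texts are hypothetical.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word runs) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity_percent(a, b, k=3):
    """Jaccard overlap of the two shingle sets, expressed as a percentage."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 100.0  # two empty texts are treated as the same
    return 100.0 * len(sa & sb) / len(sa | sb)


# Hypothetical extracted text: two versions of one sentence, plus an unrelated file.
v1 = "The parties agree to settle all claims arising from the supply contract by March 1."
v2 = "The parties agree to settle all claims arising from the supply contract by April 15."
unrelated = "Lunch menu for the quarterly offsite meeting."

print(round(similarity_percent(v1, v2), 1))  # well above 50
print(similarity_percent(v1, unrelated))     # no shared shingles
```

Because the comparison runs on extracted text, two scans of the same page whose OCR output differs slightly would score high here without being byte-for-byte identical.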

Duplication v. Near Duplication

Please note that Near Duplication can be used independently of Deduplication, and vice versa.
         



Types of Files: Description and Advantages

Identical
>Description: Two or more documents that have exactly the same extension, subject, number of words and content.
>Advantages: Helps you speed up indexing and reduce storage costs; you can use reduced-redundancy storage for noncritical data.

Exact Duplicates
>Description: Two or more documents are exact duplicates when their textual content is the same. For example, a Microsoft Word file and the PDF version of that file are duplicates.
>Advantages: Deduplication compares electronically stored information (ESI) based on its characteristics and eliminates redundant data or, as the name implies, identifies duplicate files in a set of data. It helps you eliminate redundant documents and ensures that only one unique file is kept in your case. It also makes review less time-consuming and lowers the risk of inconsistent coding of documents.

Near-Duplicates
>Description: Significantly similar versions of documents that differ by, for example, a few words, sentences or paragraphs.
>Advantages: Lets you find similar key documents, accelerate review by mass tagging similar documents, check the consistency of responsiveness and privilege coding between similar documents, work with email threads, etc.
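
The three categories can be pictured as successively looser comparisons. The sketch below is a minimal illustration under stated assumptions, not Lexbe's implementation: "identical" is approximated by a matching SHA-256 hash of the file bytes, "exact duplicate" by matching normalized extracted text, and "near-duplicate" by a word-level similarity ratio of at least 0.5 computed with Python's `difflib`. The `classify` function and its inputs are hypothetical.

```python
import hashlib
import re
from difflib import SequenceMatcher


def normalized_words(text):
    """Lower-case the text and keep only its word tokens."""
    return re.findall(r"\w+", text.lower())


def classify(file_a, text_a, file_b, text_b, threshold=0.5):
    """Classify a pair of documents given their raw bytes and extracted text."""
    # Assumption: byte-for-byte equality stands in for "identical".
    if hashlib.sha256(file_a).digest() == hashlib.sha256(file_b).digest():
        return "identical"
    words_a, words_b = normalized_words(text_a), normalized_words(text_b)
    # Same text in different containers, e.g. a Word file and its PDF version.
    if words_a == words_b:
        return "exact duplicate"
    ratio = SequenceMatcher(None, words_a, words_b, autojunk=False).ratio()
    return "near-duplicate" if ratio >= threshold else "different"


report_q1 = "The quarterly report covers revenue and costs for Q1"
report_q2 = "The quarterly report covers revenue and costs for Q2"

print(classify(b"docx-bytes", report_q1, b"pdf-bytes", report_q1))   # exact duplicate
print(classify(b"docx-bytes", report_q1, b"docx-bytes2", report_q2)) # near-duplicate
```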
         


Benefits Of Near Duplication


>Find similar documents once a 'hot doc' has been identified.
>Reduce the chance of inadvertently releasing privileged documents by allowing identification of documents similar to ones marked privileged or work product.
>Reduce the number of documents hosted in a review environment.
>Reduce review time by allowing large batches of similar documents to be reviewed and coded at one time.
>Increase consistency of review by allowing similar documents to be coded the same.
>Include near-duplicate documents when creating review sets.
>Increase quality control of outgoing productions.

Running Near-Duplicate & Emails

When Lexbe looks for near-duplicates, it runs vertically through each custodian. For example, if custodian-sender 'John' sent an email with an attachment to several recipients, Lexbe will look vertically in John's documents for near-duplicates. The system then takes both the email and the attachment (the entire family) and marks them together as near-duplicates, so when you delete John's email you also delete the attachment.

Please note that separate versions of the same document might be attached to emails belonging to different custodians. These are not marked as exact duplicates and are retained to preserve email family integrity. So even if you find all the duplicates of a document in one custodian, the same document might also appear in a different custodian's documents.

You can identify related email families by content in order to find all the emails in a group, detect missing emails, and keep only the relevant final email messages that need to be reviewed.

How It Works

Near duplication is started from the 'Case>Add Case Documents' page by clicking on the 'Calculate Groups' button. This feature will only mark near-duplicates, not delete them (no batch title selection is necessary). When started, all existing files with similar contents will be grouped across the entire case (e.g. 'Group 1', 'Group 2', etc.). Take care when re-running near duplication, as different files may be identified as near-duplicates in subsequent runs. Only Account Admin Users can apply near duplication (although all users can view the results).
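
To make the grouping step concrete, here is a minimal sketch of how documents could be clustered into numbered groups: every pair of texts is compared, pairs at or above the similarity threshold are merged with a union-find structure, and each cluster of two or more documents receives a label such as 'Group 1'. The pairwise comparison, the 0.5 threshold on `difflib`'s ratio, and all names are assumptions for illustration; the page does not document how 'Calculate Groups' works internally.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar(a, b, threshold=0.5):
    """True if the word-level similarity ratio reaches the threshold."""
    matcher = SequenceMatcher(None, a.lower().split(), b.lower().split(), autojunk=False)
    return matcher.ratio() >= threshold


def calculate_groups(docs):
    """Map each near-duplicate document id to a label like 'Group 1'."""
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(docs, 2):
        if similar(docs[a], docs[b]):
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for doc_id in docs:
        clusters.setdefault(find(doc_id), []).append(doc_id)

    groups, number = {}, 0
    for members in clusters.values():
        if len(members) > 1:  # documents without a near-duplicate get no group
            number += 1
            for doc_id in members:
                groups[doc_id] = f"Group {number}"
    return groups


docs = {
    "DOC-1": "Draft agreement between Acme and Beta for widget supply",
    "DOC-2": "Draft agreement between Acme and Beta for widget supply and delivery",
    "DOC-3": "Meeting notes from the Tuesday budget review",
    "DOC-4": "Meeting notes from the Thursday budget review",
    "DOC-5": "Parking permit renewal form",
}
groups = calculate_groups(docs)
print(groups)
```

In this sketch the group numbers depend on the order and content of the input, so adding documents and running it again can shift documents between groups or renumber them. That is one plausible reason a re-run may produce different groupings.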

After clicking the 'Calculate Groups' button, you will see the dialog box below:


The 'Case>Add Case Documents' page will also display a message indicating that near duplication is running. You can keep working while near duplication is in progress, and the message will disappear once the files have been processed. Near duplication time can vary greatly depending on the size and number of files, so you can switch to other pages such as Browse or Search to review documents in the meantime.

Other pages in Lexbe eDiscovery Platform will also display the progress of near duplication in a progress bar located at the upper-right corner of the page. The progress bar displays the number of files still waiting to be processed. Once all documents in the queue have been grouped, both the message and the progress bar will disappear.

If you want to start productions, download files to Briefcases or apply Deduplication, please wait until near duplication has completely finished on the server.

Please note that near duplication can take a long time to run, and the time also depends on how many new documents have been added. You can tell when it is done by checking the progress bar (no more processing steps running).

How To Identify Near-Duplicates

Lexbe eDiscovery Platform will categorize near-duplicate files under the same group, identified by the same numerical value (e.g., 'Group 1'). The groups can be accessed from the Browse and Search pages (Sort, Show Fields or Select Filters).



You can also access near-duplicate files from the Document Viewer, where the near-dup groups are listed under the 'Near Duplicate' section. From this page you will be able to find similar key documents, accelerate review by mass tagging similar documents, check the consistency of responsiveness and privilege coding between similar documents, work with email threads, etc.


Filtering Near Duplicate Groups


When a case contains a high volume of documents, filters let you narrow the search to a specific set of documents and view only near-duplicates.

Near duplication does not change the file count within a case (there is no automatic deletion); it only categorizes near-duplicates under the same numeric group. You can also use this feature to create shared filters of near-duplicates.

1-Running a Filter for All Near-Duplicates. From the Search or Browse pages, click on Filter>Select Filter>Show Near Dup Groups Only. This will display all the near-duplicates classified in the groups.


2-Filtering and displaying one Group. If you want to narrow down the results and show only one specific Near Dup Group, you can also apply filters by 'Near Dup Group No.', for example 3472.



3-Sorting on Specific Groups. You can also sort on the groups by clicking on the field title 'Near Dup Group'.

4-Exporting Log to Excel. By exporting the near-duplicate log to an Excel spreadsheet, you can keep track of documents produced where privileged information might not have been removed. An Excel log allows you to filter, sort and see whether there are near-duplicates that were coded inconsistently (e.g., confidentiality, email threading, attachment, etc.).
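
Once the log is exported, the inconsistency check described in step 4 can also be scripted. The sketch below assumes the Excel log has been saved as CSV and that it contains columns named 'Near Dup Group' and 'Privileged'. Both column names, the sample rows and the `inconsistent_groups` helper are hypothetical, so adjust them to match the actual export.

```python
import csv
import io
from collections import defaultdict


def inconsistent_groups(rows, group_col="Near Dup Group", coding_col="Privileged"):
    """Return the near-dup groups whose members carry different coding values."""
    values = defaultdict(set)
    for row in rows:
        group = row[group_col].strip()
        if group:  # skip documents that belong to no near-dup group
            values[group].add(row[coding_col].strip().lower())
    return sorted(group for group, seen in values.items() if len(seen) > 1)


# Hypothetical export; in practice open the saved CSV file instead.
sample = io.StringIO(
    "Title,Near Dup Group,Privileged\n"
    "memo_v1.docx,3472,Yes\n"
    "memo_v2.docx,3472,No\n"
    "invoice_a.pdf,3480,No\n"
    "invoice_b.pdf,3480,No\n"
    "notes.txt,,No\n"
)
flagged = inconsistent_groups(csv.DictReader(sample))
print(flagged)  # ['3472']
```

Group 3472 is flagged because one version of the memo is coded privileged and the other is not, which is the kind of mismatch worth reviewing before a production goes out.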


Manually Identifying & Reviewing Near Duplicates

If needed, and subject to your own manual review and quality control, you may review further once you have filtered near-duplicates by group, and determine which files to batch code, review or delete. A thorough manual review is also necessary to be sure that the files are not really different documents.
Here are the steps for this approach:

>From the Browse or Search pages, after choosing to show only 'Near Dup Group', go to Field>Show Field Section>Built-in Doc Fields.
>Select the Original Title, Ext, Pages, Words, Size, and IsEmailAttachment columns. This will help you compare files with similar contents.

Risks Of Deleting Email Attachments

We recommend that you do not delete near-duplicates marked as 'IsEmailAttachment' in Lexbe eDiscovery Platform, so that users reviewing the email collection will be able to establish how the content was distributed and who may have been sharing information. Often, emails (MSGs) created by different custodians contain the same attachments.

NOTE: If an email attachment and a loose file are near-duplicates, the two files will not be placed in the same group in Lexbe eDiscovery Platform. Attachments generally should not be deleted as near-duplicates unless the entire email is being deleted as a near-duplicate as well. Otherwise, the email family will be broken and the Document Viewer will not associate its members. Our system does not near-deduplicate email attachments across different email families.
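
Before deleting near-duplicates in bulk, it can help to check that no email family would be left partially deleted. The sketch below is a hypothetical pre-deletion check, not a platform feature: it assumes you have a mapping from each document id to its email family id (or `None` for loose files), for example built from an exported log.

```python
from collections import defaultdict


def broken_families(family_of, to_delete):
    """Return the family ids that the deletion would leave partially deleted."""
    members = defaultdict(set)
    for doc_id, family_id in family_of.items():
        if family_id is not None:  # loose files belong to no family
            members[family_id].add(doc_id)
    selected = set(to_delete)
    return sorted(
        family_id
        for family_id, ids in members.items()
        if ids & selected and not ids <= selected  # some, but not all, members selected
    )


# Hypothetical ids: two email families and one loose file.
family_of = {
    "EMAIL-1": "FAM-1", "ATT-1": "FAM-1",
    "EMAIL-2": "FAM-2", "ATT-2": "FAM-2",
    "LOOSE-1": None,
}

print(broken_families(family_of, ["ATT-1", "LOOSE-1"]))  # ['FAM-1']
print(broken_families(family_of, ["EMAIL-2", "ATT-2"]))  # []
```

The first selection removes an attachment while keeping its parent email, so FAM-1 is reported. The second removes a whole family and is reported as safe.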

Email Threading

After applying Near Duplication to the documents in the data set, you can detect similar emails that are part of the same email chain and work with them together. The Email Threading page will help you identify related email families by content in order to find all the emails in a group, detect missing emails, and keep only the relevant final email messages that need to be reviewed.

Consistency Check (Near-Duplicates)

Identify documents that potentially should be marked responsive, privileged, work product or confidential, based on computer identification of near-duplicates.

How to Identify Large NearDup Document Groups

For more information please visit our technical page.

Mass Tagging Near-Duplicates

You can review and tag multiple documents that were detected as part of the same near-duplicate group.

Further Assistance

We also offer Project Management and Technical Services. If engaged, we can support your near-duplicate efforts by helping to execute specific requests for document identification. Please contact your sales rep or our Support Center if needed.