Lexbe's NearDup service identifies files and emails that are nearly identical (50% similar in text content). Examples include multiple versions of a Microsoft Word document that were slightly modified over a period of time, or a chain of email threads that are frequent and repetitive. NearDup groups the identified documents and emails by similarity percentage, and you can then perform various types of near-duplicate document analyses using the capabilities of Lexbe eDiscovery Platform. Because identification is based on text content after text extraction and OCR, Lexbe will identify as near-duplicates documents that are not exact duplicates. Examples of documents that might seem like exact duplicates, but differ in some way from the computer-analysis perspective, are:
>Documents that were scanned at different times.
>Email Threading.
>Documents that were saved to PDF at different times.
>Documents with small unobservable editing changes.

Duplication v. Near Duplication
Please note that Near Duplication can be used independently of Deduplication, and vice versa.
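The difference between exact duplication and near duplication can be illustrated with a short sketch. This is not Lexbe's actual algorithm, which the article does not describe: the word-level `SequenceMatcher` ratio, the SHA-256 comparison, the function names, and the sample texts are all assumptions chosen for illustration. Only the 50% threshold comes from the description above.

```python
import hashlib
from difflib import SequenceMatcher


def is_exact_duplicate(text_a: str, text_b: str) -> bool:
    """Exact duplicates: identical text yields identical hashes."""
    digest_a = hashlib.sha256(text_a.encode("utf-8")).hexdigest()
    digest_b = hashlib.sha256(text_b.encode("utf-8")).hexdigest()
    return digest_a == digest_b


def similarity(text_a: str, text_b: str) -> float:
    """Word-level similarity between 0.0 and 1.0 (illustrative metric only)."""
    return SequenceMatcher(None, text_a.split(), text_b.split()).ratio()


def is_near_duplicate(text_a: str, text_b: str, threshold: float = 0.5) -> bool:
    """Near duplicates: similarity at or above the threshold (assumed 50%)."""
    return similarity(text_a, text_b) >= threshold


# Hypothetical extracted text from three documents.
draft_v1 = "The parties agree to settle all claims by March 1"
draft_v2 = "The parties agree to settle all claims by April 15"
unrelated = "Lunch menu for the quarterly offsite meeting"

print(is_exact_duplicate(draft_v1, draft_v2))  # False: the text differs
print(is_near_duplicate(draft_v1, draft_v2))   # True: most words match
print(is_near_duplicate(draft_v1, unrelated))  # False: no words in common
```

Two drafts that differ by a couple of words fail the exact-duplicate test but still pass the near-duplicate test, which is why the two features can be used independently.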
Benefits Of Near Duplication
>Find similar documents once a 'hot doc' has been identified.
>Reduce the chance of inadvertent release of privileged documents by allowing identification of documents similar to ones marked privileged or work product.
>Reduce the number of documents hosted in a review environment.
>Reduce review time by allowing large batches of similar documents to be reviewed and coded at one time.
>Increase consistency of review by allowing similar documents to be coded the same.
>Include near-duplicate documents when creating review sets.
>Increase quality control of outgoing productions.

Running Near-Duplicate & Emails
When Lexbe looks for near-duplicates, it runs vertically through the custodian. For example, if custodian-sender 'John' sent an email with an attachment to several recipients, Lexbe will look vertically in John's documents for near-duplicates. The system then takes both the email and the attachment (the entire family) and marks them together as near-duplicates, so when you delete John's email you also delete the attachment. Please note that separate versions of the same document might be attached to emails from different custodians; these are not marked as exact duplicates and are retained for email family integrity. So even if you find all the duplicates of a document in one custodian, the same document might also appear in a different custodian's documents. You can identify related email families by content in order to identify all the emails in a group, detect missing emails, and keep only the relevant final email messages that need to be reviewed.

How It Works
Start near duplication from the 'Case>Add Case Documents' page by clicking the 'Calculate Groups' button. This feature only marks near-duplicates; it does not delete them (no batch title selection necessary). When started, all existing files with similar contents are grouped across the entire case (e.g., 'Group 1', 'Group 2', etc.).
Care should be used in re-running near duplication, as different files may be identified as near-duplicates in subsequent runs. Only Account Admin Users can apply near duplication (although all users can view the results). After clicking the 'Calculate Groups' button, you will see the dialog box below:
Other pages in Lexbe eDiscovery Platform will also display the near-duplicate processing progress (located at the upper-right corner of the page).
The progress bar displays the number of files still being processed (near-deduped). Once the near-duplicates in the queue have finished grouping, both the message and the progress bar disappear. If you want to start productions, download files to Briefcases, or apply Deduplication, please wait until the files have been completely near-deduped on the server. Please note that near duplication can take a long time to run, depending on how many new documents have been added. You can tell when it is done by checking the progress bar (no more processing steps running).
Lexbe eDiscovery Platform categorizes near-duplicate files under the same group with an identical numerical value (e.g., 'Group 1'), which can be accessed from the Browse and Search pages (Sort, Show Fields or Select Filters).
When working with multiple cases and a high volume of documents, filters narrow the search to a specific set of documents and let you view only near-duplicates. Near duplication does not change the file count within a case (no automatic deletion); it only categorizes the files under the same numeric group. You can also use this feature to create shared filters of near-duplicates.
1-Running a Filter for All Near-Duplicates. From the Search or Browse pages, click on Filter>Select Filter>Show Near Dup Groups Only. This displays all the near-duplicates classified in groups.
2-Filtering and Displaying One Group. To narrow down the results and show only one specific Near Dup Group, apply a filter by 'Near Dup Group No.', for example 3472.
3-Sorting on Specific Groups. You can also sort on the groups by clicking on the field title 'Near Dup Group'.
4-Exporting Log to Excel. By exporting the near-duplicate log to an Excel spreadsheet, you can keep track of produced documents from which privileged information might not have been removed. An Excel log allows you to filter, sort, and see whether any near-duplicates are inconsistently coded (e.g., confidentiality, email threading, attachment, etc.).

Manually Identifying & Reviewing Near Duplicates
If needed, and subject to your own manual review and quality control, you may apply further review once you have filtered near-duplicates by group, and then determine which files to batch code, review, or delete. A thorough manual review is also necessary to be sure that the files are not really different documents. Here are the steps to try this approach:
>From the Browse or Search pages, after choosing to show only 'Near Dup Group', go to Field>Show Field Section>Built-in Doc Fields.
>Select the Original Title, Ext, Pages, Words, Size, and IsEmailAttachment columns. This will help you compare files with similar contents.

Risks Of Deleting Email Attachments
We recommend that you do not delete near-duplicates marked as 'IsEmailAttachment' in Lexbe eDiscovery Platform, so that users reviewing the email collection can establish how the content was distributed and who may have been sharing information. Often, emails (MSGs) created by different custodians contain the same attachments.

Email Threading
Detect and work with similar emails that are part of an email chain after applying Near Duplication to the documents in the data set. The Email Threading page will help you identify related email families by content in order to identify all the emails in a group, detect missing emails, and keep only the relevant final email messages that need to be reviewed.
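The exported log described under 'Exporting Log to Excel' can also be checked outside of Excel. The following is a minimal sketch, assuming the log has been saved as CSV and contains a 'Near Dup Group' column plus a coding column. The 'Title' and 'Privileged' column names, the sample rows, and the `find_inconsistent_groups` helper are hypothetical; adjust them to match the fields actually present in your export.

```python
import csv
import io
from collections import defaultdict


def find_inconsistent_groups(csv_file, group_col="Near Dup Group",
                             coding_col="Privileged"):
    """Return {group: sorted distinct coding values} for groups whose
    near-duplicates were not all coded the same way."""
    values_by_group = defaultdict(set)
    for row in csv.DictReader(csv_file):
        group = row[group_col].strip()
        if group:  # skip documents that belong to no near-dup group
            values_by_group[group].add(row[coding_col].strip())
    return {g: sorted(v) for g, v in values_by_group.items() if len(v) > 1}


# Hypothetical export; in practice, pass open("log.csv", newline="") instead.
sample_log = io.StringIO(
    "Title,Near Dup Group,Privileged\n"
    "contract_v1.docx,3472,Yes\n"
    "contract_v2.docx,3472,No\n"
    "memo_a.pdf,15,No\n"
    "memo_b.pdf,15,No\n"
    "notes.txt,,No\n"
)

inconsistent = find_inconsistent_groups(sample_log)
print(inconsistent)  # {'3472': ['No', 'Yes']}
```

Groups reported here are candidates for manual review only; near-duplicates can legitimately be coded differently, so the output should not be treated as a list of errors.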

Consistency Check (Near-Duplicates)
Identify documents that potentially should be marked responsive, privileged, work product or confidential, based on computer identification of near-duplicates.

How to Identify Large NearDup Documents Grouping
For more information, please visit our technical page.

Mass Tagging Near-Duplicates
You can review and tag multiple documents detected under the same near-duplicate group.

Further Assistance
We also offer Project Management and Technical Services, if engaged, to support your near-duplicate efforts by helping to execute specific requests for document identification. Please contact your sales rep or our Support Center if needed.