OCR

This technical note describes aspects of Lexbe eDiscovery OCR services.

Technical Note Subject    

Optical Character Recognition (OCR)

Technical Note Details

OCR, or 'optical character recognition', is the process of taking image files (e.g., scanned from documents) and electronically converting them into searchable text. 

OCR of Single Page TIFFs

When OCRing supported image files (e.g., TIFF images), Lexbe eDiscovery creates a corresponding text file containing the OCR text.  For example, if a TIFF image is named SMITH 00000100.TIF, the corresponding text file containing the results of OCR is named SMITH 00000100.TXT.  Single-page TIFFs are not OCRed in Lexbe eDiscovery Platform.
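
The naming convention above can be sketched in a few lines of Python. This is only an illustration of the image-to-text file pairing, not Lexbe's implementation; the function name ocr_text_name is hypothetical, and the sketch assumes the text file keeps the image's base name and takes an upper-case .TXT extension, as in the example.

```python
from pathlib import PurePath


def ocr_text_name(image_name: str) -> str:
    """Return the name of the OCR text file paired with an image file.

    Hypothetical helper: keeps the base name and swaps the extension
    for .TXT, mirroring SMITH 00000100.TIF -> SMITH 00000100.TXT.
    """
    return str(PurePath(image_name).with_suffix(".TXT"))


print(ocr_text_name("SMITH 00000100.TIF"))
```

A mapping like this is handy when checking a load file, because every image should have exactly one text file with the same base name.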

OCR of Image PDFs

Lexbe eDiscovery and Lexbe eDiscovery Platform use OCR technology to convert image-only PDFs into text-under-image PDFs.  This means the original document image (or scan) is saved and the text is added to the file in a hidden layer, so that the document can be searched (and 'copy and paste' is available), but the appearance of the document remains unchanged.  Lexbe eDiscovery and Lexbe eDiscovery Platform will OCR all image-based PDFs (discarding any prior OCR text and replacing it with new OCR).

OCR Settings

When we apply OCR, we use the following settings on eDiscovery jobs and for cases in Lexbe eDiscovery Platform: deskew, autorotate and despeckle.  These are general software settings and are not 100% successful in their aim, as is the case for all automatic OCR software.  In particular, the autorotate function can sometimes fail to properly rotate a document.  OCR in Lexbe eDiscovery Platform and Lexbe eDiscovery jobs is set to recognize English, and not other languages (except if International OCR is ordered as part of an eDiscovery job), so quality will suffer for non-English text.

Identification of Documents to OCR or re-OCR

Lexbe eDiscovery Platform will OCR all image-based PDFs (discarding any prior OCR text and replacing it), unless 'No OCR' is selected at the time of upload.
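
The rule above can be expressed as a small decision function. This is a sketch under stated assumptions, not the platform's actual code: the UploadedPdf record, its fields, and the should_ocr function are hypothetical names introduced only to make the rule concrete.

```python
from dataclasses import dataclass


@dataclass
class UploadedPdf:
    """Hypothetical record describing a PDF at upload time."""
    name: str
    is_image_based: bool      # assumed flag: the pages are scans or images
    has_prior_ocr_text: bool  # assumed flag: an earlier OCR layer exists


def should_ocr(doc: UploadedPdf, no_ocr_selected: bool) -> bool:
    """Apply the rule in this section.

    Image-based PDFs are OCRed unless 'No OCR' was selected at upload.
    Prior OCR text does not exempt a file; it is discarded and replaced.
    """
    if no_ocr_selected:
        return False
    return doc.is_image_based


scan = UploadedPdf("scan.pdf", is_image_based=True, has_prior_ocr_text=True)
print(should_ocr(scan, no_ocr_selected=False))
print(should_ocr(scan, no_ocr_selected=True))
```

Note that has_prior_ocr_text is deliberately ignored by the decision: an existing text layer is replaced rather than preserved.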

OCR Limitations

>Lexbe eDiscovery OCR is an automated service and there is no manual review or correction as part of our service.  OCR almost invariably produces errors.

>OCR is a highly useful tool, but is far from perfect.  OCR does best with clearly readable text from high-quality scans.  OCR quality degrades with copy quality.  OCR quality can also degrade, or OCR may not be done at all, with skewed or rotated pages, pages with unusual fonts, pages with dirty or speckled backgrounds, pages scanned at low resolution, etc.

>There are many reasons why OCR may not complete successfully.  File corruption is one reason.  Even when a file will open, some pages may be corrupt and prevent OCR from running successfully.  File print security is another.  For example, producers of PDFs often place print or content extraction restrictions on PDFs.  This will prevent OCR from running.  File open passwords will also prevent PDFs from being OCRed.

>OCR almost always produces errors, and sometimes will produce many errors.  OCR is best thought of as an adjunct to actual review of the file itself in native, TIFF or PDF format, rather than a complete substitution.

>A non-exclusive list of other possible errors includes: omitting materials to be OCRed; missing pages or files; skipping password-protected files; skipping files with print, extract or other limitations on the file permissions; missing text in files that are corrupted or of an unrecognized format; and failing to recognize rotated or skewed pages.

>OCR of PDFs works on flattened image PDFs only, and not on some complex PDFs, including PDFs with embedded attachments or Portfolio PDFs.

>PDFs optionally have a number of security features like password protection, print protection and text-extraction prevention, that can complicate, confound or prevent OCR.  The PDF standard is evolving and new features added by Adobe or other developers (as part of the standard or not) can impair the ability of files to be OCRed successfully.

>Lexbe OCR recognizes Unicode and will therefore apply OCR to many non-English languages.  However, our OCR engine uses an English-language dictionary look-up only to aid in OCR accuracy, and does not use a dictionary for other languages.  This reduces the accuracy of non-English OCR.

>Lexbe OCR, like other OCR software, attempts to generate searchable text from handwritten files, but because OCR does best with printed fonts where the words were typed, text extracted from handwriting is usually not very accurate.  It works best with carefully printed handwriting and worst with sloppy writing.
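
The blocking conditions named in the list above (corruption, open passwords, print or extraction restrictions, embedded attachments, Portfolio PDFs) can be collected into a simple pre-upload checklist. The sketch below is illustrative only: PdfFacts and ocr_blockers are hypothetical names, and the flags are assumed to have been determined beforehand with whatever PDF tool you use. It does not inspect any file itself.

```python
from dataclasses import dataclass


@dataclass
class PdfFacts:
    """Hypothetical, manually gathered facts about one PDF."""
    corrupt_pages: bool = False
    open_password: bool = False
    print_restricted: bool = False
    extraction_restricted: bool = False
    has_embedded_attachments: bool = False
    is_portfolio: bool = False


def ocr_blockers(facts: PdfFacts) -> list[str]:
    """Return the reasons, taken from the list above, that OCR may fail."""
    checks = [
        (facts.corrupt_pages, "file or page corruption"),
        (facts.open_password, "file open password"),
        (facts.print_restricted, "print restriction"),
        (facts.extraction_restricted, "content extraction restriction"),
        (facts.has_embedded_attachments, "embedded attachments"),
        (facts.is_portfolio, "Portfolio PDF"),
    ]
    return [reason for flagged, reason in checks if flagged]


print(ocr_blockers(PdfFacts(open_password=True, is_portfolio=True)))
```

An empty result means none of the listed blockers apply; it does not guarantee error-free OCR, since scan quality still affects accuracy.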

Working with Cursive Handwriting Files

We created a technical note that explains how you can work with handwritten files in conjunction with other features in Lexbe eDiscovery Platform.