Corrupt PDF Files

This technical note will show you how to detect and work with corrupt PDF files. Please note that this type of data-related issue occurs prior to uploading original files to a case in Lexbe.

How Do PDF Documents Get Corrupted?

PDF is a complex format; its specification runs to more than 1000 pages. PDFs also contain embedded objects, such as different types of fonts, images and compressed data, which are complex in their own right and have specifications that can be even larger than the PDF specification itself.

Why Do PDF Files Corrupt?

Reason 1: Incorrect PDF creators

There are countless PDF products available, and virtually none of them supports everything the PDF format offers. Only a few of them actually create valid PDFs; most freeware or homemade PDF creators have flaws. These flaws often go undetected at first because the widely used PDF viewer applications detect and repair such errors on the fly. The creator of the PDF does not even notice that the file is corrupt, because the viewer silently fixes or ignores the problem. Often the creator's goal is not to produce a valid PDF, but merely a PDF that can be viewed.

Reason 2: Binary file is damaged

PDF is a binary format, and most of its content is compressed. Editing a PDF file with a text editor, or transmitting it in text mode instead of binary mode (e.g. over FTP), corrupts the file. A partial transmission cuts off part of the document, and the lost information cannot be recovered.

There are further reasons, but the two reasons mentioned are certainly the most common.
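
Damage to the binary file, especially a file that was cut off in transit, can sometimes be spotted with a very cheap check before upload. The sketch below is only an illustration: it relies on the fact that a PDF normally begins with a `%PDF-` header and ends with a `%%EOF` marker. The function name `quick_pdf_check` and the `tail_bytes` parameter are made up for this example. A file that fails the check is suspect; a file that passes may still be corrupt in other ways.

```python
from pathlib import Path


def quick_pdf_check(path, tail_bytes=1024):
    """Return a list of problems found by a cheap structural check (empty if none)."""
    data = Path(path).read_bytes()
    problems = []
    # A PDF normally begins with a "%PDF-" header comment.
    if not data.startswith(b"%PDF-"):
        problems.append("missing %PDF- header")
    # A complete file carries the "%%EOF" marker in its last lines;
    # its absence suggests the file was cut off.
    if b"%%EOF" not in data[-tail_bytes:]:
        problems.append("missing %%EOF marker near end of file")
    return problems
```

Calling `quick_pdf_check("example.pdf")` on a local copy returns an empty list when both markers are present. Some viewers tolerate files that fail this check, so treat the result as a hint rather than a verdict.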

What Does Lexbe Do with Corrupt PDF files?

If Lexbe detects a corrupt PDF during processing, it creates a placeholder file, and the placeholder is displayed in place of the document. Please note that not all corruptions can be detected automatically during our automated process.

How to Detect Corruptions?

>Look for PDFs with placeholders by displaying the 'Placeholder' column on the Browse or Search pages, and then open them in the Document Viewer. The most obvious sign of a problem with a PDF document is that it does not open in a PDF viewer application, that an error message appears when opening it, or that part of the document cannot be displayed correctly.



>PDFs with 1 page and more than 300 words.
>Very small or very large PDFs
>PDFs with many pages (sometimes in thousands) often have corruption issues.
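
The warning signs in this list can be turned into a simple triage rule once you have page counts, word counts and file sizes for your PDFs, however you obtain them. The sketch below is illustrative only. The `PdfInfo` record and the function name are hypothetical, the 300-word limit comes from the list above, and the page and size thresholds are assumptions chosen for the example, not Lexbe values.

```python
from dataclasses import dataclass

MAX_WORDS_SINGLE_PAGE = 300   # from the list above
MANY_PAGES = 1000             # assumed reading of "thousands"
MIN_BYTES = 1024              # assumed cut-off for "very small"
MAX_BYTES = 500 * 1024 ** 2   # assumed cut-off for "very large"


@dataclass
class PdfInfo:
    """Hypothetical record holding what is known about one PDF."""
    name: str
    pages: int
    words: int
    size_bytes: int


def suspicion_reasons(info):
    """Return the warning signs that apply to one PDF (empty list if none)."""
    reasons = []
    if info.pages == 1 and info.words > MAX_WORDS_SINGLE_PAGE:
        reasons.append("single page with more than 300 words")
    if info.size_bytes < MIN_BYTES:
        reasons.append("very small file")
    if info.size_bytes > MAX_BYTES:
        reasons.append("very large file")
    if info.pages >= MANY_PAGES:
        reasons.append("very many pages")
    return reasons
```

A PDF that matches one of these rules is not necessarily corrupt; the rules only help decide which documents to open and inspect first.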


For most users these are the only situations in which they become aware that a document is corrupt. Any other corruption that has no direct impact on viewing the document is often ignored. If documents are being archived or must be of good quality for other reasons, they can be analyzed using a PDF analysis tool.

The 3-Heights™ PDF Analysis & Repair API analyzes documents and detects whether they are valid or not according to the PDF specification.

>Open the PDF locally. A simpler test of whether a document is valid is to open it in Adobe Acrobat Professional and close it again. If you are prompted to save the document, this can indicate that the document was corrupt and that the repaired version is what is now being displayed. This test, however, gives no information about what was corrupt, i.e. what was repaired. The save prompt may also have a cause unrelated to corruption, such as JavaScript in the document.

>We could also attempt to manually convert these files as part of Technical Services using different tools ($150/hr). If you want us to prepare a manual or semi-automatic file conversion and research discovery specification as part of our technical support services (hourly billing), please contact your sales representative.

Can or Should My PDF Be Repaired?

It depends on the relevance of the documents. You can also create a production and apply Bates stamps to the PDF version of the files. The document production will generate a briefcase containing the folders ‘ORIGINALS’ and ‘PDF’ (blended productions). If the PDF version of a file is not legible, the same file in its native format under the ‘ORIGINALS’ folder can be opened in Microsoft Office 2000 or another native application on a local computer and reviewed.

How Can a PDF Be Repaired?

You can download corrupted PDFs to your local desktop and try the following options:
1-Print the file with a virtual printer driver that lets you print to PDF
2-Repair the file with third-party software
3-Print the existing PDF to paper and then re-scan it to PDF

We could also attempt to manually repair the files as part of Technical Services using different tools ($150/hr). If you want us to prepare a manual or semi-automatic file conversion and research discovery specification as part of our technical support services (hourly billing), please contact your sales representative.
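
For those who prefer to script the third-party software option on a local computer, one possibility is Ghostscript, which can read a PDF and write it out again as a new file. The sketch below is an assumption-laden illustration, not a Lexbe tool. It assumes Ghostscript is installed and available as `gs` (the executable has a different name on Windows), and the two function names are made up for this example. Always work on a copy of the file.

```python
import shutil
import subprocess


def ghostscript_rewrite_command(src, dst, gs="gs"):
    """Build the command that re-writes src into a new PDF at dst."""
    # -o names the output file and runs without interactive prompts;
    # -sDEVICE=pdfwrite selects Ghostscript's PDF writer.
    return [gs, "-o", dst, "-sDEVICE=pdfwrite", src]


def try_rewrite(src, dst):
    """Run Ghostscript if it is installed; return True when it exits without error."""
    if shutil.which("gs") is None:
        return False  # Ghostscript not found on this computer
    result = subprocess.run(ghostscript_rewrite_command(src, dst),
                            capture_output=True)
    return result.returncode == 0
```

For example, `try_rewrite("damaged_copy.pdf", "rewritten.pdf")` attempts the rewrite. A successful exit does not guarantee that all content was recovered, so open the new file and compare it with the original before relying on it.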