Why do PDFs change when you process them?

Tags: PdfStamper, append mode, iText 5

In the next code snippet, I use PdfStamper, but I don't change anything. I just take the original metadata and I put it back unchanged:

public void manipulatePdf(String src, String dest)
        throws IOException, DocumentException {
    PdfReader reader = new PdfReader(src);
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
    Map<String, String> info = reader.getInfo();
    stamper.setMoreInfo((HashMap<String, String>) info);
    stamper.close();
    reader.close();
}
Although I didn't change anything in the src file, the dest file contains small differences. When I calculate a hash for both files, I get two different results. Why is that?

Posted on StackOverflow on Nov 6, 2014 by brian

If you read ISO-32000-1, you'll see that, by design, no two PDFs are identical. One of the most typical differences between two PDFs is the ID:

From ISO-32000-1:

ID: An array of two byte-strings constituting a file identifier.

From Section 14.4, entitled "file identifiers":

The value of this entry shall be an array of two byte strings. The first byte string shall be a permanent identifier based on the contents of the file at the time it was originally created and shall not change when the file is incrementally updated. The second byte string shall be a changing identifier based on the file’s contents at the time it was last updated. When a file is first written, both identifiers shall be set to the same value. If both identifiers match when a file reference is resolved, it is very likely that the correct and unchanged file has been found. If only the first identifier matches, a different version of the correct file has been found.

If you create a PDF from scratch, the ID consists of two identical identifiers. When you update the PDF to add something, the first identifier is preserved and the second identifier is changed. If you update the PDF again to remove that something, the second identifier changes once more, and by definition it should not be identical to the first identifier, because you are at a different stage of the workflow.

Few tools create PDFs in which both identifiers are identical. That's because a PDF that is created from scratch is usually manipulated before the final version is saved to disk. Just create a PDF using Adobe Acrobat to reproduce this: you'll notice that the identifier pair consists of two different values. It is therefore pointless to ask whether we can create a situation in which the second identifier is made identical to the first one.

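Moreover, the PDF format doesn't prescribe a fixed order for the objects in a file: a tool is free to renumber and reorder them when it rewrites the document. Relying on hashes to decide whether two PDFs are the same therefore goes against the PDF standard (see also the previous question).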

How to solve this problem?

In an earlier question, you indicated that you want to add custom metadata and then remove it. In my answer to that question, I explained how to add metadata to an existing PDF using a PdfStamper instance:

PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));

This creates a new PDF file in which the objects may be reordered. You can use PdfStamper in append mode by changing this line to:

PdfStamper stamper = new PdfStamper(reader,
    new FileOutputStream(dest), '\0', true);

Now you are creating an incremental update of your PDF file.

What is an incremental update?

Suppose that your original PDF file looks like this:

%PDF-1.4
% plenty of PDF objects and PDF syntax
%%EOF

When you use iText to manipulate such a file, you get an altered PDF file:

%PDF-1.4
% plenty of altered PDF objects and altered PDF syntax
%%EOF

During this process, objects can be renumbered, reorganized, and so on. If you add something in a first pass and remove it in a second pass, you can expect the PDF to look the same to the human eye when you open it in a PDF viewer, but you should not expect the PDF syntax to be identical.

However, when you use PdfStamper in append mode to perform an incremental update, you get an incrementally updated PDF:

%PDF-1.4
% plenty of PDF objects and PDF syntax
%%EOF
% updates for PDF objects and PDF syntax
%%EOF

In this case, the original bytes of the original PDF aren't changed. The file size gets bigger because you'll now have some redundant information (some objects will no longer be used, or you'll have an old version of some objects along with a new version), but the advantage of using an incremental update is that you can always go back to the original file.
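The "original bytes aren't changed" property can be checked directly: the incrementally updated file should begin with every byte of the original. The following is only a minimal sketch using the Java standard library; the class and method names (PrefixCheck, startsWithOriginal) are made up for illustration, and the byte arrays are toy stand-ins for real files.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class PrefixCheck {
    /** Returns true if the first original.length bytes of updated equal original. */
    static boolean startsWithOriginal(byte[] original, byte[] updated) {
        if (updated.length < original.length) {
            return false;
        }
        return Arrays.equals(original, Arrays.copyOf(updated, original.length));
    }

    public static void main(String[] args) {
        // Toy stand-ins for real PDF files (hypothetical content).
        byte[] original = "%PDF-1.4\n% objects\n%%EOF\n"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] appended = "%PDF-1.4\n% objects\n%%EOF\n% updates\n%%EOF\n"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] rewritten = "%PDF-1.4\n% altered objects\n%%EOF\n"
                .getBytes(StandardCharsets.US_ASCII);

        System.out.println(startsWithOriginal(original, appended));  // true
        System.out.println(startsWithOriginal(original, rewritten)); // false
    }
}
```

With real files, the two arrays would typically come from java.nio.file.Files.readAllBytes for src and dest.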

It's sufficient to search for the second-to-last occurrence of %%EOF and to remove all the bytes that follow it. You'll get a truncated PDF file like this:

%PDF-1.4
% plenty of PDF objects and PDF syntax
%%EOF

You can now take a hash of this truncated PDF file and compare it with the hash of the original PDF file. These hashes will be identical.
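The truncate-and-compare step could be sketched as follows. This is an illustration under stated assumptions, not part of iText: the names RevisionTruncator, dropLastRevision and sha256Hex are hypothetical, SHA-256 is just one possible hash, and the sketch assumes that the byte sequence %%EOF only occurs as an end-of-file marker (it does not look inside streams).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

final class RevisionTruncator {
    private static final byte[] EOF_MARKER =
            "%%EOF".getBytes(StandardCharsets.US_ASCII);

    /** Start index of the last marker beginning at or before fromIndex, or -1. */
    static int lastIndexOf(byte[] data, byte[] marker, int fromIndex) {
        for (int i = Math.min(fromIndex, data.length - marker.length); i >= 0; i--) {
            boolean match = true;
            for (int j = 0; j < marker.length; j++) {
                if (data[i + j] != marker[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    /** Returns the bytes up to and including the second-to-last %%EOF. */
    static byte[] dropLastRevision(byte[] pdf) {
        int last = lastIndexOf(pdf, EOF_MARKER, pdf.length);
        if (last < 0) {
            throw new IllegalArgumentException("no end-of-file marker found");
        }
        int previous = lastIndexOf(pdf, EOF_MARKER, last - 1);
        if (previous < 0) {
            throw new IllegalArgumentException("only one end-of-file marker found");
        }
        return Arrays.copyOf(pdf, previous + EOF_MARKER.length);
    }

    /** SHA-256 of data as lowercase hex. */
    static String sha256Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Toy stand-ins for the original and the incrementally updated file.
        byte[] original = "%PDF-1.4\n% objects\n%%EOF"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] updated = "%PDF-1.4\n% objects\n%%EOF\n% updates\n%%EOF\n"
                .getBytes(StandardCharsets.US_ASCII);

        byte[] truncated = dropLastRevision(updated);
        System.out.println(sha256Hex(original).equals(sha256Hex(truncated))); // true
    }
}
```

In this toy input the original ends exactly at %%EOF, which is why the hashes match without any further adjustment.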

Caveat: beware of the whitespace characters that follow %%EOF. They can cause a minimal difference at the byte level that causes the hashes to be different.
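One way to cope with this caveat is to try more than one cut point: directly after the second-to-last %%EOF, and after an end-of-line sequence (CR, LF, or CR LF) that may follow it. The sketch below is only an assumption about how such a check could look; EofWhitespace, candidateOriginals and matchesAnyCandidate are hypothetical names, SHA-256 is an arbitrary choice, and other kinds of trailing whitespace are not handled.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class EofWhitespace {
    private static final String EOF_MARKER = "%%EOF";

    /** Truncations ending after the second-to-last %%EOF and after a following CR and/or LF. */
    static List<byte[]> candidateOriginals(byte[] updated) {
        // ISO-8859-1 maps each byte to exactly one char, so string indices equal byte offsets.
        String text = new String(updated, StandardCharsets.ISO_8859_1);
        int last = text.lastIndexOf(EOF_MARKER);
        int previous = last < 0 ? -1 : text.lastIndexOf(EOF_MARKER, last - 1);
        List<byte[]> candidates = new ArrayList<>();
        if (previous < 0) {
            return candidates; // fewer than two markers: nothing to strip
        }
        int end = previous + EOF_MARKER.length();
        candidates.add(Arrays.copyOf(updated, end));
        if (end < updated.length && updated[end] == '\r') {
            candidates.add(Arrays.copyOf(updated, ++end));
        }
        if (end < updated.length && updated[end] == '\n') {
            candidates.add(Arrays.copyOf(updated, ++end));
        }
        return candidates;
    }

    static byte[] sha256(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** True if any candidate truncation hashes to expectedDigest. */
    static boolean matchesAnyCandidate(byte[] updated, byte[] expectedDigest) {
        for (byte[] candidate : candidateOriginals(updated)) {
            if (MessageDigest.isEqual(sha256(candidate), expectedDigest)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Toy original that ends with CR LF after the marker.
        byte[] original = "%PDF-1.4\n% objects\n%%EOF\r\n"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] updated = "%PDF-1.4\n% objects\n%%EOF\r\n% updates\n%%EOF\n"
                .getBytes(StandardCharsets.US_ASCII);

        byte[] naiveCut = candidateOriginals(updated).get(0);
        System.out.println(MessageDigest.isEqual(sha256(naiveCut), sha256(original))); // false
        System.out.println(matchesAnyCandidate(updated, sha256(original)));            // true
    }
}
```

Cutting right after %%EOF fails here because the original file ended with CR LF; the third candidate, which keeps both bytes, reproduces the original hash.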