How to find out if a PDF file is compressed or not?

Tags: compression, compressed xref, compressed objects, iText 5

We are using iText to decompress PDFs, but before doing so, we want to know if an existing PDF is already compressed or not. Is there any way we can check if a PDF is compressed or not?

Posted on StackOverflow on Dec 5, 2013 by Vicky

In PDF 1.0 (1993), a PDF file consisted of a mix of ASCII characters for the PDF syntax and binary code for objects such as images. A page stream would contain visible PDF operators and operands, for instance:

56.7 748.5 m
136.2 748.5 l
S

This code tells you that a line has to be stroked (S) between two points. The first is the coordinate (x = 56.7; y = 748.5), because that's where the current point is moved to with the m operator. The second is the coordinate (x = 136.2; y = 748.5), because the l operator appends a straight line segment to the path, ending at that position.
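To make the role of the three operators concrete, here is a minimal sketch of how a reader could interpret such a fragment. It is not how iText or any real viewer parses content streams; the class name PathSketch and the method describe are made up for this illustration, and it only understands numbers plus the m, l and S operators.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical helper for illustration only; handles just numbers, m, l and S.
public class PathSketch {

    /** Returns one description per line segment that is actually stroked. */
    public static List<String> describe(String content) {
        Deque<Double> operands = new ArrayDeque<>(); // numbers waiting for an operator
        List<String> pending = new ArrayList<>();    // segments of the path under construction
        List<String> stroked = new ArrayList<>();    // segments painted by S
        double curX = 0, curY = 0;                   // current point

        for (String token : content.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;                            // empty input
            }
            switch (token) {
                case "m": {                          // move the current point
                    double y = operands.pop();
                    double x = operands.pop();
                    curX = x;
                    curY = y;
                    break;
                }
                case "l": {                          // append a straight segment to the path
                    double y = operands.pop();
                    double x = operands.pop();
                    pending.add("line (" + curX + ", " + curY + ") -> (" + x + ", " + y + ")");
                    curX = x;
                    curY = y;
                    break;
                }
                case "S":                            // stroke the path, then discard it
                    stroked.addAll(pending);
                    pending.clear();
                    break;
                default:                             // anything else is treated as a number
                    operands.push(Double.parseDouble(token));
            }
        }
        return stroked;
    }

    public static void main(String[] args) {
        // prints: line (56.7, 748.5) -> (136.2, 748.5)
        describe("56.7 748.5 m\n136.2 748.5 l\nS").forEach(System.out::println);
    }
}
```

Note that nothing is reported until S is reached: m and l only construct the path, S is what paints it.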

Starting with PDF 1.2 (1996), Flate compression became available as a filter for such content streams (page content streams, form XObjects). In most cases, you'll discover a /Filter entry with value /FlateDecode in the stream dictionary. You'll hardly find any "modern" PDFs whose content streams aren't compressed.
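As an illustration of what /FlateDecode means for the bytes of a stream, the sketch below compresses the three-line fragment from above with java.util.zip and inflates it again. The class name FlateSketch and its methods are assumptions made for this example; in a real PDF you would first have to locate the stream and read its dictionary, which this sketch doesn't attempt.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper for illustration only; not part of iText.
public class FlateSketch {

    /** Compresses raw bytes in zlib format, the format a /FlateDecode stream holds. */
    public static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Restores the original bytes; fails if the input isn't complete zlib data. */
    public static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported Flate data");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not Flate-compressed data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = "56.7 748.5 m\n136.2 748.5 l\nS\n".getBytes(StandardCharsets.US_ASCII);
        byte[] packed = deflate(raw);
        System.out.println("raw bytes:    " + raw.length);
        System.out.println("packed bytes: " + packed.length);
        System.out.println("round trip ok: " + Arrays.equals(inflate(packed), raw));
    }
}
```

For a fragment this short the compressed form isn't necessarily smaller than the original; the point is only that the operators are no longer readable until the stream is inflated.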

Before PDF 1.5 (2003), all indirect objects in a PDF document, as well as the cross-reference table, were stored in ASCII in a PDF file. Starting with PDF 1.5, specific types of objects can be stored in an object stream. The cross-reference table can also be compressed into a stream. iText's PdfReader has an isNewXrefType() method to check if this is the case. Maybe that's what you're looking for. Maybe you have PDFs that need to be read by software that isn't able to read PDFs of this type, but... you're not telling us.
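If you only want a quick look at a file without loading it in iText, a rough heuristic is to search the raw bytes for the /Type /XRef and /Type /ObjStm entries that mark a cross-reference stream and an object stream. The sketch below does that with the standard library only; the class name XrefSketch and its methods are made up for this example. It can be fooled by binary data that happens to contain the same characters, so treat PdfReader's isNewXrefType() as the reliable check.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; a byte-level heuristic, not a PDF parser.
public class XrefSketch {

    private static final Pattern XREF_STREAM = Pattern.compile("/Type\\s*/XRef\\b");
    private static final Pattern OBJECT_STREAM = Pattern.compile("/Type\\s*/ObjStm\\b");

    /** True if the bytes appear to contain a cross-reference stream dictionary. */
    public static boolean hasXrefStream(byte[] pdf) {
        return XREF_STREAM.matcher(asText(pdf)).find();
    }

    /** True if the bytes appear to contain an object stream dictionary. */
    public static boolean hasObjectStream(byte[] pdf) {
        return OBJECT_STREAM.matcher(asText(pdf)).find();
    }

    // ISO-8859-1 maps every byte to exactly one char, so binary content can't break the decoding.
    private static String asText(byte[] pdf) {
        return new String(pdf, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Two hand-written fragments, not complete PDF files.
        byte[] classic = "xref\n0 1\n0000000000 65535 f \ntrailer\n<< /Size 1 >>\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] modern = "5 0 obj\n<< /Type /XRef /Size 6 /W [1 2 1] /Length 18 >>\nstream\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("classic fragment: " + hasXrefStream(classic)); // prints false
        System.out.println("modern fragment:  " + hasXrefStream(modern));  // prints true
    }
}
```

For a file on disk, you could pass the result of Files.readAllBytes(...) to these methods. The search works on the uncompressed file because the dictionaries of the streams themselves are written as plain PDF syntax, even when the stream data is compressed.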

Maybe we're completely misinterpreting the question. Maybe you want to know if you're receiving an actual PDF or a zip file with a PDF. Or maybe you want to data-mine the different filters used inside the PDF. In short: your question isn't very clear, and I hope this answer explains why you should clarify.