How to extract embedded streams?

Tags: inspect PDF, stream object, rich media, extract content, iText 7

I have embedded a byte array into a PDF file, more specifically an AVI file in a RichMedia annotation. Now I am trying to extract that same array. How can I do this?

Posted on StackOverflow on May 17, 2015 by Itai Soudry

I have written a brute-force method that extracts every stream in a PDF and stores each one as a file without an extension (see Extracting objects from a PDF):

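import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// In iText 7.2 and later this class lives in com.itextpdf.kernel.exceptions
import com.itextpdf.kernel.PdfException;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfObject;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.PdfStream;

public class ExtractStreams {
    public static final String SRC = "resources/pdfs/image.pdf";
    // %s is replaced by the object number of each stream
    public static final String DEST = "results/parse/stream%s";

    public static void main(String[] args) throws IOException {
        File file = new File(DEST);
        file.getParentFile().mkdirs();
        new ExtractStreams().manipulatePdf(SRC, DEST);
    }

    public void manipulatePdf(String src, String dest) throws IOException {
        PdfDocument pdfDoc = new PdfDocument(new PdfReader(new FileInputStream(src)));
        PdfObject obj;
        // Walk every indirect object and keep only the streams
        for (int i = 1; i <= pdfDoc.getNumberOfPdfObjects(); i++) {
            obj = pdfDoc.getPdfObject(i);
            if (obj != null && obj.isStream()) {
                byte[] b;
                try {
                    // Decoded bytes: iText applies the stream's filter(s)
                    b = ((PdfStream) obj).getBytes();
                } catch (PdfException exc) {
                    // Filter not supported: fall back to the raw bytes
                    b = ((PdfStream) obj).getBytes(false);
                }
                try (FileOutputStream fos = new FileOutputStream(String.format(dest, i))) {
                    fos.write(b);
                }
            }
        }
        pdfDoc.close();
    }
}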

Note that I get all PDF objects that are streams. I also use two different methods:

  • When I use ((PdfStream)obj).getBytes(), iText will look at the filter. For instance: page content streams consist of PDF syntax that is compressed using /FlateDecode. By using ((PdfStream)obj).getBytes(), you will get the uncompressed PDF syntax.

  • Not all filters are supported in iText. Take for instance /DCTDecode, which is the filter used to store JPEGs inside a PDF. Why and how would you "decode" such a stream? You wouldn't, and that's when we use ((PdfStream)obj).getBytes(false), which returns the bytes exactly as they are stored in the file. This is also the method you need to get your AVI bytes from your PDF.
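
To make the difference between raw and decoded bytes concrete, here is a small sketch that uses only the JDK, not iText. It assumes the stream uses /FlateDecode, which is zlib/deflate compression, so java.util.zip can stand in for what a decoder does. The class and method names (FlateRoundTrip, deflate, inflate) are made up for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class FlateRoundTrip {

    /** Compresses data in zlib format, the format used by /FlateDecode streams. */
    public static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Decompresses zlib data; this mirrors what "decoding" a Flate stream means. */
    public static byte[] inflate(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("Truncated or unsupported data");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("Not valid zlib data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // A tiny piece of page content syntax, as it would look once decoded
        byte[] decoded = "BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
                .getBytes(StandardCharsets.US_ASCII);
        // What would be stored in the file when the stream uses /FlateDecode
        byte[] raw = deflate(decoded);
        System.out.println("raw equals decoded: " + Arrays.equals(raw, decoded));
        System.out.println("inflate(raw) equals decoded: "
                + Arrays.equals(inflate(raw), decoded));
    }
}
```

In these terms, getBytes(false) corresponds to the raw array and getBytes() to the result of inflating it. For an AVI file or a JPEG there is nothing useful to inflate, so the raw bytes are the ones to write to disk.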

This example already gives you the methods you'll certainly need to extract PDF streams. Now it's up to you to find the path to the stream you need. That calls for iText RUPS. With iText RUPS you can look at the internal structure of a PDF file. In your case, you need to find the annotations as is done in this question: How to change the zoom factor in link annotations?

You loop over the page dictionaries, then loop over the /Annots array of each dictionary (if it's present). Instead of checking for /Link annotations (which is what was asked in the question I refer to), you check for /RichMedia annotations, and from there you examine the assets until you find the stream that contains the AVI file. RUPS will show you how to dive into the annotation dictionary.
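
The traversal can be sketched without any iText classes. In the sketch below, plain Java maps and lists stand in for PDF dictionaries and arrays, and a byte[] stands in for a stream. The key names (RichMediaContent, Assets, Names, EF, F) are what I would expect to see in a RichMedia annotation, but treat them as assumptions and confirm them in RUPS for your own file. The class and helper names (RichMediaPath, dictOf, findAssets) are hypothetical, and the sketch only handles an assets name tree with a flat /Names array, not one that is split into /Kids.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RichMediaPath {

    /** Builds a stand-in for a PDF dictionary from alternating keys and values. */
    public static Map<String, Object> dictOf(Object... keysAndValues) {
        Map<String, Object> m = new HashMap<>();
        for (int i = 0; i + 1 < keysAndValues.length; i += 2) {
            m.put((String) keysAndValues[i], keysAndValues[i + 1]);
        }
        return m;
    }

    @SuppressWarnings("unchecked")
    private static Map<String, Object> dict(Object o) {
        return o instanceof Map ? (Map<String, Object>) o : Collections.emptyMap();
    }

    @SuppressWarnings("unchecked")
    private static List<Object> array(Object o) {
        return o instanceof List ? (List<Object>) o : Collections.emptyList();
    }

    /** Collects the embedded file bytes of every /RichMedia annotation. */
    public static List<byte[]> findAssets(List<?> pages) {
        List<byte[]> found = new ArrayList<>();
        for (Object page : pages) {
            // /Annots is optional, so a missing entry yields an empty list
            for (Object a : array(dict(page).get("Annots"))) {
                Map<String, Object> annot = dict(a);
                if (!"RichMedia".equals(annot.get("Subtype"))) {
                    continue; // skip /Link and every other annotation type
                }
                Map<String, Object> assets =
                        dict(dict(annot.get("RichMediaContent")).get("Assets"));
                // A flat name tree: [name1, filespec1, name2, filespec2, ...]
                List<Object> names = array(assets.get("Names"));
                for (int i = 1; i < names.size(); i += 2) {
                    Object stream = dict(dict(names.get(i)).get("EF")).get("F");
                    if (stream instanceof byte[]) {
                        found.add((byte[]) stream);
                    }
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        byte[] avi = {'R', 'I', 'F', 'F'};
        Map<String, Object> fileSpec = dictOf("EF", dictOf("F", avi));
        Map<String, Object> assets = dictOf("Names", Arrays.asList("movie.avi", fileSpec));
        Map<String, Object> richMedia = dictOf(
                "Subtype", "RichMedia",
                "RichMediaContent", dictOf("Assets", assets));
        Map<String, Object> page = dictOf(
                "Annots", Arrays.asList(dictOf("Subtype", "Link"), richMedia));
        System.out.println("streams found: " + findAssets(Arrays.asList(page)).size());
    }
}
```

With iText, each lookup in this model becomes a lookup on the real dictionary or array object, and the last step is the getBytes(false) call on the stream you end up with.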
