How to extract text and anchor information from a PDF?

Tags: parsing PDF, text extraction, inspect PDF, iText 5

I am looking for a method to extract the text as well as anchor information using iText.

For example: the PDF content is "You can visit our website, XYZ, and do something" where XYZ is a clickable link. The output when extracting this content should be: "You can visit our website, XYZ ( and do something".

Basically I am trying to generate a text file with target links information.

Posted on StackOverflow on Jul 10, 2014 by user985395

The static text you can see in a PDF file is stored in content streams using PDF syntax as described in Adobe's Imaging Model.

The interactive features you can see in a PDF file are stored outside the content stream of a page, in so-called annotation dictionaries, using the Carousel Object System (COS).

You are probably making the assumption that when you see a clickable word XYZ, there is something like <a href="">XYZ</a> inside the PDF.

There isn't.

There will be something like:

/F1 12 Tf
(XYZ )Tj

somewhere in the content stream that contains the /Contents of a page.

When you inspect the /Annots of a page, you will find something like:

  /C[0 0 1]
  /Border[0 0 0]
  /Rect[36 803.52 98.03 814.62]

as an object in your PDF file.
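
The four numbers in /Rect are the lower-left and upper-right corners of the clickable area (llx, lly, urx, ury) in page coordinates. As a minimal sketch of how that rectangle relates to a text position, the plain-Java class below checks whether a point lies inside it. The class name LinkRect and its methods are hypothetical and are not part of iText.

```java
// Hypothetical helper, not an iText class: models a /Rect array as [llx lly urx ury].
class LinkRect {
    final double llx, lly, urx, ury;

    LinkRect(double[] rect) {
        // Normalize, because a PDF rectangle may list its corners in either order.
        this.llx = Math.min(rect[0], rect[2]);
        this.lly = Math.min(rect[1], rect[3]);
        this.urx = Math.max(rect[0], rect[2]);
        this.ury = Math.max(rect[1], rect[3]);
    }

    /** True if the point (x, y), for example the start of a text baseline, is inside the rectangle. */
    boolean contains(double x, double y) {
        return x >= llx && x <= urx && y >= lly && y <= ury;
    }
}
```

With the values above, a point such as (50, 806) falls inside the rectangle, while (120, 806) lies to the right of it.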

If you want to extract all the links and the corresponding text from a document, you need to loop over all the page dictionaries, get the /Annots, check which annotations are of subtype /Link, get the action (/A), and the coordinates (/Rect).
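
The loop described above could look roughly like this with the iText 5 low-level API. This is a sketch under assumptions, not a complete solution: the class LinkLister and its nested Link holder are hypothetical names, and only link annotations with a URI action are collected (links using /Dest or other action types are skipped).

```java
import com.itextpdf.text.pdf.PdfArray;
import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfNumber;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfString;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of iText.
class LinkLister {

    /** One URI link: page number, target URI, and the four /Rect values. */
    static class Link {
        final int page;
        final String uri;
        final float[] rect;

        Link(int page, String uri, float[] rect) {
            this.page = page;
            this.uri = uri;
            this.rect = rect;
        }
    }

    static List<Link> listLinks(PdfReader reader) {
        List<Link> links = new ArrayList<Link>();
        for (int page = 1; page <= reader.getNumberOfPages(); page++) {
            PdfDictionary pageDict = reader.getPageN(page);
            PdfArray annots = pageDict.getAsArray(PdfName.ANNOTS);
            if (annots == null) continue;               // page has no annotations
            for (int i = 0; i < annots.size(); i++) {
                PdfDictionary annot = annots.getAsDict(i);
                if (annot == null) continue;
                // Keep only annotations of subtype /Link.
                if (!PdfName.LINK.equals(annot.getAsName(PdfName.SUBTYPE))) continue;
                PdfDictionary action = annot.getAsDict(PdfName.A);
                PdfArray rect = annot.getAsArray(PdfName.RECT);
                if (action == null || rect == null || rect.size() < 4) continue;
                // Keep only URI actions.
                if (!PdfName.URI.equals(action.getAsName(PdfName.S))) continue;
                PdfString uri = action.getAsString(PdfName.URI);
                if (uri == null) continue;
                float[] r = new float[4];
                boolean valid = true;
                for (int k = 0; k < 4; k++) {
                    PdfNumber n = rect.getAsNumber(k);
                    if (n == null) { valid = false; break; }
                    r[k] = n.floatValue();
                }
                if (valid) links.add(new Link(page, uri.toUnicodeString(), r));
            }
        }
        return links;
    }
}
```

Calling LinkLister.listLinks(new PdfReader("input.pdf")) (the file name is a placeholder) would return one entry per URI link, each carrying the page number and rectangle needed to look up the text.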

To know which text corresponds with the link, you need to use iText's text parser classes with a "region text" strategy and extract the text at the position defined by the /Rect entry.
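
A minimal sketch of that region-based extraction with the iText 5 parser classes follows. The class RegionText and the method textInRect are hypothetical names; the rect parameter is assumed to hold the four /Rect numbers of a link annotation. Note that the filter accepts or rejects text as it was drawn, chunk by chunk, so neighbouring text may be included when it shares a drawing operation with the link text or touches the rectangle's edge.

```java
import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.FilteredTextRenderListener;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.text.pdf.parser.RegionTextRenderFilter;
import com.itextpdf.text.pdf.parser.RenderFilter;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

import java.io.IOException;

// Hypothetical helper class, not part of iText.
class RegionText {

    /** Returns the text on the given page (1-based) whose baseline intersects rect = [x1 y1 x2 y2]. */
    static String textInRect(PdfReader reader, int page, float[] rect) throws IOException {
        // Normalize the corners, since a /Rect may list them in either order.
        Rectangle area = new Rectangle(
                Math.min(rect[0], rect[2]), Math.min(rect[1], rect[3]),
                Math.max(rect[0], rect[2]), Math.max(rect[1], rect[3]));
        // Only text inside the area is passed on to the wrapped strategy.
        RenderFilter filter = new RegionTextRenderFilter(area);
        TextExtractionStrategy strategy =
                new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
        return PdfTextExtractor.getTextFromPage(reader, page, strategy).trim();
    }
}
```

Combining the extracted text with the URI from the annotation's action then gives output of the form "XYZ (target)" for each link.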