Why can't I extract text added using a Type3 font correctly from a PDF?

Tags: text extraction, parsing PDF, Type3 font, iText 7, pdfCalligraph

I have a PDF file in Arabic whose text uses a Type3 font. When I extract the text using PDFBox, some characters are empty and their font is null. I want to know what the problem is.

@Override
protected void processTextPosition(TextPosition text) {
    String character = text.getCharacter();      // comes back empty for some glyphs
    String font = text.getFont().getBaseFont();  // comes back null
}
The stream produced with iText looks like this: ( dJ� v{d W�cG�)Tj

Why do I get the characters in this format?

The question marks in my stream are actually the control characters "SOH-STX-ETX-EOT", not a single character. The characters inside the PDF are shown as 'd' and 'J'!

Posted on StackOverflow on Feb 9, 2014 by Ayman Younis

A Type 3 font is a user-defined font. For instance: a user can define that the character 'P' corresponds with the symbol for "The Artist Formerly Known As Prince" which is a glyph, but not a letter from any known alphabet:

The TAFKAP symbol

A glyph in a Type 3 font is a series of lines and shapes. Unless the font also carries a mapping back to Unicode (for instance a /ToUnicode CMap), there's no way for a program such as iText or PDFBox to determine which character was meant, so it is only normal that you get a question mark.
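
To make this concrete, here is a minimal Java sketch of what a text extractor is up against. It is not the code of iText or PDFBox: the class Type3TextSketch, the method decode, and the sample mapping are hypothetical names introduced only for illustration. It assumes the string shown with Tj contains the codes 1 to 4 (the SOH-STX-ETX-EOT from the question) and that a code-to-Unicode table, comparable to the optional /ToUnicode entry of a PDF font, may or may not be present.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not part of iText or PDFBox.
class Type3TextSketch {

    // Turns the raw character codes of a shown string into text.
    // toUnicode plays the role of an optional code-to-Unicode table;
    // pass null when the font provides none.
    static String decode(byte[] codes, Map<Integer, String> toUnicode) {
        StringBuilder text = new StringBuilder();
        for (byte b : codes) {
            int code = b & 0xFF; // treat the byte as an unsigned character code
            String unicode = (toUnicode == null) ? null : toUnicode.get(code);
            // Without a mapping, the code only selects a drawing procedure,
            // so all we can emit is the replacement character U+FFFD.
            text.append(unicode != null ? unicode : "\uFFFD");
        }
        return text.toString();
    }

    public static void main(String[] args) {
        byte[] shown = {1, 2, 3, 4}; // SOH, STX, ETX, EOT

        // Case 1: no mapping available, every code stays unknown.
        System.out.println(decode(shown, null));

        // Case 2: an assumed mapping to four Arabic letters.
        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(1, "\u0633"); // ARABIC LETTER SEEN
        toUnicode.put(2, "\u0644"); // ARABIC LETTER LAM
        toUnicode.put(3, "\u0627"); // ARABIC LETTER ALEF
        toUnicode.put(4, "\u0645"); // ARABIC LETTER MEEM
        System.out.println(decode(shown, toUnicode));
    }
}
```

In the first call the codes cannot be resolved, which corresponds to the empty characters and question marks described in the question. Only when some table like the one in the second call exists can the codes be turned back into letters.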

A PDF usually contains Type 3 fonts for one of the following reasons:

  1. The font was used to introduce symbols that don't exist in any font.
  2. The font was used to obfuscate the content of the PDF so that its content can't be extracted.
  3. The PDF wasn't created in an elegant way.

If the Type 3 font was used for normal characters, you'll need to use OCR to convert the content to normal text.
