Why is the text I extract from an English PDF page garbled?

Tags: parsing PDF, text extraction, garbled text, fonts, iText 7

I'm trying to extract English text from a PDF and print it to the console. The extraction is done with iText's PdfTextExtractor class, but the text I get back is unintelligible. This is the code I use to extract the strings:

Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document,
    new FileOutputStream(OUTPUTFILE));
document.open();
PdfReader reader = new PdfReader(input);
int n = reader.getNumberOfPages();
PdfImportedPage page;
// Go through all pages
for (int i = 1; i <= n; i++) {
    String str = PdfTextExtractor.getTextFromPage(reader, i);
    System.out.println(str);
}
document.close();

The output I get on the console is unintelligible, even though the text in the PDF is in English:

t cotenn dna o mntoafinir yales r ni et h layhcsip Amgteu end y Retila m eysts e erefcern emsyst o f et h se. ru I n tioi, dnda etseh orpvedi eddda e ulav o se vdcie ollaw na s tiouquibu cacess o t latoutenxc e rpap dna t ilagid otten tofoi. nmirna ni soitaoli n mor f chea e. roth s iTh s i a cel ra csea ewerh " eth lweoh is ermo nath eth ms u fo sti rtasp ".

Can anybody tell me how to extract the English text so that it reads the same as in the source PDF?

Posted on StackOverflow on May 16, 2014 by codechefvaibhavkashyap

If you want the text to be ordered based on its position on the page, you need to introduce a specific strategy, such as the LocationTextExtractionStrategy:

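PdfDocument pdfDoc = new PdfDocument(new PdfReader(input)); // input: the source PDF, as in the question
for (int i = 1; i <= pdfDoc.getNumberOfPages(); i++) {
    String str = PdfTextExtractor.getTextFromPage(pdfDoc.getPage(i), new LocationTextExtractionStrategy());
    System.out.println(str);
}
pdfDoc.close();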

The LocationTextExtractionStrategy sometimes results in odd sentences, particularly when the letters 'dance' on the page, that is, when glyphs on the same line do not share the same baseline. In that case, you can try the SimpleTextExtractionStrategy, which returns the text in the order in which it appears in the content stream of the page.
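To make the difference between the two orderings more concrete, here is a small toy model in plain Java. It does not use iText: the Chunk class and the two helper methods are hypothetical names made up for this illustration, and the exact baseline comparison is a deliberate simplification, not iText's actual algorithm. It only sketches why text drawn out of order reads correctly once sorted by position, and why a slightly shifted baseline can upset a position-based ordering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy model only: Chunk and both helper methods are hypothetical, not iText classes.
final class ChunkOrderDemo {

    /** A piece of text together with the position where it starts on the page. */
    static final class Chunk {
        final String text;
        final float x;          // horizontal start position
        final float baselineY;  // vertical position of the baseline

        Chunk(String text, float x, float baselineY) {
            this.text = text;
            this.x = x;
            this.baselineY = baselineY;
        }
    }

    /** Joins the chunks in the order in which they were drawn (content stream order). */
    static String inStreamOrder(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            sb.append(c.text);
        }
        return sb.toString();
    }

    /** Joins the chunks top to bottom, then left to right, comparing baselines exactly. */
    static String inLocationOrder(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.baselineY)
                .thenComparingDouble(c -> c.x));
        return inStreamOrder(sorted);
    }

    public static void main(String[] args) {
        // Same baseline, but drawn out of reading order.
        List<Chunk> level = Arrays.asList(
                new Chunk("ole", 45f, 700f),
                new Chunk("the ", 10f, 700f),
                new Chunk("wh", 30f, 700f));
        System.out.println(inStreamOrder(level));     // olethe wh
        System.out.println(inLocationOrder(level));   // the whole

        // "Dancing" letters: one chunk sits half a unit higher than its neighbours.
        List<Chunk> dancing = Arrays.asList(
                new Chunk("ole", 45f, 700f),
                new Chunk("the ", 10f, 700f),
                new Chunk("wh", 30f, 700.5f));
        System.out.println(inLocationOrder(dancing)); // whthe ole
    }
}
```

In the first list, sorting by position repairs the reading order. In the second list, the raised chunk is treated as a separate, higher line and ends up in front of the rest, which is the kind of odd result described above. Whether a real PDF behaves like the first or the second case depends on how its content stream was written, so it can be worth trying both strategies on the same page.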
