How to use a text extraction strategy after applying a location extraction strategy?

Tags: parsing PDF, text extraction, extract text from location, iText 7

I used the following code to get the text at a particular location in a PDF:

Rectangle rect = new Rectangle(0, 0, 250, 250);
RenderFilter filter = new RegionTextRenderFilter(rect);
FontBasedTextExtractionStrategy strategy = new FontBasedTextExtractionStrategy();
strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter); // Throws an error.

I want to get only the bold text present in that location. Would it help to create a new class called FontBasedTextExtractionStrategy instead of using a simple TextExtractionStrategy?

Posted on StackOverflow on Jul 1, 2014 by Raka

Please take a look at the ParseCustom example for iText 7. In this example, we create a custom TextRegionEventFilter rather than a custom ITextExtractionStrategy:

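import com.itextpdf.kernel.font.PdfFont;
import com.itextpdf.kernel.geom.Rectangle;
import com.itextpdf.kernel.pdf.canvas.parser.EventType;
import com.itextpdf.kernel.pdf.canvas.parser.data.IEventData;
import com.itextpdf.kernel.pdf.canvas.parser.data.TextRenderInfo;
import com.itextpdf.kernel.pdf.canvas.parser.filter.TextRegionEventFilter;

class FontFilter extends TextRegionEventFilter {
    public FontFilter(Rectangle filterRect) {
        super(filterRect);
    }

    @Override
    public boolean accept(IEventData data, EventType type) {
        // super.accept() applies the rectangle check; without it the region would be ignored.
        if (type.equals(EventType.RENDER_TEXT) && super.accept(data, type)) {
            TextRenderInfo renderInfo = (TextRenderInfo) data;
            PdfFont font = renderInfo.getFont();
            if (null != font) {
                // Decide based on the PostScript font name.
                String fontName = font.getFontProgram().getFontNames().getFontName();
                return fontName.endsWith("Bold") || fontName.endsWith("Oblique");
            }
        }
        return false;
    }
}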

This filter accepts a piece of text only if its PostScript font name ends with Bold or Oblique.
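The endsWith() test above depends on how a particular PDF names its fonts. A name such as Helvetica-BoldOblique or Arial-BoldMT would not end with Bold, and subsetted fonts usually carry a six-letter prefix such as ABCDEF+. The following is a minimal sketch of a more tolerant check for bold only, under the assumption that the font name actually contains the word "Bold" (naming is a convention, not a guarantee). The class FontNameStyle and its methods are hypothetical helpers written for this illustration; they are not part of iText.

```java
import java.util.Locale;

/** Hypothetical helper, not part of iText: guesses boldness from a PostScript font name. */
final class FontNameStyle {
    private FontNameStyle() {
    }

    /** Removes a subset prefix of six uppercase letters followed by '+', e.g. "ABCDEF+". */
    static String stripSubsetPrefix(String fontName) {
        if (fontName.length() > 7 && fontName.charAt(6) == '+') {
            for (int i = 0; i < 6; i++) {
                char c = fontName.charAt(i);
                if (c < 'A' || c > 'Z') {
                    return fontName; // not a subset tag, leave the name untouched
                }
            }
            return fontName.substring(7);
        }
        return fontName;
    }

    /** Returns true if the name (without subset prefix) contains "bold" in any letter case. */
    static boolean looksBold(String fontName) {
        if (fontName == null) {
            return false; // assumption: a missing name is treated as "not bold"
        }
        return stripSubsetPrefix(fontName).toLowerCase(Locale.ROOT).contains("bold");
    }
}
```

Inside FontFilter.accept(), the return statement could then call FontNameStyle.looksBold(fontName) instead of the two endsWith() checks. The prefix is stripped first because the comparison ignores case, so a random subset tag could otherwise match by accident.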

This is how you use this filter:

public void parse(String src) throws IOException {
    PdfDocument pdfDoc = new PdfDocument(new PdfReader(src));
    // The region to extract from: x, y, width, height in points.
    Rectangle rect = new Rectangle(36, 750, 523, 56);
    FontFilter fontFilter = new FontFilter(rect);
    FilteredEventListener listener = new FilteredEventListener();
    // Only events accepted by fontFilter reach the extraction strategy.
    LocationTextExtractionStrategy extractionStrategy = listener.attachEventListener(new LocationTextExtractionStrategy(), fontFilter);
    // Process the first page; page numbers start at 1.
    new PdfCanvasProcessor(listener).processPageContent(pdfDoc.getPage(1));
    String actualText = extractionStrategy.getResultantText();
    System.out.println(actualText);
    pdfDoc.close();
}

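As you can see, we attach a LocationTextExtractionStrategy to a FilteredEventListener together with our self-made font filter, so the strategy only receives the text that the filter accepts. To extract the text, we pass the listener to a PdfCanvasProcessor and call processPageContent().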
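One detail that is easy to miss when moving from the code in the question to the answer: the iText 7 Rectangle constructor takes x, y, width and height, whereas the iText 5 Rectangle used in the question takes the lower-left and upper-right corners. The sketch below shows the conversion with a small stand-in class. Region is a hypothetical type written for this illustration so that the arithmetic can be run without iText; it is not an iText class.

```java
/** Hypothetical stand-in that mirrors the (x, y, width, height) argument order. */
final class Region {
    final float x;
    final float y;
    final float width;
    final float height;

    Region(float x, float y, float width, float height) {
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    /** Builds a region from lower-left (llx, lly) and upper-right (urx, ury) corners. */
    static Region fromCorners(float llx, float lly, float urx, float ury) {
        return new Region(llx, lly, urx - llx, ury - lly);
    }

    /** X coordinate of the right edge. */
    float right() {
        return x + width;
    }

    /** Y coordinate of the top edge. */
    float top() {
        return y + height;
    }
}
```

With these definitions, the rectangle (36, 750, 523, 56) from the answer spans x from 36 to 559 and y from 750 to 806, which on a page whose origin is the bottom-left corner is a strip near the top of an A4 page. A rectangle known by its corners can be converted with Region.fromCorners(36, 750, 559, 806) before passing the four values to the iText 7 constructor.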

A separate answer explains how to do this in iText 5.