From a code perspective, using pdf2Data is very simple. All of the code you need is shown below:
// build a new Pdf2DataExtractor based on a template
Pdf2DataExtractor extractor = new Pdf2DataExtractor(template);

// sampleFile: the file you wish to process
// targetPDF: the path where you wish to store the annotated PDF (for visual inspection)
// targetXML: the path where you wish to store the extracted data (in XML format)
extractor.parsePdf(sampleFile, targetPDF, targetXML);
The part that requires manual intervention is the definition of a template, which is a PDF that contains the rules for how text should be extracted from all similar PDFs. You can define a template in Adobe Reader, using comments, or through the online demo, which is located here: DEMO URL.
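Once parsePdf has written the extracted data to the targetXML path, that file can be read with any XML parser. The sketch below is one possible way to do so using only the DOM parser that ships with the JDK. It is not part of pdf2Data: the class ExtractedXmlReader and its method readLeafValues are hypothetical names introduced for illustration. Because the element names in the output depend on the template, the sketch makes no assumption about them and simply lists every element that has no child elements.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper (not part of pdf2Data): lists the leaf elements of an XML document.
final class ExtractedXmlReader {

    // Returns one "name=text" entry, in document order, for each element without child elements.
    static List<String> readLeafValues(InputStream xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document = builder.parse(xml);
        List<String> values = new ArrayList<>();
        collect(document.getDocumentElement(), values);
        return values;
    }

    // Walks the tree depth-first and records only elements that contain no other elements.
    private static void collect(Element element, List<String> values) {
        boolean hasChildElement = false;
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasChildElement = true;
                collect((Element) child, values);
            }
        }
        if (!hasChildElement) {
            values.add(element.getTagName() + "=" + element.getTextContent().trim());
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0]: the path that was passed to parsePdf as targetXML
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            for (String value : readLeafValues(in)) {
                System.out.println(value);
            }
        }
    }
}
```

Listing the leaf elements this way is a quick check that a template extracts what you expect, alongside the visual inspection of the annotated PDF. Once the structure of the output is known, the generic walk can be replaced by lookups for specific element names.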