2. Creating PDF/A files with iText

Before we start creating invoices, let's find out how to create a PDF document using iText, more specifically: how to create a PDF document in the PDF/A-3 format.

Creating a regular PDF file

Creating a PDF file with iText 7 is very easy. It requires four simple steps:

  1. Create a PdfDocument that writes PDF to a PdfWriter (low-level functionality),

  2. Create a Document to which you can add simple building blocks (high-level functionality),

  3. Add content to the Document in the form of building blocks,

  4. Close the Document.

We'll implement these four steps to create a PDF file that looks like the document shown in Figure 2.1:

Figure 2.1: Quick brown fox jumps over the lazy dog example: regular PDF

This PDF was created using the SimplePdf.java example:

  1. public void createPdf(String dest) throws IOException {
  2. // step 1
  3. PdfDocument pdfDocument = new PdfDocument(new PdfWriter(dest));
  4. pdfDocument.setDefaultPageSize(PageSize.A4.rotate());
  5. // step 2
  6. Document document = new Document(pdfDocument);
  7. // step 3
  8. document.add(
  9. new Paragraph()
  10. .setFontSize(20)
  11. .add(new Text("The quick brown "))
  12. .add(new Image(ImageDataFactory.create(FOX)))
  13. .add(new Text(" jumps over the lazy "))
  14. .add(new Image(ImageDataFactory.create(DOG))));
  15. // step 4
  16. document.close();
  17. }

We can easily discover the four steps in iText's PDF creation process in this code snippet:

  1. In lines 3-4, we create a PdfDocument object for a PDF document with page size A4 in landscape orientation. By default, an A4 page is in portrait orientation; the rotate() method changes it to landscape.

  2. In line 6, we create an instance of the Document class. PdfDocument is a low-level class that can be used for low-level operations; Document is a high-level class to which we can add high-level objects.

  3. In lines 8 to 14, we compose content using high-level objects such as Paragraph, Text and Image. We add this content to the document.

  4. In line 16, we close the document.
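To see what rotate() means in terms of numbers, here is a small sketch in plain Java that doesn't use iText at all. It assumes that A4 measures 210 by 297 mm and that PDF user units are points of 1/72 inch; the class PageSizeSketch and its methods are hypothetical names introduced for this illustration only.

```java
/** Illustration only: a hypothetical helper, not part of iText. */
final class PageSizeSketch {

    static final double POINTS_PER_INCH = 72.0;
    static final double MM_PER_INCH = 25.4;

    /** Converts millimeters to PDF points, rounded to the nearest whole point. */
    static long mmToPoints(double mm) {
        return Math.round(mm / MM_PER_INCH * POINTS_PER_INCH);
    }

    /** Swaps width and height, turning portrait into landscape (and back). */
    static long[] rotate(long[] size) {
        return new long[] { size[1], size[0] };
    }

    public static void main(String[] args) {
        // Assumption: A4 is 210 mm wide and 297 mm high in portrait orientation.
        long[] portrait = { mmToPoints(210), mmToPoints(297) };
        long[] landscape = rotate(portrait);
        System.out.println("portrait:  " + portrait[0] + " x " + portrait[1]);
        System.out.println("landscape: " + landscape[0] + " x " + landscape[1]);
    }
}
```

In the SimplePdf example, all of this is hidden behind PageSize.A4.rotate(); the sketch only shows that landscape is a matter of swapping the width and the height.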

The SimplePdf example creates a regular PDF file. In Figure 2.2, we take a look at the Document Properties of the file. At the bottom of the Description tab, we see Tagged PDF: No.

Figure 2.2: Document properties regular PDF

When we look at this document, we see a sentence in which the words "fox" and "dog" are replaced by images of a fox and a dog. Being human, we'll read "The quick brown fox jumps over the lazy dog". A machine, however, will only read "The quick brown jumps over the lazy" and won't know that the images of the fox and the dog are meant to be part of the sentence. This is a problem when the document is accessed by people who are visually impaired. They depend on assistive technology (AT) such as the "Read Out Loud" functionality in Adobe Reader. AT will only be able to read the full sentence when the file is a properly Tagged PDF.

We'll partly fix this problem by making some small changes to our code in the next couple of examples.
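The difference between what a human sees and what a machine reads can be mimicked with a toy model. The sketch below is plain Java and has nothing to do with iText's internals: the classes ReadAloudSketch and Inline are hypothetical, and the readAloud() method is only an assumption about how a very simple reader might behave. It concatenates text, and it can only "pronounce" an image if alternate text is available.

```java
import java.util.Arrays;
import java.util.List;

/** Illustration only: a toy model of a reader, not iText and not real assistive technology. */
final class ReadAloudSketch {

    /** A piece of inline content: either a run of text or an image with optional alternate text. */
    static final class Inline {
        final String text;     // null for images
        final String altText;  // null for text, and for images without a description

        private Inline(String text, String altText) {
            this.text = text;
            this.altText = altText;
        }

        static Inline text(String value) { return new Inline(value, null); }
        static Inline image(String altText) { return new Inline(null, altText); }
    }

    /** Returns what our toy reader can make of the paragraph. */
    static String readAloud(List<Inline> paragraph) {
        StringBuilder spoken = new StringBuilder();
        for (Inline part : paragraph) {
            if (part.text != null) {
                spoken.append(part.text);
            } else if (part.altText != null) {
                spoken.append(part.altText);
            }
            // An image without alternate text contributes nothing.
        }
        return spoken.toString();
    }

    public static void main(String[] args) {
        List<Inline> withoutAlt = Arrays.asList(
                Inline.text("The quick brown "), Inline.image(null),
                Inline.text(" jumps over the lazy "), Inline.image(null));
        List<Inline> withAlt = Arrays.asList(
                Inline.text("The quick brown "), Inline.image("fox"),
                Inline.text(" jumps over the lazy "), Inline.image("dog"));
        System.out.println(readAloud(withoutAlt));
        System.out.println(readAloud(withAlt));
    }
}
```

In the first paragraph, the reader has no way of knowing what the images stand for; in the second, the alternate text fills the gaps. A real PDF needs both the structure and the alternate text before AT can do the same.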

Creating a Tagged PDF file

Let's take a look at the TaggedPdf.java example. It is identical to SimplePdf.java, except for step 1:

  1. // step 1
  2. PdfDocument pdfDocument = new PdfDocument(new PdfWriter(dest));
  3. pdfDocument.setDefaultPageSize(PageSize.A4.rotate());
  4. pdfDocument.setTagged();

We add a single line: pdfDocument.setTagged(). As we are using high-level objects such as Paragraph, Text, and Image, iText will mark that content as structure elements. A Paragraph will be marked as <P>, a Text object as <Span>, and an Image as <Figure>. This adds semantic information. The resulting document, shown in Figure 2.3, looks identical to the one we had before, but we can now see the structure of the sentence in the Tags panel when opening the PDF in Acrobat Reader.

Figure 2.3: Tags in a Tagged PDF

This doesn't make the document accessible (yet). A machine still can't replace the images with a meaningful and accurate word, but it can now interpret the structure of the paragraph (which wasn't the case before). Now that we have tagged the document, a PDF parser can read a structure tree that indicates that the images are part of the paragraph.

We can make the content accessible by providing alternate text for the images. But before we do so, let's take a close look at the PDF/A format.

Creating a PDF/A-3 level B file

A PDF/A file needs to be self-contained. In the previous examples, we didn't specify a font. As a result, the default font Helvetica was used. Helvetica is one of the 14 standard Type 1 fonts that are assumed to be known by every PDF viewer. iText ships with 14 Adobe Font Metrics (AFM) files that contain the font metrics that are needed to calculate the width of words and sentences. iText doesn't ship with the full fonts, so whenever Helvetica is used, that font isn't embedded in the PDF.

  • PDF/A requires that all fonts are embedded, so if we want to create a PDF/A-3 file, we'll have to provide a font file.

  • PDF/A also requires an International Color Consortium (ICC) profile.

We'll introduce two constants in the PdfA3b example: one with the path to a font file and one with the path to an ICC color profile:

public static final String FONT = "resources/fonts/OpenSans-Regular.ttf";
public static final String ICC = "resources/color/sRGB_CS_profile.icm";

We'll need these files when we create the PDF.

  • OpenSans-Regular.ttf is an OpenType font program with TrueType outlines that we can embed.

  • sRGB_CS_profile.icm is an ICC profile used to define an RGB color space.
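Before handing an ICC file to iText, it can be useful to check that it really describes an RGB color space. The sketch below uses only the java.awt.color classes that ship with the JDK, not iText. The class IccProfileCheck and its methods are hypothetical names, and when no path is passed, the main() method falls back to the JDK's built-in sRGB profile instead of the sRGB_CS_profile.icm file from the example.

```java
import java.awt.color.ColorSpace;
import java.awt.color.ICC_Profile;
import java.io.IOException;

/** Illustration only: a hypothetical sanity check that doesn't depend on iText. */
final class IccProfileCheck {

    /** Returns true if the profile describes a three-component RGB color space. */
    static boolean isRgb(ICC_Profile profile) {
        return profile.getColorSpaceType() == ColorSpace.TYPE_RGB
                && profile.getNumComponents() == 3;
    }

    /** Loads an ICC profile from a file and checks whether it is an RGB profile. */
    static boolean fileIsRgb(String path) throws IOException {
        return isRgb(ICC_Profile.getInstance(path));
    }

    public static void main(String[] args) throws IOException {
        // Assumption: without an argument, we inspect the JDK's built-in sRGB profile.
        ICC_Profile profile = args.length > 0
                ? ICC_Profile.getInstance(args[0])
                : ICC_Profile.getInstance(ColorSpace.CS_sRGB);
        System.out.println("RGB profile: " + isRgb(profile));
    }
}
```

You could run this with the value of the ICC constant as its argument. It is not a substitute for PDF/A validation; it only catches the mistake of passing, for instance, a grayscale or CMYK profile where an RGB profile is expected.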

Let's take a look at the createPdf() method:

  1. public void createPdf(String dest) throws IOException {
  2. // step 1
  3. PdfADocument pdfDocument = new PdfADocument(
  4. new PdfWriter(dest), PdfAConformanceLevel.PDF_A_3B,
  5. new PdfOutputIntent("Custom", "", "http://www.color.org",
  6. "sRGB IEC61966-2.1", new FileInputStream(ICC)));
  7. pdfDocument.setDefaultPageSize(PageSize.A4.rotate());
  8. // step 2
  9. Document document = new Document(pdfDocument);
  10. // step 3
  11. PdfFont font = PdfFontFactory.createFont(FONT, true);
  12. document.add(new Paragraph().setFont(font).setFontSize(20)
  13. .add(new Text("The quick brown "))
  14. .add(new Image(ImageDataFactory.create(FOX)))
  15. .add(new Text(" jumps over the lazy "))
  16. .add(new Image(ImageDataFactory.create(DOG))));
  17. // step 4
  18. document.close();
  19. }

What's different in this code snippet when compared to our first example?

  • We use a different PDF document object. Instead of creating an instance of PdfDocument, we now use a PdfADocument (line 3) and we define the conformance level PDF_A_3B (line 4).

  • A stream is created containing metadata in the Extensible Metadata Platform (XMP) format. You don't see this in the code, but iText will add this metadata automatically since XMP is a requirement for PDF/A documents.

  • We define an output intent. This is where we need the ICC profile (lines 5-6).

  • We embed the font. We use the PdfFontFactory class to get a PdfFont object, making sure that we set the embedded flag to true (line 11). We call the setFont() method on the Paragraph instance (line 12) so that the embedded OpenSans-Regular font is used instead of the default Helvetica font.

As a result, we have the same document as before, but now it validates as a PDF/A-3B document. See Figure 2.4.

Figure 2.4: a PDF/A-3B example

Making this document accessible requires a handful of extra changes.

Creating a PDF/A-3 level A file

We'll conclude this chapter with the PdfA3a example:

  1. public void createPdf(String dest) throws IOException {
  2. // step 1
  3. PdfADocument pdfDocument = new PdfADocument(
  4. new PdfWriter(dest), PdfAConformanceLevel.PDF_A_3A,
  5. new PdfOutputIntent("Custom", "", "http://www.color.org",
  6. "sRGB IEC61966-2.1", new FileInputStream(ICC)));
  7. pdfDocument.setDefaultPageSize(PageSize.A4.rotate());
  8. pdfDocument.setTagged();
  9. pdfDocument.getDocumentInfo().setTitle("The fox and the dog");
  10. pdfDocument.getCatalog().setViewerPreferences(
  11. new PdfViewerPreferences().setDisplayDocTitle(true));
  12. pdfDocument.getCatalog().setLang(new PdfString("en-US"));
  13. // step 2
  14. Document document = new Document(pdfDocument);
  15. // step 3
  16. PdfFont font = PdfFontFactory.createFont(FONT, true);
  17. Image fox = new Image(ImageDataFactory.create(FOX));
  18. fox.getAccessibilityProperties().setAlternateDescription("fox");
  19. Image dog = new Image(ImageDataFactory.create(DOG));
  20. dog.getAccessibilityProperties().setAlternateDescription("dog");
  21. document.add(
  22. new Paragraph()
  23. .setFont(font)
  24. .setFontSize(20)
  25. .add(new Text("The quick brown "))
  26. .add(fox)
  27. .add(new Text(" jumps over the lazy "))
  28. .add(dog));
  29. // step 4
  30. document.close();
  31. }

We've applied the following changes when compared to the previous example:

  • We change the conformance level to PDF_A_3A. That's just a matter of changing one parameter in the PdfADocument constructor (line 4).

  • We create a Tagged PDF. On line 8, we recognize the setTagged() method. This will create <P>, <Span> and <Figure> tags. (This is a PDF/A-3A and a PDF/UA requirement.)

  • We add a title to the metadata. In line 9, we set the title to "The fox and the dog". (This is a PDF/UA requirement, not a PDF/A requirement.)

  • We make sure the document title is displayed. We do this by setting the viewer preference DisplayDocTitle in lines 10-11. (This is a PDF/UA requirement, not a PDF/A requirement.)

  • We define the language used in the document. On line 12, we tell the document that its contents are in American English. (This is a PDF/UA requirement, not a PDF/A requirement.)

  • We provide alternate text for images. On lines 18 and 20, we tell the images that they represent a "fox" and a "dog" by defining alternate text. (This is a PDF/UA requirement.)
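The "en-US" value passed to setLang() is a language tag consisting of a language subtag and a region subtag. As a side note, the JDK can parse such tags, which is handy if the language of an invoice comes from user input or a database. The sketch below is plain Java, not iText; the class LanguageTagSketch and the isWellFormed() method are hypothetical names, and the check only tells us whether a tag is syntactically well-formed, not whether a PDF viewer or AT will know the language.

```java
import java.util.IllformedLocaleException;
import java.util.Locale;

/** Illustration only: a hypothetical helper for checking language tags, not part of iText. */
final class LanguageTagSketch {

    /** Returns true if the JDK accepts the value as a well-formed language tag. */
    static boolean isWellFormed(String tag) {
        try {
            new Locale.Builder().setLanguageTag(tag);
            return true;
        } catch (IllformedLocaleException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Locale locale = Locale.forLanguageTag("en-US");
        System.out.println("language: " + locale.getLanguage());
        System.out.println("region:   " + locale.getCountry());
        System.out.println("well-formed \"en-US\": " + isWellFormed("en-US"));
        System.out.println("well-formed \"american english\": " + isWellFormed("american english"));
    }
}
```

A check like this could be done before the value is wrapped in a PdfString and passed to setLang().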

When we look at the document shown in Figure 2.5, we see that the document has structure and that the file knows that the image of a dog represents the dog. Just hover over the image and you'll see the alternate text appear as a tooltip.

Figure 2.5: a PDF/A-3A example

The document is now accessible. Mind the title that is shown in the title bar, and the tooltip "dog" that is shown when you hover over the image of the dog. When the document is read using AT, it will now read "The quick brown fox jumps over the lazy dog."

If you look at the left panel, you can see that the document is assumed to be a PDF/A-3A document as well as a PDF/UA document, because we also introduced some features that are required by the PDF/UA standard. Adobe Acrobat can't verify the compliance with these standards. PDF/UA in general can't be verified programmatically because it takes a human to check whether a document is properly tagged. Incidentally, the PDF/A-3 files we've created are also compliant with the PDF/A-2 standard, because we didn't add any attachments yet.

Once we create PDFs that represent invoices, we'll want to add an XML file that conforms to the ZUGFeRD model. Before we can do so, we need a database with invoice data.