2. Creating PDF/A files with iText

Before we start creating invoices, let's find out how to create a PDF document using iText, more specifically: how to create a PDF document in the PDF/A-3 format.

Creating a regular PDF file

Creating a PDF file with iText is very easy. It requires five simple steps:

  1. Create a Document,

  2. Create a PdfWriter that listens to the Document and writes to an OutputStream,

  3. Open the Document,

  4. Add content to the Document,

  5. Close the Document.

We'll implement these five steps to create a PDF file that looks like the document shown in Figure 2.1:

Figure 2.1: Quick brown fox jumps over the lazy dog example: regular PDF

This PDF was created using the SimplePdf.java example:

  1. public void createPdf(String dest) throws IOException, DocumentException {
  2.     // step 1
  3.     Document document = new Document(PageSize.A4.rotate());
  4.     // step 2
  5.     PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(dest));
  6.     writer.setPdfVersion(PdfWriter.VERSION_1_7);
  7.     // step 3
  8.     document.open();
  9.     // step 4
  10.     Paragraph p = new Paragraph();
  11.     p.setFont(new Font(Font.FontFamily.HELVETICA, 20));
  12.     Chunk c = new Chunk("The quick brown ");
  13.     p.add(c);
  14.     Image i = Image.getInstance(FOX);
  15.     c = new Chunk(i, 0, -24);
  16.     p.add(c);
  17.     c = new Chunk(" jumps over the lazy ");
  18.     p.add(c);
  19.     i = Image.getInstance(DOG);
  20.     c = new Chunk(i, 0, -24);
  21.     p.add(c);
  22.     document.add(p);
  23.     // step 5
  24.     document.close();
  25. }

Let's see if we can discover the five steps in iText's PDF creation process in this code snippet:

  1. In line 3, we create a Document object for a PDF with page size A4 in landscape format. By default, the A4 page is in portrait; the rotate() method changes it into landscape.

  2. In lines 5 and 6, we create an instance of the PdfWriter class. This instance will listen to the document object and write PDF syntax to a FileOutputStream. By default, iText creates PDFs that identify themselves as PDF version 1.4 documents. In line 6, we change this to PDF 1.7.

  3. In line 8, we open the document.

  4. In lines 10 to 22, we compose content using high-level objects such as Paragraph, Chunk and Image. We add this content to the document in line 22.

  5. In line 24, we close the document.
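
To check the effect of setPdfVersion() without opening a viewer, we can look at the first bytes of the generated file: a PDF file starts with a header of the form %PDF- followed by the version number. What follows is a minimal sketch that uses only the JDK. The PdfHeaderVersion class and its of() method are names made up for this illustration; they aren't part of iText.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of iText): reads the version from a PDF header. */
final class PdfHeaderVersion {

    /**
     * Returns the version declared in a "%PDF-x.y" header, for example "1.7",
     * or null if the bytes don't start with such a header.
     */
    static String of(byte[] firstBytes) {
        // The header is plain ASCII, so decoding the first eight bytes is safe.
        int length = Math.min(firstBytes.length, 8);
        String head = new String(firstBytes, 0, length, StandardCharsets.US_ASCII);
        boolean wellFormed = head.length() == 8
                && head.startsWith("%PDF-")
                && Character.isDigit(head.charAt(5))
                && head.charAt(6) == '.'
                && Character.isDigit(head.charAt(7));
        return wellFormed ? head.substring(5) : null;
    }

    public static void main(String[] args) {
        // Sample bytes standing in for the start of a generated file.
        byte[] sample = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(of(sample)); // prints 1.7
    }
}
```

For a real file, we could pass the result of Files.readAllBytes() for the path stored in dest. Keep in mind that a PDF can also declare a version in its document catalog, so the header is only a first indication.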

This creates a regular PDF file. In Figure 2.2, we take a look at the Document Properties of the file. At the bottom of the Description tab, we see Tagged PDF: No.

Figure 2.2: Document properties regular PDF

When we look at this document, we see a sentence in which the words "fox" and "dog" are replaced by images of a fox and a dog. Being human, we'll read "The quick brown fox jumps over the lazy dog". A machine, however, will only read "The quick brown jumps over the lazy" and won't know that the images of the fox and the dog are meant to be part of the sentence. This is a problem when the document is accessed by people who are visually impaired. They depend on screen readers, or on functionality such as "Read out loud" in Adobe Reader. A screen reader will only be able to read the full sentence when the PDF is a properly Tagged PDF.

We'll fix this problem partly by making some small changes to our code in the next couple of examples.

Creating a Tagged PDF file

Let's take a look at the TaggedPdf.java example. It is identical to SimplePdf.java, except for step 2:

  1. PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(dest));
  2. writer.setPdfVersion(PdfWriter.VERSION_1_7);
  4. //Make document tagged
  5. writer.setTagged();
  6. //==========

We add a single line: writer.setTagged(). As we are using high-level objects such as Paragraph, Chunk, and Image, iText will mark that content as structured elements. A Paragraph will be marked as <P>, a Chunk as <Span> and an Image as <Figure>. This adds semantic information. The resulting document, shown in Figure 2.3, looks identical to the one we had before, but we can now see the structure of the sentence in the Tags panel when opening the PDF in Acrobat Reader.

Figure 2.3: Tags in a Tagged PDF

This doesn't make the document accessible (yet). It's still not possible to replace the images by a meaningful and accurate word, but the structure of the paragraph can now be interpreted by a machine (which wasn't the case before), because a PDF parser can read a structure tree that indicates that the images are part of the paragraph.

We can make the content accessible by providing alternate text for the images. But before we do so, let's take a close look at the PDF/A format.

Creating a PDF/A-3 level B file

A PDF/A file needs to be self-contained. In the previous examples, we didn't provide a font file; we only asked for Helvetica, which is also iText's default font. Helvetica is one of the 14 standard Type 1 fonts that are assumed to be known by every PDF viewer. iText ships with 14 Adobe Font Metrics (AFM) files that contain the font metrics that are needed to calculate the width of words and sentences. iText doesn't ship with the full fonts, so whenever Helvetica is used, that font isn't embedded in the PDF.

PDF/A requires that all fonts are embedded, so if we want to create a PDF/A-3 file, we'll have to provide a font file. PDF/A also requires an International Color Consortium (ICC) profile.

We'll introduce two constants with the paths to an ICC color profile and a font file in the PdfA3b example:

  1. /** A path to a color profile. */
  2. public static final String ICC = "resources/data/sRGB_CS_profile.icm";
  3. /** A font that will be embedded. */
  4. public static final String FONT = "resources/fonts/FreeSans.ttf";
  • sRGB_CS_profile.icm is an ICC profile used to define an RGB color space.

  • FreeSans.ttf is an OpenType font with TrueType outlines that looks very much like Helvetica.
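
Before handing a file to iText as a color profile, it can be useful to check that it really is an ICC profile for an RGB color space. In the 128-byte header defined by the ICC specification, the data color space signature sits at byte offset 16 ("RGB " for RGB data) and the file signature "acsp" at offset 36. The sketch below tests just those two fields using only the JDK. IccHeaderCheck is a hypothetical helper written for this illustration, and the header in main() is synthetic, not the content of sRGB_CS_profile.icm.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of iText): inspects two ICC header fields. */
final class IccHeaderCheck {

    /** Length of the fixed ICC profile header in bytes. */
    static final int HEADER_LENGTH = 128;

    /** Reads a four-character signature at the given byte offset. */
    static String signatureAt(byte[] data, int offset) {
        return new String(data, offset, 4, StandardCharsets.US_ASCII);
    }

    /** Returns true if the bytes start with an ICC header that declares RGB data. */
    static boolean looksLikeRgbProfile(byte[] data) {
        if (data == null || data.length < HEADER_LENGTH) {
            return false;
        }
        boolean isIccProfile = "acsp".equals(signatureAt(data, 36));
        boolean isRgb = "RGB ".equals(signatureAt(data, 16));
        return isIccProfile && isRgb;
    }

    public static void main(String[] args) {
        // Synthetic header: only the two fields we test are filled in.
        byte[] header = new byte[HEADER_LENGTH];
        System.arraycopy("RGB ".getBytes(StandardCharsets.US_ASCII), 0, header, 16, 4);
        System.arraycopy("acsp".getBytes(StandardCharsets.US_ASCII), 0, header, 36, 4);
        System.out.println(looksLikeRgbProfile(header)); // prints true
    }
}
```

For a real profile, we could pass the bytes returned by Files.readAllBytes() for the ICC path. This is only a plausibility check; it doesn't validate the rest of the profile.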

We'll need these files when we create the PDF:

  1. public void createPdf(String dest) throws IOException, DocumentException {
  2.     // step 1
  3.     Document document = new Document(PageSize.A4.rotate());
  4.     // step 2
  5.     //PDF/A-3b
  6.     //Create PdfAWriter with the required conformance level
  7.     PdfAWriter writer = PdfAWriter.getInstance(
  8.         document, new FileOutputStream(dest), PdfAConformanceLevel.PDF_A_3B);
  9.     writer.setPdfVersion(PdfWriter.VERSION_1_7);
  10.     //Create XMP metadata
  11.     writer.createXmpMetadata();
  12.     //====================
  13.     // step 3
  14.     document.open();
  15.     // step 4
  16.     //PDF/A-3b
  17.     //Set output intents
  18.     ICC_Profile icc = ICC_Profile.getInstance(new FileInputStream(ICC));
  19.     writer.setOutputIntents(
  20.         "Custom", "", "http://www.color.org", "sRGB IEC61966-2.1", icc);
  21.     //===================
  22.     Paragraph p = new Paragraph();
  23.     //PDF/A-3b
  24.     //Embed font
  25.     p.setFont(FontFactory.getFont(
  26.         FONT, BaseFont.WINANSI, BaseFont.EMBEDDED, 20));
  27.     //=============
  28.     Chunk c = new Chunk("The quick brown ");
  29.     p.add(c);
  30.     Image i = Image.getInstance(FOX);
  31.     c = new Chunk(i, 0, -24);
  32.     p.add(c);
  33.     c = new Chunk(" jumps over the lazy ");
  34.     p.add(c);
  35.     i = Image.getInstance(DOG);
  36.     c = new Chunk(i, 0, -24);
  37.     p.add(c);
  38.     document.add(p);
  39.     // step 5
  40.     document.close();
  41. }

What's different in this code snippet when compared to our first example?

  • We use a different writer. Instead of creating an instance of PdfWriter, we now use a PdfAWriter (line 7) and we define the conformance level PDF_A_3B (line 8).

  • We add XMP metadata. In this case, the createXmpMetadata() method creates an XML packet with a minimal amount of metadata (line 11).

  • We define output intents. This is where we need the ICC profile. We load it in line 18; we add it in lines 19 and 20.

  • We embed the font. We use the FontFactory object to get a Font object, making sure that we use BaseFont.EMBEDDED. We use the setFont() method on the Paragraph instance so that the default font for all the Chunks added after the font is set changes to this embedded font (lines 25 and 26).
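
A PDF/A file announces its conformance in the XMP metadata through two properties, pdfaid:part and pdfaid:conformance, defined in the namespace http://www.aiim.org/pdfa/ns/id/. For our example, we'd expect the values 3 and B. The following sketch extracts both values from an XMP string with regular expressions, using only the JDK. PdfAIdentification is a hypothetical helper, the sample XMP in main() is hand-written rather than actual iText output, and a regular expression is no substitute for a real XML parser or a PDF/A validator.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of iText): reads the PDF/A identification from XMP. */
final class PdfAIdentification {

    // Both XMP serializations are accepted:
    // element form <pdfaid:part>3</pdfaid:part> and attribute form pdfaid:part="3".
    private static final Pattern PART =
            Pattern.compile("pdfaid:part\\s*(?:>|=\\s*[\"'])\\s*(\\d+)");
    private static final Pattern CONFORMANCE =
            Pattern.compile("pdfaid:conformance\\s*(?:>|=\\s*[\"'])\\s*([A-Za-z])");

    /** Returns a label such as "PDF/A-3B", or null if either property is missing. */
    static String describe(String xmp) {
        Matcher part = PART.matcher(xmp);
        Matcher level = CONFORMANCE.matcher(xmp);
        if (part.find() && level.find()) {
            return "PDF/A-" + part.group(1) + level.group(1).toUpperCase(Locale.ROOT);
        }
        return null;
    }

    public static void main(String[] args) {
        // Hand-written sample fragment, not copied from a generated file.
        String xmp = "<rdf:Description xmlns:pdfaid=\"http://www.aiim.org/pdfa/ns/id/\">"
                + "<pdfaid:part>3</pdfaid:part>"
                + "<pdfaid:conformance>B</pdfaid:conformance>"
                + "</rdf:Description>";
        System.out.println(describe(xmp)); // prints PDF/A-3B
    }
}
```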

As a result, we have the same document as before, but now it validates as a PDF/A-3B document. See Figure 2.4.

Figure 2.4: a PDF/A-3B example

Making this document accessible requires four more changes.

Creating a PDF/A-3 level A file

We'll conclude this chapter with the PdfA3a example:

  1. public void createPdf(String dest) throws IOException, DocumentException {
  2.     Document document = new Document(PageSize.A4.rotate());
  3.     //PDF/A-3a
  4.     //Create PdfAWriter with the required conformance level
  5.     PdfAWriter writer = PdfAWriter.getInstance(
  6.         document, new FileOutputStream(dest), PdfAConformanceLevel.PDF_A_3A);
  7.     writer.setPdfVersion(PdfWriter.VERSION_1_7);
  8.     //====================
  10.     //Make document tagged
  11.     writer.setTagged();
  12.     //===============
  13.     //PDF/UA
  14.     //Set document metadata
  15.     writer.setViewerPreferences(PdfWriter.DisplayDocTitle);
  16.     document.addLanguage("en-US");
  17.     document.addTitle("Some title");
  18.     writer.createXmpMetadata();
  19.     //=====================
  20.     document.open();
  21.     //PDF/A-3b
  22.     //Set output intents
  23.     ICC_Profile icc = ICC_Profile.getInstance(new FileInputStream(ICC));
  24.     writer.setOutputIntents(
  25.         "Custom", "", "http://www.color.org", "sRGB IEC61966-2.1", icc);
  26.     //===================
  27.     Paragraph p = new Paragraph();
  28.     //PDF/UA
  29.     //Embed font
  30.     p.setFont(FontFactory.getFont(
  31.         FONT, BaseFont.WINANSI, BaseFont.EMBEDDED, 20));
  32.     //==================
  33.     Chunk c = new Chunk("The quick brown ");
  34.     p.add(c);
  35.     Image i = Image.getInstance(FOX);
  36.     c = new Chunk(i, 0, -24);
  37.     //PDF/UA
  38.     //Set alt text
  39.     c.setAccessibleAttribute(PdfName.ALT, new PdfString("Fox"));
  40.     //==============
  41.     p.add(c);
  42.     p.add(new Chunk(" jumps over the lazy "));
  43.     i = Image.getInstance(DOG);
  44.     c = new Chunk(i, 0, -24);
  45.     //PDF/UA
  46.     //Set alt text
  47.     c.setAccessibleAttribute(PdfName.ALT, new PdfString("Dog"));
  48.     //==================
  49.     p.add(c);
  50.     document.add(p);
  51.     document.close();
  52. }

These are the four changes when compared to the previous example:

  • We create a Tagged PDF. On line 11, we recognize the setTagged() method. This will create <P>, <Span> and <Figure> tags.

  • We make sure the document title is displayed. We do this by setting the viewer preference DisplayDocTitle in line 15.

  • We add more metadata. On line 16, we define the main language of the document (en-US); on line 17, we provide a title.

  • We provide alternate text for the images. On lines 39 and 47, we define Alt text stating that the images represent a "Fox" and a "Dog".

When we look at the document shown in Figure 2.5, we see that the document has structure and the file knows that the image of a dog represents the dog. Just hover over the image and you'll see the alternate text appear as a tooltip.

Figure 2.5: a PDF/A-3A example

The document is now accessible. When it's read using a screen reader, it will say "The quick brown fox jumps over the lazy dog."
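
To see why the Alt text makes the difference, we can model the tagged paragraph as a flat list of <Span> and <Figure> leaves and concatenate what a reader could speak. This is a toy model in plain Java, not iText code and not how an actual screen reader works; the ReadAloudSketch and Leaf classes are invented for this illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of a tagged paragraph; these classes are illustrative, not iText API. */
final class ReadAloudSketch {

    /** A leaf of the simplified structure tree. */
    static final class Leaf {
        final String tag;   // "Span" for text, "Figure" for an image
        final String text;  // text content, null for images
        final String alt;   // alternate text, null if none was provided

        Leaf(String tag, String text, String alt) {
            this.tag = tag;
            this.text = text;
            this.alt = alt;
        }
    }

    /** Builds the example sentence; pass null to model an image without Alt text. */
    static List<Leaf> sentence(String foxAlt, String dogAlt) {
        return Arrays.asList(
                new Leaf("Span", "The quick brown ", null),
                new Leaf("Figure", null, foxAlt),
                new Leaf("Span", " jumps over the lazy ", null),
                new Leaf("Figure", null, dogAlt));
    }

    /** Concatenates the speakable content: text, or Alt text for figures. */
    static String read(List<Leaf> paragraph) {
        StringBuilder spoken = new StringBuilder();
        for (Leaf leaf : paragraph) {
            if (leaf.text != null) {
                spoken.append(leaf.text);
            } else if (leaf.alt != null) {
                spoken.append(leaf.alt);
            }
            // A figure without Alt text contributes nothing.
        }
        return spoken.toString();
    }

    public static void main(String[] args) {
        // Tagged, but without Alt text: the images leave gaps in the sentence.
        System.out.println(read(sentence(null, null)));
        // With Alt text, the sentence is complete:
        // The quick brown Fox jumps over the lazy Dog
        System.out.println(read(sentence("Fox", "Dog")));
    }
}
```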

Incidentally, the PDF/A-3 files we've created are also compliant with the PDF/A-2 standard, because we didn't add any attachments yet. They are compliant with PDF/A-1 too, because we didn't use any of the new functionality introduced in PDF/A-2.

Once we create PDFs that represent invoices, we'll want to add an XML file that conforms to the ZUGFeRD model. Before we can do so, we need a database with invoice data.