Chapter 4: Creating reports using pdfHTML


Roughly speaking, there are three major ways to create PDF documents using iText:

  1. You can create a PDF document from scratch using iText objects such as Paragraph, Table, Cell, List,... The advantage of this approach is that everything is programmable, hence configurable just the way you want it. The disadvantage is that you need to program everything; even small changes such as changing one color into another, require a developer to change the Java code of the application, to recompile the code, etc.

  2. You can fill out a pre-existing form. On the one hand, there is AcroForm technology, which is fast and easy, but not dynamic (all fields have fixed positions). On the other hand, there is the XML Forms Architecture (XFA), which is dynamic and makes filling out the form easy, but creating the form is complex, and XFA has been deprecated since PDF 2.0.

  3. You can convert HTML and CSS to PDF using pdfHTML. This is easy because everyone knows some HTML, and everyone knows some CSS. Why would you create a template in another (proprietary?) format? Just create the content in HTML, then convert that content to PDF with pdfHTML using CSS for the definition of the styles.

This tutorial discusses this third approach, which is ideal when you have to create documents of a certain type, for instance catalogues, invoices, etc.

Description of a use case

Assume that you are a service provider in the business of creating invoices for different customers. All of these invoices share a similar structure regardless of the customer, but every customer wants you to use different fonts, different colors, a different layout. If you use the first approach, you'll have to write Java code every time a new customer signs up. If you use the second approach, you'll discover that you soon hit the limitations of the existing forms technology in PDF. If you use pdfHTML, you can build a system that requires a minimum of programming, and that doesn't take much effort to sign up a new customer.

When a new customer signs up, you need:

  • To get the data in such a way that it can easily be used to populate your HTML,

  • To get information about fonts, colors, layout,... in the form of a CSS file,

  • To get a single-page PDF document that can serve as company stationery.

In this chapter, we'll work with an XML file, movies.xml, containing data that will be presented in different ways using different XSLT transformations.

Figure 4.1: The movies.xml data file

Figure 4.1 shows that the root element of this XML file is called <movies>, and that the XML file consists of a series of <movie> tags containing information about a movie, such as the IMDB id (<id>), a title (<title>), the year in which the movie was produced (<year>), the director (<director>), a description (<description>), and the file name of the movie poster (<poster>).

We'll use this XML file for all the examples in this chapter, but the resulting PDFs will be quite different.

Converting XML to HTML using XSLT

In a first series of examples, we are going to use an XSL transformation that converts the XML into an HTML file consisting of one large table.

Figure 4.2: XSL to transform the XML into HTML with a table

When we examine the movies_table.xsl XSLT code in figure 4.2, we recognize the structure of an HTML page that will match the <movies> root element. We define a <table> object, and we use apply-templates which will, in this case, generate two rows for every <movie> tag. These rows will be populated with the movie data.

We don't use any external CSS, but there is some internal CSS in which we define pseudo-classes for the rows. Every odd row (tr:nth-child(odd)) will have #cc66ff as background color; every even row (tr:nth-child(even)) will have #ffff99 as background-color.
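Since the stylesheet is only shown as a figure, here is a minimal sketch of what such a transformation could look like. It is not the actual movies_table.xsl: the class name MovieTableSketch, the sample movie record, and the cell layout are assumptions made for illustration. Only the element names and the two row colors come from the text above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class MovieTableSketch {

    // Hypothetical sample record that follows the element names of movies.xml.
    static final String SAMPLE_XML =
        "<movies><movie>"
        + "<id>tt0000000</id><title>Sample title</title><year>2000</year>"
        + "<director>Sample director</director>"
        + "<description>Sample description</description>"
        + "<poster>sample.jpg</poster>"
        + "</movie></movies>";

    // Assumed stylesheet: one HTML page for <movies>, two table rows per <movie>.
    static final String SAMPLE_XSL =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"html\"/>"
        + "<xsl:template match=\"/movies\">"
        + "<html><head><style>"
        + "tr:nth-child(odd) { background-color: #cc66ff; } "
        + "tr:nth-child(even) { background-color: #ffff99; }"
        + "</style></head><body><table>"
        + "<xsl:apply-templates select=\"movie\"/>"
        + "</table></body></html>"
        + "</xsl:template>"
        + "<xsl:template match=\"movie\">"
        + "<tr><td><xsl:value-of select=\"title\"/> (<xsl:value-of select=\"year\"/>)</td>"
        + "<td><xsl:value-of select=\"director\"/></td></tr>"
        + "<tr><td colspan=\"2\"><xsl:value-of select=\"description\"/></td></tr>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to the XML and returns the resulting HTML as a string.
    static String toHtml(String xml, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter html = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(html));
            return html.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSL transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toHtml(SAMPLE_XML, SAMPLE_XSL));
    }
}
```

The sketch keeps everything in strings so that it runs without any files on disk; a real setup would read the XML and the XSLT from files.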

The result, shown in figure 4.3, is quite colorful –I apologize if it hurts the eyes, but remember that this is just an example to demonstrate the functionality.

Figure 4.3: the XML file rendered as a table in PDF

We call the createPdf() method of the C04E01_MovieTable example like this:

app.createPdf(app.createHtml(XML, XSL), BASEURI, DEST);

We don't store the HTML on disk; the createHtml() method creates the HTML file in memory using a ByteArrayOutputStream.

public byte[] createHtml(String xmlPath, String xslPath)
    throws IOException, TransformerException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    Writer writer = new OutputStreamWriter(baos);
    StreamSource xml = new StreamSource(new File(xmlPath));
    StreamSource xsl = new StreamSource(new File(xslPath));
    TransformerFactory factory = TransformerFactory.newInstance();
    Transformer transformer = factory.newTransformer(xsl);
    transformer.transform(xml, new StreamResult(writer));
    writer.flush();
    writer.close();
    return baos.toByteArray();
}

We pass the HTML bytes as a byte[] to the createPdf() method:

public void createPdf(byte[] html, String baseUri, String dest) throws IOException {
    ConverterProperties properties = new ConverterProperties();
    properties.setBaseUri(baseUri);
    HtmlConverter.convertToPdf(
        new ByteArrayInputStream(html), new FileOutputStream(dest), properties);
}

We use ConverterProperties so that the links to the images can be resolved.
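To see why the base URI matters, consider how a relative reference such as a poster file name is turned into a full location. The sketch below uses only java.net.URI from the JDK. It illustrates the general idea of resolving a reference against a base and is not pdfHTML's own resource resolver; the class name, host, and file names are hypothetical.

```java
import java.net.URI;

class BaseUriSketch {

    // Resolves a relative reference (for example the src of an <img>) against a base URI.
    static String resolve(String baseUri, String relativeRef) {
        return URI.create(baseUri).resolve(relativeRef).toString();
    }

    public static void main(String[] args) {
        // With a trailing slash, the base is treated as a folder.
        System.out.println(resolve("https://example.com/reports/", "img/poster.jpg"));
        // Without it, the last path segment is replaced by the reference.
        System.out.println(resolve("https://example.com/reports", "img/poster.jpg"));
    }
}
```

The second call shows a common pitfall with generic URI resolution: a base without a trailing slash resolves relative to its parent.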

We'll reuse the createHtml() method in all the other examples of this chapter, starting with the next one, in which we introduce company stationery as a background image.

Adding a background and a custom header or footer

Suppose that we have a single-page PDF document that can be used as company stationery –see the PDF on the left in figure 4.4. Suppose that we want to add this single page in the background of the PDF we are creating from HTML –see the PDF on the right. Suppose that we also want to add page numbers in a way that is not supported by the @page at-rule. See for instance the large, white number 1 on the first page of the resulting PDF.

Figure 4.4: Using a single-page PDF as company stationery

Chapter 7 of the iText 7: Building Blocks tutorial explains how you can meet these requirements. You can achieve this by using an event handler.

In the C04E02_MovieTable2 example, we create an IEventHandler implementation, named Background:

  1. class Background implements IEventHandler {
  2.     PdfXObject stationery;
  3.  
  4.     public Background(PdfDocument pdf, String src) throws IOException {
  5.         PdfDocument template = new PdfDocument(new PdfReader(src));
  6.         PdfPage page = template.getPage(1);
  7.         stationery = page.copyAsFormXObject(pdf);
  8.         template.close();
  9.     }
  10.  
  11.     @Override
  12.     public void handleEvent(Event event) {
  13.         PdfDocumentEvent docEvent = (PdfDocumentEvent) event;
  14.         PdfDocument pdf = docEvent.getDocument();
  15.         PdfPage page = docEvent.getPage();
  16.         PdfCanvas pdfCanvas = new PdfCanvas(
  17.             page.newContentStreamBefore(), page.getResources(), pdf);
  18.         pdfCanvas.addXObject(stationery, 0, 0);
  19.         Rectangle rect = new Rectangle(36, 32, 36, 64);
  20.         Canvas canvas = new Canvas(pdfCanvas, pdf, rect);
  21.         canvas.add(
  22.             new Paragraph(String.valueOf(pdf.getNumberOfPages()))
  23.                 .setFontSize(48).setFontColor(Color.WHITE));
  24.         canvas.close();
  25.     }
  26. }

We can create an instance of the Background class using the following parameters (line 4):

  • a PdfDocument instance of the document we are creating, pdf, and

  • the path to the source of the single-page PDF, src.

In this constructor, we read the single-page PDF into another PdfDocument instance, named template (line 5). We get the first page of this template (line 6), and we copy this page to the pdf instance as a Form XObject (line 7). This Form XObject is stored as a member-variable, stationery (see line 2). Finally, we close the template (line 8).

When an event is triggered, that event is handled by the handleEvent() method, which we override in line 12. We get a PdfCanvas object (line 16) for the current page (line 15) in the current PDF document (line 14). We want to get access to a canvas that will be drawn before anything else is drawn on the page. That's what newContentStreamBefore() means in line 17. We add the stationery Form XObject to the page at coordinates [ x=0, y=0 ] (line 18). The addXObject() method adds the single page we imported in the constructor to the current page as its background.

To add the page number, we first define a location (line 19), and we create a high-level Canvas object using the low-level PdfCanvas instance (line 20). To this canvas, we add the current number of pages as a Paragraph (line 22) with 48pt as font size and white as text color (line 23).

When will this event be triggered? That's defined in the createPdf() method:

  1. public void createPdf(byte[] html, String baseUri, String stationery, String dest)
  2.     throws IOException {
  3.     ConverterProperties properties = new ConverterProperties();
  4.     properties.setBaseUri(baseUri);
  5.     PdfWriter writer = new PdfWriter(dest);
  6.     PdfDocument pdf = new PdfDocument(writer);
  7.     IEventHandler handler = new Background(pdf, stationery);
  8.     pdf.addEventHandler(PdfDocumentEvent.START_PAGE, handler);
  9.     HtmlConverter.convertToPdf(new ByteArrayInputStream(html), pdf, properties);
  10. }

We create an instance of the Background class, named handler (line 7). We add this instance to the PdfDocument using the addEventHandler() method (line 8). With the PdfDocumentEvent.START_PAGE parameter, we indicate that the handleEvent() method needs to be invoked every time a page starts. In this case, the method will be called three times, because the content is distributed over three pages.

If we looked at the HTML file in a browser, we'd see one long page. When we render the same content to a PDF with page size A4, we get three pages. But what if we want to put all the content on one PDF page?

For example: some companies run a cron job that takes a snapshot of specific web pages every hour, every day, every month. It's not their intention to print this page; these companies just want an archive that allows them to know which content was online on a specific day at a specific hour.

How could they make sure that the PDF always consists of a single page of which the size is adapted to the size of the content?

Converting an HTML page to a single-page PDF

Figure 4.5 shows the same content we used for the previous examples on one long page measuring 8.26 x 26.29in.

Figure 4.5: Converting an HTML file to a single-page PDF document

We chose the width of the document ourselves –it's the width of an A4 page. But how do we determine the length of the page?

We can't determine the length in advance, because we only know the total height after all the content has been rendered. In the C04E03_MovieTable3.java example, we create a PDF with an initial page size of 595 x 14400 user units.

The height of 14,400 user units isn't chosen arbitrarily; it's an implementation limit of Adobe Acrobat and Adobe Reader. You can create a PDF with a page size greater than 14,400 user units in width or height, but Adobe Reader won't be able to render it. You'll only see a blank page.

We'll use the convertToDocument() method to create a Document instance. We'll use a trick to get the end position after rendering the content. We'll then change the page size so that it's reduced to the size of the content.

The createPdf() method shows us how this is done.

  1. public void createPdf(byte[] html, String baseUri, String dest)
  2.     throws IOException {
  3.     ConverterProperties properties = new ConverterProperties();
  4.     properties.setBaseUri(baseUri);
  5.     PdfWriter writer = new PdfWriter(dest);
  6.     PdfDocument pdf = new PdfDocument(writer);
  7.     pdf.setDefaultPageSize(new PageSize(595, 14400));
  8.     Document document = HtmlConverter.convertToDocument(
  9.         new ByteArrayInputStream(html), pdf, properties);
  10.     EndPosition endPosition = new EndPosition();
  11.     LineSeparator separator = new LineSeparator(endPosition);
  12.     document.add(separator);
  13.     document.getRenderer().close();
  14.     PdfPage page = pdf.getPage(1);
  15.     float y = endPosition.getY() - 36;
  16.     page.setMediaBox(new Rectangle(0, y, 595, 14400 - y));
  17.     document.close();
  18. }

There's nothing new in lines 1 to 6. We set the extraordinary page size in line 7, and we convert the HTML to a Document instance in lines 8 and 9. In line 10, we create an instance of the EndPosition class. We'll pass this instance to a LineSeparator (line 11), and we'll add this separator to the Document (line 12). We close the document's renderer in line 13. This will cause all the content to be rendered, including the line separator.

We then get the page object of the first page (line 14), assuming that this is the only page in the document. This will be true as long as the required height is less than 14,400 user units.

Finally, we get the Y-value of the end position (line 15), and we use this y value to change the page size of the first page (line 16). After changing this page size, we close the document (line 17).
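The arithmetic behind lines 15 and 16 can be checked in isolation. The following sketch is plain Java without iText; the class and method names are hypothetical, and the end position passed in is an assumed value chosen to be consistent with the page height reported for figure 4.5, not a value taken from the example. With the default of 72 user units per inch, 14,400 user units correspond to 200 inches.

```java
class MediaBoxSketch {
    static final float PAGE_WIDTH = 595;    // A4 width in user units
    static final float MAX_HEIGHT = 14400;  // initial page height in user units
    static final float MARGIN = 36;         // space kept below the content (line 15)

    // Returns {x, y, width, height} of the reduced media box (line 16).
    static float[] reducedMediaBox(float endY) {
        float y = endY - MARGIN;
        return new float[] {0, y, PAGE_WIDTH, MAX_HEIGHT - y};
    }

    // Converts user units to inches, assuming the default of 72 units per inch.
    static float toInches(float userUnits) {
        return userUnits / 72f;
    }

    public static void main(String[] args) {
        float[] box = reducedMediaBox(12543f); // assumed end position of the content
        System.out.printf("lower-left y: %.0f, height: %.0f units (%.2f in)%n",
            box[1], box[3], toInches(box[3]));
    }
}
```

Because PDF coordinates start at the bottom of the page, the content begins at the top (y = 14400) and grows downward, so the new media box keeps the top edge and moves the lower edge up to just below the content.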

What happened here? We added a LineSeparator, but when we look at the resulting PDF, we don't see any line. That's because we created an ILineDrawer implementation that doesn't draw anything. Instead, we use the ILineDrawer to get the Y-coordinate of the end of the content.

Let's take a look at the EndPosition class to see how this works:

  1. class EndPosition implements ILineDrawer {
  2.     protected float y;
  3.  
  4.     public float getY() {
  5.         return y;
  6.     }
  7.  
  8.     @Override
  9.     public void draw(PdfCanvas pdfCanvas, Rectangle rect) {
  10.         this.y = rect.getY();
  11.     }
  12.     @Override
  13.     public Color getColor() {
  14.         return null;
  15.     }
  16.     @Override
  17.     public float getLineWidth() {
  18.         return 0;
  19.     }
  20.     @Override
  21.     public void setColor(Color color) {
  22.     }
  23.     @Override
  24.     public void setLineWidth(float lineWidth) {
  25.     }
  26. }

We override all the methods of the ILineDrawer interface, but only one method is important to us: the draw() method. This method gives us a Rectangle instance that marks the current position of the cursor in the PDF at the moment the LineSeparator is to be rendered. We don't draw anything at this position. Instead, we retrieve the Y-coordinate, which we store in a member-variable. After the LineSeparator has been "rendered", we can retrieve this Y-position using the getY() method.

In the next example, we'll use a different XSLT file to create a different view on the data. We'll also introduce bookmarks.

Adding bookmarks to a report

Figure 4.6 shows a PDF with the same content we had before, but organized in a slightly different way because we now used the movies_overview.xsl file to transform the XML to HTML.

Figure 4.6: Creating a PDF with bookmarks

Observe that the resulting PDF document has bookmarks. When we click on the title of a movie, we jump to the location in the document where we can find more information about this movie.

The C04E04_MovieOverview example shows why these bookmarks were added.

public void createPdf(byte[] html, String baseUri, String dest) throws IOException {
    ConverterProperties properties = new ConverterProperties();
    properties.setBaseUri(baseUri);
    OutlineHandler outlineHandler = OutlineHandler.createStandardHandler();
    properties.setOutlineHandler(outlineHandler);
    HtmlConverter.convertToPdf(
        new ByteArrayInputStream(html), new FileOutputStream(dest), properties);
}

Creating bookmarks –or outlines as they are called in the PDF standard– is done by creating an OutlineHandler, and passing this outline handler to the ConverterProperties.

In this example, we used the createStandardHandler() method to create a standard handler. In practice, this means that pdfHTML will look for <h1>, <h2>, <h3>, <h4>, <h5>, and <h6>. The bookmarks will be created based on the hierarchy of those tags in the HTML file. In the movie overview we created, we only have <h1> tags. That explains why the bookmarks are only one level deep.

We can also create a custom OutlineHandler.

In figure 4.7, we see a second level of bookmarks consisting of the names of the directors of each movie.

Figure 4.7: Creating a PDF with bookmarks (second example)

The directors of each movie were added to the overview using <p> tags, whereas the rest of the info was added using <div> tags. Knowing this, we can create a custom OutlineHandler that looks for <h1> and <p> tags when creating the outlines. This is done in the C04E05_MovieOverview2 example.

public void createPdf(byte[] html, String baseUri, String dest)
    throws IOException {
    ConverterProperties properties = new ConverterProperties();
    properties.setBaseUri(baseUri);
    OutlineHandler outlineHandler = new OutlineHandler();
    outlineHandler.putTagPriorityMapping("h1", 1);
    outlineHandler.putTagPriorityMapping("p", 2);
    properties.setOutlineHandler(outlineHandler);
    HtmlConverter.convertToPdf(
        new ByteArrayInputStream(html), new FileOutputStream(dest), properties);
}

In this createPdf() method, we create a new OutlineHandler, and we add tag priorities for the <h1>-tag (priority 1) and the <p>-tag (priority 2). Should we have <h2>, <h3>, or any other tag in our HTML, then those tags would be ignored when creating the outline tree.
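The effect of tag priorities can be illustrated with a small simulation. This is plain Java that mimics the idea only: a lower priority number means a higher outline level, and tags without a mapping are skipped. It is not pdfHTML's OutlineHandler implementation; the class name, the input format, and the sample tag sequence are assumptions, and deriving the indentation directly from the priority is a simplification of real outline tree building.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class OutlineSketch {
    private final Map<String, Integer> priorities = new LinkedHashMap<>();

    // Registers a tag name with its outline level (1 = top level).
    OutlineSketch putTagPriorityMapping(String tag, int priority) {
        priorities.put(tag, priority);
        return this;
    }

    // Each element is {tagName, text}; returns bookmark titles indented by level.
    List<String> buildOutline(List<String[]> elements) {
        List<String> outline = new ArrayList<>();
        for (String[] element : elements) {
            Integer priority = priorities.get(element[0]);
            if (priority == null) {
                continue; // tags without a mapping produce no bookmark
            }
            StringBuilder line = new StringBuilder();
            for (int i = 1; i < priority; i++) {
                line.append("  "); // two spaces per nesting level
            }
            outline.add(line.append(element[1]).toString());
        }
        return outline;
    }

    public static void main(String[] args) {
        List<String[]> elements = new ArrayList<>();
        elements.add(new String[] {"h1", "Movie A"});     // hypothetical content
        elements.add(new String[] {"p", "Director A"});
        elements.add(new String[] {"div", "Description"}); // no mapping, skipped
        OutlineSketch sketch = new OutlineSketch()
            .putTagPriorityMapping("h1", 1)
            .putTagPriorityMapping("p", 2);
        for (String line : sketch.buildOutline(elements)) {
            System.out.println(line);
        }
    }
}
```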

In the next set of examples, we are going to create some invoices. In many countries, the law requires companies to archive invoices for a certain number of years. There is a subset of PDF called PDF/A where the A stands for Archiving. PDF/A is the format you need for the long-term preservation of documents. When creating invoices, it's considered best practice to create them in the PDF/A format.

Creating PDF/A documents with pdfHTML

Figure 4.8 shows a PDF invoice in the PDF/A-2B format. It was created from the same XML we used for the previous examples in this chapter, but with a different XSLT file, movies_invoice.xsl.

Figure 4.8: a PDF/A-2B document

PDF/A is also known as the ISO 19005 standard. It's a subset of ISO 32000 defining a set of obligations and restrictions. For instance:

  • There is the obligation for the file to contain metadata in the eXtensible Metadata Platform (XMP) format described in ISO 16684,

  • You need to add the correct color profile to the file, so that there are no ambiguities about colors,

  • The document must be self-contained: all fonts need to be embedded, no external movie, sound or other binary files are allowed, and so on.

  • JavaScript is not allowed, nor is encryption.

At the time of writing, this standard has three parts. Approved parts will never become invalid. New parts are created to define new, useful features.

  • PDF/A-1 dates from 2005. It's based on PDF 1.4, and it defines two levels: B is the "basic" level that ensures the preservation of the visual appearance; A is the "Accessible" level which adds the requirement for the PDF to be tagged on top of the requirements for Level B.

  • PDF/A-2 dates from 2011. It's based on ISO 32000-1, and it adds some features to PDF/A-1 that were introduced in PDF 1.5, 1.6, and 1.7, such as support for JPEG2000, collections, object-level XMP, and optional content. There's also improved support for transparency, comment types and annotations, and digital signatures. It defines three levels: the "basic" level B; the "accessible" level A; and the "unicode" level U. Level U is similar to Level B, but with the extra requirement that all text needs to be stored in Unicode.

  • PDF/A-3 dates from 2012. It's identical to PDF/A-2 with one major difference: in PDF/A-2 all attachments need to be PDF documents that are compliant with the PDF/A-1 or PDF/A-2 standard; in PDF/A-3, all kinds of attachments are allowed (regular PDF files, XML, docx, xlsx,...).

Aside from the different layout we used to create a document that looks like an invoice, we had to make an important change to the HTML that relates to PDF/A. In the movies_invoice.xsl XSLT file, we define a font in the <body>-tag: <body style="font-family: FreeSans">.

As we'll see in chapter 6, FreeSans is a font that is shipped with pdfHTML and that is always embedded, as opposed to the default font Helvetica –which is the font that was used in the previous examples. Embedding all fonts is one of the requirements of PDF/A.

Let's take a look at the createPdf() method of the C04E06_MovieInvoice example to see what else is different:

public void createPdf(byte[] html, String baseUri, String dest, String intent) throws IOException {
    PdfWriter writer = new PdfWriter(dest);
    PdfADocument pdf = new PdfADocument(writer,
        PdfAConformanceLevel.PDF_A_2B,
        new PdfOutputIntent("Custom", "", "http://www.color.org",
        "sRGB IEC61966-2.1", new FileInputStream(intent)));
    ConverterProperties properties = new ConverterProperties();
    properties.setBaseUri(baseUri);
    HtmlConverter.convertToPdf(new ByteArrayInputStream(html), pdf, properties);
}

The only difference with the previous examples, is that we now use a PdfADocument instead of merely a PdfDocument. We add the PdfAConformanceLevel –in this case PDF_A_2B for PDF/A-2B conformance– as a parameter for the constructor, and we pass the color profile using a PdfOutputIntent object.

Creating a PDF/A-2A file only requires two minor changes. See the C04E07_MovieInvoice2 example:

public void createPdf(byte[] html, String baseUri, String dest, String intent)
    throws IOException {
    PdfWriter writer = new PdfWriter(dest);
    PdfADocument pdf = new PdfADocument(writer,
        PdfAConformanceLevel.PDF_A_2A,
        new PdfOutputIntent("Custom", "", "http://www.color.org",
        "sRGB IEC61966-2.1", new FileInputStream(intent)));
    pdf.setTagged();
    ConverterProperties properties = new ConverterProperties();
    properties.setBaseUri(baseUri);
    HtmlConverter.convertToPdf(new ByteArrayInputStream(html), pdf, properties);
}

We change the PdfAConformanceLevel from PDF_A_2B to PDF_A_2A, and since the second A stands for "accessible" PDF, we need to make sure that we create a Tagged PDF, hence the extra line pdf.setTagged(). Figure 4.9 shows that Adobe Acrobat assumes that the document is compliant with PDF/A-2A as well as with the PDF/UA-1 standard, where UA stands for Universal Accessibility.

Figure 4.9: a PDF/A-2A - PDF/UA-1 document

We can use Preflight to verify if this file is compliant with the PDF/A-2A standard, but it's impossible to check full compliance with the PDF/UA-1 standard. PDF/UA has a series of requirements that can only be verified by a human being. For instance: only a human being can check if the PDF was properly tagged; that is: if all the semantic information is correct.

Figure 4.10: Semantic structure of the PDF invoice

In the previous chapter, we already created some Tagged PDF files, and we briefly discussed that Tagged PDF is important both for people with disabilities who use assistive technology (AT) and in the context of Next-Generation PDF. In figure 4.10, we see that iText added a table structure (see the <table>, <TR>, and <TD> tags). It's up to a human being to check whether or not this table structure is the correct semantic representation of the content.

If we added an attachment to a PDF/A-2 file, that attachment would have to be a PDF/A-1 or PDF/A-2 document too. This requirement doesn't exist for PDF/A-3. For instance: we could add the original XML file that was used to create the HTML as extra data to the PDF document.

That's what we've done in the C04E08_MovieInvoice3 example:

public void createPdf(
    byte[] xml, byte[] html, String baseUri, String dest, String intent)
    throws IOException {
    PdfWriter writer = new PdfWriter(dest);
    PdfADocument pdf = new PdfADocument(writer,
        PdfAConformanceLevel.PDF_A_3A,
        new PdfOutputIntent("Custom", "", "http://www.color.org",
        "sRGB IEC61966-2.1", new FileInputStream(intent)));
    pdf.setTagged();
    pdf.addFileAttachment(
        "Movie info", xml, "movies.xml",
        PdfName.ApplicationXml, new PdfDictionary(), PdfName.Data);
    ConverterProperties properties = new ConverterProperties();
    properties.setBaseUri(baseUri);
    HtmlConverter.convertToPdf(new ByteArrayInputStream(html), pdf, properties);
}

We changed PDF_A_2A into PDF_A_3A, and we used the addFileAttachment() method to add an attachment that consists of Data that provides more info about the movies mentioned on the invoice.

We can see this attachment when we open the attachment panel in Adobe Reader; see figure 4.11.

Figure 4.11: A PDF/A-3A document with an XML attachment

It's no coincidence that I chose the example of an invoice to demonstrate the PDF/A functionality. As a matter of fact, several countries use a PDF invoice standard, known as the ZUGFeRD standard. This standard is based on PDF/A-3B, and requires the PDF to have an attachment that complies with the Cross Industry Invoice (CII) standard.

If you want to know more about the ZUGFeRD standard, or if you want to use pdfHTML to create invoices that are compliant with the ZUGFeRD standard, please read the ZUGFeRD: The Future of Invoicing tutorial.

Summary

In this chapter, we used XSLT to create HTML from XML in order to create reports and invoices. We found out how we can create bookmarks in an automated way, and we learned how to create PDF/A documents. By doing so, we covered some standard use cases that exist at the core of many different industries. In the next chapter, we'll discover how to extend pdfHTML with custom functionality such as support for custom tags and custom CSS behavior.