Chapter 1: Hello HTML to PDF


In this chapter, we'll convert a simple HTML file to a PDF document in many different ways. The content of the HTML file will consist of a "Test" header, a "Hello World" paragraph, and an image representing the iText logo.

Structure of the examples

All the examples throughout this book will have a similar structure.

INPUT:

For the input, we'll provide HTML syntax. In this tutorial, we'll use an HTML String, a path to an HTML file, or –in chapter 4– a path to an XML file along with the path to an XSLT file to convert the XML to HTML.

In the first example, C01E01_HelloWorld, the HTML is provided as a String:

public static final String HTML = "<h1>Test</h1><p>Hello World</p>";

In other examples, such as C01E03_HelloWorld, we'll use two constants:

  • a BASEURI constant for the path to the parent folder that contains the source HTML and resources such as images and CSS, and

  • a SRC constant with the path to that source HTML file.

For instance:

public static final String BASEURI = "src/main/resources/html/";
public static final String SRC = String.format("%shello.html", BASEURI);

OUTPUT:

We'll use a similar structure for the output:

  • a TARGET constant for the path to the folder to which we'll write the resulting PDF, and

  • a DEST constant with the path to that PDF.

For instance:

public static final String TARGET = "target/results/ch01/";
public static final String DEST = String.format("%stest-03.pdf", TARGET);
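To see what these constants resolve to, here is a minimal sketch that uses only the JDK. The class name PathConstants is hypothetical; the constant values are copied from the snippets above.

```java
// Sketch only: PathConstants is a hypothetical class name, not one of the
// book's examples. The values are the ones shown in the text.
final class PathConstants {
    static final String BASEURI = "src/main/resources/html/";
    // "%s" is replaced by BASEURI, so SRC is the base folder plus the file name.
    static final String SRC = String.format("%shello.html", BASEURI);

    static final String TARGET = "target/results/ch01/";
    static final String DEST = String.format("%stest-03.pdf", TARGET);

    public static void main(String[] args) {
        System.out.println(SRC);  // src/main/resources/html/hello.html
        System.out.println(DEST); // target/results/ch01/test-03.pdf
    }
}
```

Because BASEURI and TARGET both end with a slash, the format string needs no separator of its own.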

MAIN METHOD:

The main() method of all the examples in this book won't differ much from the main() method of our first example:

  1. public static void main(String[] args) throws IOException {
  2.     LicenseKey.loadLicenseFile(
  3.         System.getenv("ITEXT7_LICENSEKEY") + "/itextkey-html2pdf_typography.xml");
  4.     File file = new File(TARGET);
  5.     file.mkdirs();
  6.     new C01E01_HelloWorld().createPdf(HTML, DEST);
  7. }

First we load the iText license file (lines 2-3). This is an XML file containing a license key for using iText. You might not need this license key if you are using iText and pdfHTML in the context of an AGPL project. However, you will need the pdfCalligraph add-on for the internationalization examples in chapter 6, and pdfCalligraph isn't available under the AGPL; it is available only as a closed-source add-on.

The license key we are using in the examples of this book is similar to the key you will get if you purchase a commercial license to use iText 7, pdfHTML, and pdfCalligraph in a closed source context.

In lines 4 and 5, we create the target directory in case it doesn't exist yet. In line 6, we call the createPdf() method. We can implement this method in many different ways.
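The directory-creation step in lines 4 and 5 can be tried in isolation. The following is a minimal sketch using only the JDK; the class name TargetDirectory and the helper ensureDirectory() are hypothetical and not part of the book's examples. It writes under the system temp directory instead of target/results/ch01/ so that it leaves a project untouched.

```java
import java.io.File;

// Sketch only: TargetDirectory and ensureDirectory() are hypothetical names.
final class TargetDirectory {
    /**
     * Creates the directory and any missing parents, then reports whether
     * the path exists as a directory.
     */
    static boolean ensureDirectory(String dir) {
        File file = new File(dir);
        // mkdirs() returns false when the directory already exists, so its
        // return value alone cannot tell "already there" from "failed".
        file.mkdirs();
        return file.isDirectory();
    }

    public static void main(String[] args) {
        // Assumption: the temp directory is writable on this machine.
        File target = new File(System.getProperty("java.io.tmpdir"),
                "pdfhtml-demo-" + System.nanoTime() + "/results/ch01");
        System.out.println(ensureDirectory(target.getPath())); // first call creates it
        System.out.println(ensureDirectory(target.getPath())); // second call finds it
    }
}
```

Checking isDirectory() after mkdirs() makes the call safe to repeat, which is why the examples can run it unconditionally before every conversion.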

Converting HTML to PDF

The implementation of the createPdf() method of the C01E01_HelloWorld example is very simple. Its body consists of a single line:

public void createPdf(String html, String dest) throws IOException {
    HtmlConverter.convertToPdf(html, new FileOutputStream(dest));
}

The HtmlConverter class has a selection of static convertToPdf() methods that take different parameters depending on the use case. In the first example, the first parameter html is a String with the following value:

public static final String HTML = "<h1>Test</h1><p>Hello World</p>";

This HTML snippet is converted to the PDF document that is shown in figure 1.1.

Figure 1.1: converting an HTML snippet to PDF

Let's introduce an image, and use the following String:

public static final String HTML =
    "<h1>Test</h1><p>Hello World</p><img src=\"img/logo.png\">";

This HTML snippet contains a relative link to the image file logo.png in a subdirectory named img. It's impossible for iText to guess where to look for this subdirectory, hence we'll configure the base URI for the conversion process.

This is done using the ConverterProperties object, as shown in the createPdf() method of the C01E02_HelloWorld example.

public void createPdf(String baseUri, String html, String dest) throws IOException {
    ConverterProperties properties = new ConverterProperties();
    properties.setBaseUri(baseUri);
    HtmlConverter.convertToPdf(html, new FileOutputStream(dest), properties);
}

We create a ConverterProperties object, and we set the base URI to the parent directory of the img directory where iText can find the logo.png file.

Figure 1.2 shows the result.

Figure 1.2: converting an HTML snippet containing a reference to an image

In most of the examples that follow, we won't use HTML stored in a String. Instead, we are going to convert an HTML file on disk into a file on disk.

For the rest of the examples in this chapter, we'll use the file named hello.html shown in figure 1.3.

Figure 1.3: hello.html shown in a browser as well as in a text editor

There are different ways to convert this file to a PDF document.

In the C01E03_HelloWorld.java example, we use File objects:

public void createPdf(String baseUri, String src, String dest) throws IOException {
    HtmlConverter.convertToPdf(new File(src), new File(dest));
}

The first parameter of the convertToPdf() method refers to the source HTML file, the second parameter to the destination PDF file. In this case, we don't need to set any converter properties. If file is the File object of the HTML file, iText uses file.getParent() to get the parent directory, and uses this parent directory as the base URI.
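The File.getParent() behavior mentioned above can be checked without iText. This is a minimal sketch using only java.io.File; the class name ParentAsBaseUri and the helper parentOf() are hypothetical, and the sketch does not show what pdfHTML does internally beyond what the paragraph states.

```java
import java.io.File;

// Sketch only: ParentAsBaseUri and parentOf() are hypothetical names.
final class ParentAsBaseUri {
    /**
     * Returns the parent directory of the given path, or an empty string
     * when the path has no parent component (getParent() returns null then).
     */
    static String parentOf(String htmlPath) {
        String parent = new File(htmlPath).getParent();
        return parent == null ? "" : parent;
    }

    public static void main(String[] args) {
        // Prints the html folder; the separator character depends on the platform.
        System.out.println(parentOf("src/main/resources/html/hello.html"));
        // A bare file name has no parent component.
        System.out.println(parentOf("hello.html").isEmpty());
    }
}
```

With the SRC value used in this chapter, the parent is the same folder as BASEURI, which is why relative links such as img/logo.png resolve without any extra configuration.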

This doesn't work for the C01E04_HelloWorld example where we use FileInputStream and FileOutputStream objects instead of File objects:

public void createPdf(String baseUri, String src, String dest) throws IOException {
    ConverterProperties properties = new ConverterProperties();
    properties.setBaseUri(baseUri);
    HtmlConverter.convertToPdf(
        new FileInputStream(src), new FileOutputStream(dest), properties);
}

You can't retrieve a parent from an InputStream, hence we need to pass a base URI to the converter using a ConverterProperties instance. The resulting PDFs of the third and fourth examples look identical to the resulting PDF of the second example shown in figure 1.2. So does the resulting PDF of the fifth example, C01E05_HelloWorld:

public void createPdf(String baseUri, String src, String dest) throws IOException { 
    ConverterProperties properties = new ConverterProperties();
    properties.setBaseUri(baseUri);
    PdfWriter writer = new PdfWriter(dest,
        new WriterProperties().setFullCompressionMode(true));
    HtmlConverter.convertToPdf(new FileInputStream(src), writer, properties);
}

In this case, we use a PdfWriter instance instead of a FileOutputStream. Using a PdfWriter can be useful if you want to set certain writer properties.

For more information on writer properties, please read Chapter 7 of the iText 7: Building Blocks tutorial, entitled "Handling events; setting viewer preferences and printer properties."

In this example, we create the PDF in full compression mode. To the human eye, the resulting PDF looks identical, but when you compare the file size of the PDF generated in example 4 with the file size of the PDF generated in this example, you see that full compression won us a handful of bytes.

Figure 1.4 shows 3,430 bytes when using compression as was done in PDF 1.0 to PDF 1.4, whereas the file is only 3,263 bytes when using compression as introduced in PDF 1.5. That difference might seem small, but the more objects your PDF has, the more sense it makes to use full compression.
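To put those two numbers in proportion, here is a small arithmetic sketch. The class name CompressionSavings and its helpers are hypothetical; the byte counts are the ones quoted in the text.

```java
import java.util.Locale;

// Sketch only: CompressionSavings is a hypothetical helper class.
final class CompressionSavings {
    /** Number of bytes saved by going from 'before' to 'after'. */
    static long savedBytes(long before, long after) {
        return before - after;
    }

    /** Savings as a percentage of the original size. */
    static double savedPercent(long before, long after) {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) {
        long regular = 3430; // example 4, size quoted in the text
        long full = 3263;    // example 5, full compression, size quoted in the text
        // Locale.ROOT keeps the decimal separator a dot on every machine.
        System.out.println(String.format(Locale.ROOT, "%d bytes saved (%.1f%%)",
                savedBytes(regular, full), savedPercent(regular, full)));
        // prints: 167 bytes saved (4.9%)
    }
}
```

For real files, the same helpers could be fed with the values returned by File.length() for the two results.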

Figure 1.4: comparing file sizes

In the C01E06_HelloWorld example, we've replaced the PdfWriter parameter with a PdfDocument parameter.

public void createPdf(String baseUri, String src, String dest) throws IOException { 
    ConverterProperties properties = new ConverterProperties();
    properties.setBaseUri(baseUri);
    PdfWriter writer = new PdfWriter(dest);
    PdfDocument pdf = new PdfDocument(writer);
    pdf.setTagged();
    HtmlConverter.convertToPdf(new FileInputStream(src), pdf, properties);
}

Using a PdfDocument instance makes sense if you want to configure a feature at the PdfDocument level. In this case, we introduce the line pdf.setTagged(), which instructs iText to create a Tagged PDF.

Figure 1.5 shows the resulting PDF with the Tags panel opened.

Figure 1.5: creating Tagged PDF

Looking at the Tags panel, you can see the structure of the content. When hovering over the image, you see the value of the alt attribute of the <img> tag as a tooltip.

For more info on Tagged PDF, please read Chapter 7 of the iText 7: Jump-Start Tutorial, entitled "Creating PDF/UA and PDF/A documents."

We'll dive deeper into Tagged PDF and making PDFs "accessible" in chapter 3.

Converting HTML to iText objects

The convertToPdf() methods create a complete PDF file. Any File, OutputStream, PdfWriter, or PdfDocument that is passed to the convertToPdf() method is closed once the input is parsed and converted to PDF. This might not always be what you want.

In some cases, you want to add some extra information to the Document, or maybe you don't want to convert the HTML to a PDF file, but to a series of iText objects you can use for a different purpose. That's what the convertToDocument() and convertToElements() methods are about.

In the C01E07_HelloWorld example, we convert our Hello World HTML to a Document because we want to add some extra content after we've finished parsing the HTML:

public void createPdf(String baseUri, String src, String dest) throws IOException { 
    ConverterProperties properties = new ConverterProperties();
    properties.setBaseUri(baseUri);
    PdfWriter writer = new PdfWriter(dest);
    PdfDocument pdf = new PdfDocument(writer);
    Document document =
        HtmlConverter.convertToDocument(new FileInputStream(src), pdf, properties);
    document.add(new Paragraph("Goodbye!"));
    document.close();
}

The convertToDocument() method returns an iText Document instance. We use this Document instance to add some extra content ("Goodbye!") after the HTML has been parsed.

Figure 1.6: using the convertToDocument() method

The upper part of the content in figure 1.6 was added by parsing HTML to PDF; the lower part –the "Goodbye!" at the end– was added using a document.add() instruction.

In the C01E08_HelloWorld example, we use the convertToElements() method. This method creates a List of IElement objects. The IElement interface is implemented by all the iText building blocks.

For more info about iText's building blocks, please read the iText 7: Building Blocks tutorial.

This last example of chapter 1 adds every top-level object of the List<IElement> collection to a Document, preceded by a Paragraph that shows the name of that object:

public void createPdf(String baseUri, String src, String dest) throws IOException { 
    ConverterProperties properties = new ConverterProperties();
    properties.setBaseUri(baseUri);
    List<IElement> elements =
        HtmlConverter.convertToElements(new FileInputStream(src), properties);
    PdfDocument pdf = new PdfDocument(new PdfWriter(dest));
    Document document = new Document(pdf);
    for (IElement element : elements) {
        document.add(new Paragraph(element.getClass().getName()));
        document.add((IBlockElement)element);
    }
    document.close();
}

Looking at figure 1.7, we see that the list consists of three elements: one Div and two Paragraph objects.

Figure 1.7: adding elements one at a time

The header is treated as a Div, whereas the logo image is wrapped inside a Paragraph. Don't worry about this; this is part of the inner workings of iText. It's the end result that matters.

Summary

In this chapter, we've taken one very simple HTML file, and we've converted that file to PDF using different implementations of the conversion methods convertToPdf(), convertToDocument(), and convertToElements(). When you consult the API documentation for the HtmlConverter class, you'll discover some more variations on those methods. In the next chapter, we'll pick one of those methods to convert different HTML files. Each of these HTML files will use CSS in a different way.