Chapter 6: Using fonts in pdfHTML

Up until now, we haven't paid much attention to the fonts that were used when we converted HTML to PDF. We know that Helvetica is the default font used by iText when no font is specified (chapter 2), and we know that pdfHTML ships with some built-in fonts if you need to embed a font (chapter 4), but we haven't yet had a clear overview of which fonts are supported.

There are two things you need to know before reading this chapter:

  • The "iText core" library supports Type1 fonts (.AFM/.PFB), the old TrueType fonts (.TTF), OpenType fonts with Type1 outlines (.otf), OpenType fonts with TrueType outlines (.ttf) and TrueType collections (.ttc), as well as the Web Open Font Format (.woff).

  • The pdfHTML add-on uses a DefaultFontProvider that by default only provides support for the 14 Standard Type 1 fonts and 12 fonts that are built into pdfHTML. You can configure the font provider to support more fonts.
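
As a rough illustration of the first bullet, the following sketch maps a file extension to the kind of font program it usually denotes. The class FontFileKind and its describe() method are hypothetical helpers written for this example; they aren't part of iText, and an extension is only a hint about what a file really contains.

```java
import java.util.Locale;

// Hypothetical helper, not part of iText: guesses the kind of font program
// from the file extension, following the list of formats given above.
final class FontFileKind {

    static String describe(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".afm") || lower.endsWith(".pfb")) {
            return "Type 1";
        }
        if (lower.endsWith(".woff")) {
            return "WOFF";
        }
        if (lower.endsWith(".ttc")) {
            return "TrueType collection";
        }
        if (lower.endsWith(".otf")) {
            return "OpenType";
        }
        if (lower.endsWith(".ttf")) {
            // Either an old TrueType font or an OpenType font with TrueType outlines.
            return "TrueType or OpenType";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(describe("Cardo-Regular.ttf"));
        System.out.println(describe("SourceSerifPro-Regular.otf.woff"));
    }
}
```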

In this chapter, we're going to look at some examples that use the default fonts provided in pdfHTML, and we're going to unlock access to all the other types of fonts that are supported by the core library.

Standard Type 1 fonts

Section 9.6.2.2 of ISO 32000 (part 1 as well as part 2) provides a list of the Standard Type 1 Fonts (aka Standard 14 Fonts).

Section 9.6.2.2: Standard Type 1 Fonts (Standard 14 Fonts)

The PostScript names of 14 Type 1 fonts, known as the standard 14 fonts, are as follows: Times-Roman, Helvetica, Courier, Symbol, Times-Bold, Helvetica-Bold, Courier-Bold, ZapfDingbats, Times-Italic, Helvetica-Oblique, Courier-Oblique, Times-BoldItalic, Helvetica-BoldOblique, Courier-BoldOblique.

These fonts, or their font metrics and suitable substitution fonts, shall be available to the conforming reader.

The shall in that last sentence means that you don't have to embed these fonts when creating a PDF document, because you can expect that every PDF viewer knows how to render these fourteen fonts. iText ships with the 14 Adobe Font Metrics (AFM) files that correspond with these Standard 14 fonts, which means that these fonts are always supported. However, since the corresponding Printer Font Binaries are proprietary, iText will never embed these fonts.
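
The fourteen PostScript names quoted from the specification can be captured in a small lookup. This is only a sketch to make the list tangible: the class Standard14Fonts and its isStandard() method are made up for this example and don't correspond to any iText API.

```java
import java.util.Set;

// Hypothetical lookup, not part of iText: the PostScript names of the
// Standard 14 fonts as listed in section 9.6.2.2 of ISO 32000.
final class Standard14Fonts {

    static final Set<String> NAMES = Set.of(
            "Times-Roman", "Helvetica", "Courier", "Symbol",
            "Times-Bold", "Helvetica-Bold", "Courier-Bold", "ZapfDingbats",
            "Times-Italic", "Helvetica-Oblique", "Courier-Oblique",
            "Times-BoldItalic", "Helvetica-BoldOblique", "Courier-BoldOblique");

    // True if the given PostScript name is one of the Standard 14 fonts.
    static boolean isStandard(String postScriptName) {
        return postScriptName != null && NAMES.contains(postScriptName);
    }

    public static void main(String[] args) {
        System.out.println(isStandard("Helvetica-Oblique"));
        System.out.println(isStandard("ArialMT"));
    }
}
```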

The fonts_standardtype1.html HTML page lists the fourteen fonts: 4 Helvetica fonts, 4 Times fonts, 4 Courier fonts, Symbol, and ZapfDingbats. As you can tell from figure 6.1, Helvetica, Times, and Courier are rendered correctly by the browser.

Figure 6.1: Standard Type 1 fonts (HTML)

Symbol and ZapfDingbats are fonts with a custom encoding. They don't play well with HTML. In the FAQ chapter, we'll discover that there are other fonts that are better suited for symbols such as the ones provided in Symbol and ZapfDingbats. For now, we only have numbers (0123456789) for the Symbol font and a non-breaking space character for the ZapfDingbats font.

In the C06E01_StandardType1 example, we use the simple createPdf() method we've used many times in the previous chapters:

public void createPdf(String src, String dest) throws IOException {
    HtmlConverter.convertToPdf(new File(src), new File(dest));
}

When we look at the Fonts tab in the Document Properties shown in figure 6.2, we see that all 14 fonts were used in the PDF document.

Figure 6.2: Standard Type 1 fonts (PDF)

None of these fonts are embedded, because iText only ships with the font metrics, not with the font binaries. The fonts are substituted by a font that is available on the local machine. In this case, Courier has been replaced by CourierStd, Helvetica has been replaced by ArialMT, and Times-Roman by TimesNewRomanPSMT.

Different PDF viewers on different operating systems may use other fonts as the "Actual Font." This can be problematic, for instance when you want to create PDF/A documents. To solve this problem, the pdfHTML add-on ships with 12 free fonts.

Fonts shipped with iText

The pdfHTML add-on supports font embedding of three font families out-of-the box: a sans font, a serif font, and a monospaced font family. For each of these font families, four fonts are available: a regular font, a bold font, an italic (or oblique) font, and a bold-italic (or bold-oblique) font.

In the fonts_shipped.html HTML file, we use the font-family: FreeSans, font-family: FreeSerif, and font-family: FreeMono. We could also have used font-family: sans, font-family: serif, and font-family: mono instead; that would have led to the same result. We use different combinations of the font-weight: bold and font-style: italic so that we can show the four fonts of every font family.

Figure 6.3: Fonts shipped with pdfHTML (HTML)

The browser I used to render this HTML page doesn't know where to find the sans and the monospaced font; see figure 6.3.

This is a pity, but the corresponding PDF created with the C06E02_ShippedFonts example looks alright.

Figure 6.4: Fonts shipped with pdfHTML (PDF)

Looking at the Font panel of the Document Properties in figure 6.4, we see that a subset of each of the twelve fonts was embedded, as opposed to Helvetica and Helvetica-Bold which aren't embedded at all.

These 26 fonts, of which only 24 are really useful in the context of HTML, are the only fonts that are supported by default if you don't change the font provider.

That's pretty limited, so let's find out how we can add support for more fonts. For instance: wouldn't it be nice if we had access to all the system fonts that are provided by the operating system we're working on?

System fonts

In the fonts_system.html HTML file, we introduced font families such as Calibri and Verdana. I am writing this tutorial on a Windows machine, and my browser can render the different Calibri and Verdana fonts correctly (see figure 6.5), because the corresponding font programs are available in the C:\Windows\Fonts directory.

Figure 6.5: System fonts (HTML)

By default, pdfHTML uses an instance of the DefaultFontProvider that is created like this:

FontProvider provider = new DefaultFontProvider();

This constructor calls another constructor that takes three Boolean values as parameters. The above line is equivalent to:

FontProvider provider = new DefaultFontProvider(true, true, false);

The Boolean values each cause a certain type of fonts to be registered:

  1. registerStandardPdfFonts: will register the fourteen standard Type 1 fonts,

  2. registerShippedFreeFonts: will register the twelve shipped fonts,

  3. registerSystemFonts: will register the system fonts.

The default value for registerStandardPdfFonts and registerShippedFreeFonts is true, because those fonts require hardly any resources.

The default value for registerSystemFonts is false, because when you set this value to true, iText will search for directories that contain system fonts on your operating system. This has some disadvantages:

  1. Loading and selecting the fonts risks being time-consuming,

  2. We can't control the order in which the fonts are added if we register full directories, and

  3. If by any chance you hit a font with embedding restrictions, you're out of luck.

But let's not worry about that right now, and let's change the font provider to support system fonts anyway. Let's set all the Boolean values to true in the C06E03_SystemFonts example, and see what happens.

public void createPdf(String src, String dest) throws IOException {
    ConverterProperties properties = new ConverterProperties();
    properties.setFontProvider(new DefaultFontProvider(true, true, true));
    HtmlConverter.convertToPdf(new File(src), new File(dest), properties);
}

Now that we have changed the ConverterProperties, we can use almost any font that is in C:\Windows\Fonts. Among those fonts are the fonts of the Calibri and Verdana family. Subsets of these fonts are now embedded in the PDF as shown in figure 6.6.

Figure 6.6: System fonts (PDF)

If you're working on another operating system, for instance Linux, you don't have a C:\Windows\Fonts directory, but that's not a problem.

iText tries to get the fonts directory using environment variables, and on top of that searches for directories such as:

  • /usr/share/X11/fonts,
  • /usr/X/lib/X11/fonts,
  • /usr/openwin/lib/X11/fonts,
  • /usr/share/fonts,
  • /usr/X11R6/lib/X11/fonts,
  • /Library/Fonts, and
  • /System/Library/Fonts.
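
If you're curious which of these directories actually exist on your machine, a small probe like the one below can tell you. This is a sketch using only the JDK; the class FontDirectoryProbe is a hypothetical helper and not the lookup code iText uses internally. The candidate paths are simply the ones listed above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not iText's own lookup: reports which candidate
// font directories are present on the current machine.
final class FontDirectoryProbe {

    // The directories mentioned in the list above.
    static final String[] CANDIDATES = {
        "/usr/share/X11/fonts",
        "/usr/X/lib/X11/fonts",
        "/usr/openwin/lib/X11/fonts",
        "/usr/share/fonts",
        "/usr/X11R6/lib/X11/fonts",
        "/Library/Fonts",
        "/System/Library/Fonts"
    };

    // Returns the candidates that exist and are directories, in the given order.
    static List<Path> existing(String... candidates) {
        List<Path> found = new ArrayList<>();
        for (String candidate : candidates) {
            Path path = Paths.get(candidate);
            if (Files.isDirectory(path)) {
                found.add(path);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        for (Path path : existing(CANDIDATES)) {
            System.out.println(path);
        }
    }
}
```

The output depends entirely on the operating system; on a Windows machine the list will typically be empty.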

Use this system fonts functionality with care. Usually, it's not your best option to give pdfHTML access to all of your system fonts. Let's take a look at some alternative options.

Web Open Font Format fonts

The Web Open Font Format is a font format for use in web pages. WOFF fonts are essentially OpenType or TrueType fonts with compression and additional metadata.

In the fonts_woff.html HTML file, you can see how we define a series of six fonts of the SourceSerifPro font family.

Figure 6.7: Web Open Font Format (HTML)

First we define a @font-face, for instance:

@font-face {
    font-family: "SourceSerifPro-Regular";
    src: url("fonts/SourceSerifPro-Regular.otf.woff") format("woff");
}

Then we define a class, for instance:

.regular {
    font-family: "SourceSerifPro-Regular";
}

Finally, we use this class in our HTML:

<td class="regular">quick brown fox jumps over the lazy dog</td>

We don't have to make any changes to the ConverterProperties in the C06E04_WebOpenFormatFonts example:

public void createPdf(String src, String dest) throws IOException {
    HtmlConverter.convertToPdf(new File(src), new File(dest));
}

The pdfHTML add-on will download the WOFF fonts (shown in the bottom-right corner of figure 6.7) automatically, and embed a subset of those fonts in the PDF as shown in figure 6.8.

Figure 6.8: Web Open Font Format (PDF)

Support for WOFF fonts is especially welcome if you want to convert web pages found in the wild to PDF, but please take into account that your HTML to PDF conversion process risks being slow when using this approach. The fonts are downloaded over a network, and that typically slows things down.

The fastest option is to add selected fonts to the font provider.

Adding selected fonts to the font provider

In the fonts_extra.html HTML file, we write the words "quick brown fox jumps over the lazy dog" three times:

  • Once in a regular font of the font family Cardo,

  • Once in a bold font of the font family Cardo, but if by any chance Cardo-Bold can't be found, a Times font will be used instead,

  • Once in an italic font of the font family Cardo, but if by any chance Cardo-Italic can't be found, a Times font will be used instead.

This is shown in figure 6.9:

Figure 6.9: Extra fonts (HTML)

We adapt the font provider in the C06E05_ExtraFont example:

public static final String FONT = "src/main/resources/fonts/cardo/Cardo-Regular.ttf";
public void createPdf(String src, String font, String dest) throws IOException {
    ConverterProperties properties = new ConverterProperties();
    FontProvider fontProvider = new DefaultFontProvider();
    FontProgram fontProgram = FontProgramFactory.createFont(font);
    fontProvider.addFont(fontProgram);
    properties.setFontProvider(fontProvider);
    HtmlConverter.convertToPdf(new File(src), new File(dest), properties);
}

We go through the following steps:

  • We create a DefaultFontProvider instance.

  • We create a FontProgram by passing the path to the font program Cardo-Regular.ttf to the FontProgramFactory.

  • We add this font program to the font provider, and we set this font provider as a converter property.

This way, we add one selected font to the font provider. As a result, the words "quick brown fox jumps over the lazy dog" for which we defined a regular font will be rendered using the font Cardo-Regular. Since we didn't provide any bold or italic font of the Cardo family, the Standard Type 1 fonts Times-Bold and Times-Italic are used for the other two lines.

Figure 6.10: Extra font (PDF)

Let's fix this. In the C06E06_ExtraFonts example, we don't use the addFont() method to add one font at a time. Instead, we use the addDirectory() method to add three Cardo fonts at once:

public static final String FONTS = "src/main/resources/fonts/cardo/";
public void createPdf(String src, String fonts, String dest) throws IOException {
    ConverterProperties properties = new ConverterProperties();
    FontProvider fontProvider = new DefaultFontProvider();
    fontProvider.addDirectory(fonts);
    properties.setFontProvider(fontProvider);
    HtmlConverter.convertToPdf(new File(src), new File(dest), properties);
}

Since the Cardo directory also contains a Bold and an Italic font of the Cardo font family, iText no longer has to fall back on the Times font family. See figure 6.11.

Figure 6.11: Extra fonts (PDF)

Be careful when adding a directory with a large selection of fonts. The order in which fonts are added to the font provider is important.

Choosing the right order for your font selection

When we talked about using system fonts, we mentioned that we can't control the order in which the fonts are added if we register full directories. We'll find out why this is a disadvantage using the simple hello.html HTML file from chapter 1. See figure 6.12.

Figure 6.12: Hello HTML with no font specified

We'll convert this simple HTML file to PDF twice, using the same createPdf() method, but we'll create a DefaultFontProvider instance that doesn't register any of the Standard Type 1 fonts, doesn't register any of the built-in fonts, nor any of the system fonts. This will exclude the use of fonts such as Helvetica, FreeSans, or any other font as the default font.

We'll add a selection of fonts of which the paths are stored in a String array named fonts; pdfHTML will have to use one of those fonts as default.

public void createPdf(String src, String[] fonts, String dest) throws IOException {
    ConverterProperties properties = new ConverterProperties();
    FontProvider fontProvider = new DefaultFontProvider(false, false, false);
    for (String font : fonts) {
        FontProgram fontProgram = FontProgramFactory.createFont(font);
        fontProvider.addFont(fontProgram);
    }
    properties.setFontProvider(fontProvider);
    HtmlConverter.convertToPdf(new File(src), new File(dest), properties);
}

We'll use the same selection of fonts, NotoSans-Regular.ttf and Cardo-Regular.ttf, in two examples, C06E07_ExtraFontsOrder1 and C06E08_ExtraFontsOrder2, yet the PDFs generated by these two examples will be different.

In the first example, C06E07_ExtraFontsOrder1, we'll create the selection like this:

public static final String[] FONTS = {
    "src/main/resources/fonts/noto/NotoSans-Regular.ttf",
    "src/main/resources/fonts/cardo/Cardo-Regular.ttf"
};

The Noto font is the first element in the array, and it contains all the glyphs needed to render our HTML file to PDF, hence there's no need for the Cardo font. See figure 6.13:

Figure 6.13: Hello PDF with Noto font

In the second example, C06E08_ExtraFontsOrder2, we'll reverse the order of the font paths:

public static final String[] FONTS = {
    "src/main/resources/fonts/cardo/Cardo-Regular.ttf",
    "src/main/resources/fonts/noto/NotoSans-Regular.ttf"
};

Now the Cardo font is the first element in the array, and it too contains all the glyphs needed to render our HTML file to PDF, hence there's no need for the Noto font. See figure 6.14:

Figure 6.14: Hello PDF with Cardo font

These two examples explain an important aspect of the inner workings of pdfHTML. When pdfHTML needs to render a character as a glyph, it will first search for a font name in the HTML, and it will ask the font provider if there's a font available with that name. If no font is found, or if no font name was provided, pdfHTML will loop over the different fonts that are registered to the font provider, in the order in which they were registered. As soon as pdfHTML finds a font that can render the character as a glyph, it will use that font.
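
The selection rule described in this paragraph can be modelled in a few lines of plain Java. The sketch below is a deliberate simplification and not pdfHTML's actual implementation: the class FontSelectionSketch, its nested Font class, and the select() method are all hypothetical, and a font is reduced to a name plus a test that says whether it has a glyph for a given code point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.IntPredicate;

// Simplified model of the rule described above; NOT pdfHTML's real code.
final class FontSelectionSketch {

    // Hypothetical stand-in for a registered font program.
    static final class Font {
        final String name;
        final IntPredicate hasGlyph;

        Font(String name, IntPredicate hasGlyph) {
            this.name = name;
            this.hasGlyph = hasGlyph;
        }
    }

    // Fonts are kept in the order in which they were registered.
    private final List<Font> registered = new ArrayList<>();

    void addFont(Font font) {
        registered.add(font);
    }

    // Step 1: look for a font with the requested name, if one was given.
    // Step 2: otherwise walk the fonts in registration order and take the
    //         first one that can render the character.
    Optional<String> select(String requestedName, int codePoint) {
        if (requestedName != null) {
            for (Font font : registered) {
                if (font.name.equalsIgnoreCase(requestedName)) {
                    return Optional.of(font.name);
                }
            }
        }
        for (Font font : registered) {
            if (font.hasGlyph.test(codePoint)) {
                return Optional.of(font.name);
            }
        }
        return Optional.empty();
    }
}
```

Registering a font named NotoSans before one named Cardo, both covering Latin characters, makes select(null, 'H') return NotoSans; reversing the registration order makes it return Cardo, which mirrors the difference between the two examples.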

When you register a full directory, for instance by including all system fonts, you can't control the order in which the different font programs are added to the font provider. This makes it very hard to predict which font will be used by pdfHTML. This is especially problematic if you write an application that can be migrated to different systems. Different systems may have different system fonts, and this may lead to PDF documents that look completely different because a different font is used. There's also the risk that a font directory contains a font with embedding restrictions. When pdfHTML encounters such a font, an exception will be thrown.
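
One way to keep the order predictable is to collect the font paths yourself, put them in an order you choose, and then register them one by one as we did with the String array named fonts. The sketch below only covers the first part, using the JDK: the class FontFileLister is a hypothetical helper, alphabetical sorting is just one possible deterministic order, and limiting the selection to .ttf and .otf files is an assumption made for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of iText: lists the font files in a
// directory in a fixed (alphabetical) order.
final class FontFileLister {

    static List<String> listFontFiles(Path directory) throws IOException {
        // Files.list returns a stream that must be closed, hence try-with-resources.
        try (Stream<Path> entries = Files.list(directory)) {
            return entries
                    .filter(Files::isRegularFile)
                    .map(Path::toString)
                    .filter(FontFileLister::hasFontExtension)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    // Assumption for this sketch: only .ttf and .otf files are of interest.
    private static boolean hasFontExtension(String path) {
        String lower = path.toLowerCase(Locale.ROOT);
        return lower.endsWith(".ttf") || lower.endsWith(".otf");
    }
}
```

The resulting list can be reordered by hand where needed and converted to an array with toArray(new String[0]) before it is passed on.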

There's one important aspect of fonts that we haven't paid any attention to so far. When we use a font, we map characters in an HTML file to glyphs in a PDF document. The character a can be mapped to different visualisations of the letter a, for instance an 'a' drawn in a regular, a bold, or an italic typeface, or even 'α' or '@' or any other glyph, depending on the encoding that is used.

For more info about fonts and encoding, see Chapter 1: Introducing the PdfFont class in the iText 7: building blocks tutorial.

Choosing the encoding that is right for you

When using Standard Type 1 fonts, iText uses the Winansi encoding for the Helvetica, Times, and Courier font family. The Symbol and the ZapfDingbats fonts have their own custom encoding.

In the case of Winansi encoding, iText creates a simple font. A simple font maps a maximum of 256 characters to 256 glyphs, which means that each character can consist of only one byte. If you want support for more than 256 characters in one font, you need a composite font. For instance: if you use the Identity-H encoding, the characters are stored as Unicode characters.
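
To get a feeling for the limits of a single-byte encoding, we can ask the JDK whether a piece of text fits in it. The sketch below assumes that the JDK charset named windows-1252 is an acceptable stand-in for what PDF calls Winansi; the class SingleByteCheck is hypothetical and has nothing to do with how iText chooses an encoding.

```java
import java.nio.charset.Charset;

// Hypothetical helper: checks whether every character of a text can be
// represented in windows-1252, used here as a stand-in for Winansi.
final class SingleByteCheck {

    static boolean fitsInWinAnsi(String text) {
        Charset winAnsi = Charset.forName("windows-1252");
        return winAnsi.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(fitsInWinAnsi("quick brown fox"));  // Latin text
        System.out.println(fitsInWinAnsi("\u4e2d\u6587"));     // two Chinese characters
    }
}
```

Text that fails this check needs a font with an encoding that can address more than 256 characters.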

Using Unicode, or at least providing a toUnicode mapping, is considered best practice in PDF. It's a requirement for PDF/A Level U, and it's a requirement in terms of accessibility because Unicode mapping allows the retrieval of semantic properties about every character referenced in the file.

The pdfHTML add-on will try to use Unicode whenever possible. That explains why many of the examples show Identity-H for the encoding in the screen shots, except in the cases where Standard Type 1 fonts are used. The Standard Type 1 fonts don't have Unicode support, hence Winansi is used instead.

If you don't agree with the default encoding chosen by iText, you can define your own encoding. See for instance the C06E09_Encoding example, where we use Cardo-Regular just like we did in the previous example, but instead of having iText pick the encoding, we explicitly tell iText to use Winansi:

public static final String FONT = "src/main/resources/fonts/cardo/Cardo-Regular.ttf";
public void createPdf(String src, String font, String dest) throws IOException {
    ConverterProperties properties = new ConverterProperties();
    FontProvider fontProvider = new DefaultFontProvider(false, false, false);
    FontProgram fontProgram = FontProgramFactory.createFont(font);
    fontProvider.addFont(fontProgram, "Winansi");
    properties.setFontProvider(fontProvider);
    HtmlConverter.convertToPdf(new File(src), new File(dest), properties);
}

When we compare the fonts panel of the Document Properties in figure 6.15 with the one in figure 6.14, we now see "Ansi" instead of "Identity-H" for the encoding.

Figure 6.15: Hello PDF with Cardo font; Ansi encoding

This change from a composite font to a simple font also has an impact on the file size.

  • With the (Win)Ansi encoding, each character is stored as a single byte;

  • With the Identity-H encoding, every character is stored as two bytes.

Figure 6.16 shows the difference in file size between the file fonts_cardo.pdf (Identity-H encoding) and the file fonts_encoding.pdf (Ansi encoding).

Figure 6.16: Composite versus Simple font

The difference in file size is limited, because the content streams containing either the single-byte or the double-byte characters are both compressed.
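
The effect of compression can be reproduced outside of a PDF with a small experiment. In the sketch below, ISO-8859-1 stands in for a one-byte-per-character encoding and UTF-16BE for a two-bytes-per-character encoding; neither is what iText actually writes to a content stream, and the class EncodingSizeSketch is hypothetical. The point is only to compare raw and deflated sizes.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical experiment: compares raw and compressed sizes of the same
// text stored with one byte versus two bytes per character.
final class EncodingSizeSketch {

    // One byte per character (stand-in for a simple font encoding).
    static byte[] singleByte(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Two bytes per character (stand-in for a composite font encoding).
    static byte[] twoByte(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE);
    }

    // Size in bytes of the data after Deflate compression.
    static int deflatedSize(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        byte[] buffer = new byte[1024];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < 50; i++) {
            builder.append("quick brown fox jumps over the lazy dog ");
        }
        String text = builder.toString();
        byte[] one = singleByte(text);
        byte[] two = twoByte(text);
        System.out.println("raw:      " + one.length + " vs " + two.length);
        System.out.println("deflated: " + deflatedSize(one) + " vs " + deflatedSize(two));
    }
}
```

Before compression the two-byte version is exactly twice as large; how much of that gap survives compression depends on the text.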

If file size is an issue, you can consider using Winansi encoding instead of Identity-H, but be aware that this comes at a cost. If you want your files to be compliant with current and future standards for long-term preservation or accessibility, it might be better to create files that are slightly bigger in file size, but that use Unicode.

You will also use Unicode if you want to create documents with content in different languages.

Internationalization

The fonts_i18n.html HTML file contains a table with the English title of a movie in the first column, and the title of that same movie in a different language in the second column.

Figure 6.17: Internationalization (HTML code)

We've stored this file using the UTF-8 encoding, and we've clearly indicated in the HTML header that all characters in this HTML file should be treated as UTF-8 characters:

<meta charset="UTF-8">

If you omit this line, you end up with a typical encoding problem as shown in figure 6.18.

Figure 6.18: Internationalization (wrong browser view)

That's definitely not what we want. You can get similar gibberish in iText if you read a UTF-8 file as if it were a plain ASCII file. If you use the correct encoding, you'll see the page as shown in figure 6.19.
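
The mechanism behind this kind of gibberish is easy to reproduce in plain Java. The sketch below encodes a string as UTF-8 and then decodes the bytes with a single-byte charset; ISO-8859-1 is chosen here as an assumption because it turns every byte into a visible character. The class EncodingMismatch is hypothetical and only serves this illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of reading UTF-8 bytes with the wrong charset.
final class EncodingMismatch {

    // Decodes UTF-8 bytes as if each byte were one ISO-8859-1 character.
    static String misread(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Decodes the same bytes with the charset they were written in.
    static String readCorrectly(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String title = "\u00e9"; // the letter e with an acute accent
        System.out.println(misread(title).length());       // 2: one character became two
        System.out.println(readCorrectly(title).length()); // 1
    }
}
```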

Figure 6.19: Internationalization (Browser view)

If we use the C06E10_InternationalizationWrong example to convert this HTML to PDF, our simple createPdf() method won't be sufficient.

public void createPdf(String src, String dest) throws IOException {
    HtmlConverter.convertToPdf(new File(src), new File(dest));
}

There are two major pitfalls with this wrong (!) example. These issues are explained in figure 6.20 and figure 6.21.

When we look at the Hebrew and Arabic text in figure 6.20, we see that characters are rendered as glyphs, but they are rendered in the wrong order. Hebrew and Arabic are written from right to left, and extra processing is needed to detect and apply the correct writing system.

Figure 6.20: Internationalization done wrong (PDF without pdfCalligraph)

Adding content in a "special" writing system requires more CPU, and for reasons of performance, iText doesn't spend that CPU by default. If you want to convert Hebrew, Arabic, or Indic (Hindi, Kannada, Tamil, Telugu,...) content, you need to explicitly include the pdfCalligraph add-on in your CLASSPATH. This will activate the special typography functionality.
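
To check up front whether a piece of text contains right-to-left content at all, the JDK's java.text.Bidi class can help. This sketch has nothing to do with pdfCalligraph itself; the class DirectionCheck is a hypothetical helper that merely tells us whether bidirectional processing would be needed to display a string correctly.

```java
import java.text.Bidi;

// Hypothetical helper based on the JDK only: reports whether a text
// contains characters that require bidirectional processing.
final class DirectionCheck {

    static boolean needsBidi(String text) {
        char[] chars = text.toCharArray();
        return Bidi.requiresBidi(chars, 0, chars.length);
    }

    public static void main(String[] args) {
        System.out.println(needsBidi("Star Wars"));                 // Latin only
        System.out.println(needsBidi("\u05e9\u05dc\u05d5\u05dd"));  // a Hebrew word
    }
}
```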

Once you have pdfCalligraph installed, the text in Hebrew already looks better, but there are still some serious problems with the result as you can see in figure 6.21.

Figure 6.21: Internationalization done wrong (PDF)

The Chinese and Korean titles are still missing, and so are several characters in the Japanese title. The Arabic characters are there, and they are now in the right order, but they are all wrong because the ligatures aren't made. Ligatures are supported out of the box when you use pdfCalligraph, but this add-on uses information that is stored inside the font to create the ligatures. Unfortunately, the built-in fonts don't support Arabic ligatures.

We can solve all of these problems by introducing fonts that support Chinese (such as NotoSansCJKsc-Regular), Japanese (such as NotoSansCJKjp-Regular), Korean (such as NotoSansCJKkr-Regular), Hebrew (such as NotoSansHebrew-Regular), and Arabic (such as NotoNaskhArabic-Regular). You can find all of these fonts in a directory with Noto Sans regular fonts.
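
Before choosing fonts, it can be useful to know which writing systems a document actually contains. The following sketch uses the JDK's Character.UnicodeScript to make an inventory of the scripts in a string. The class ScriptInventory is a hypothetical helper; deciding which font covers which script remains up to you.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper based on the JDK only: lists the Unicode scripts
// that occur in a text, ignoring script-neutral characters such as spaces.
final class ScriptInventory {

    static Set<UnicodeScript> scriptsIn(String text) {
        Set<UnicodeScript> scripts = EnumSet.noneOf(UnicodeScript.class);
        text.codePoints().forEach(codePoint -> {
            UnicodeScript script = UnicodeScript.of(codePoint);
            // COMMON and INHERITED cover punctuation, digits, spaces, and
            // combining marks; they don't identify a writing system.
            if (script != UnicodeScript.COMMON
                    && script != UnicodeScript.INHERITED
                    && script != UnicodeScript.UNKNOWN) {
                scripts.add(script);
            }
        });
        return scripts;
    }

    public static void main(String[] args) {
        // Latin text followed by a Hebrew word and a Chinese character.
        System.out.println(scriptsIn("Star Wars \u05e9\u05dc\u05d5\u05dd \u4e2d"));
    }
}
```

A title that reports HAN, HANGUL, HIRAGANA, HEBREW, or ARABIC is a sign that one of the fonts mentioned above, or a comparable font, has to be registered.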

We'll use these fonts in the C06E11_Internationalization example, to get the result we expect.

public static final String[] FONTS = {
    "src/main/resources/fonts/noto/NotoSans-Regular.ttf",
    "src/main/resources/fonts/noto/NotoSans-Bold.ttf",
    "src/main/resources/fonts/noto/NotoSansCJKsc-Regular.otf",
    "src/main/resources/fonts/noto/NotoSansCJKjp-Regular.otf",
    "src/main/resources/fonts/noto/NotoSansCJKkr-Regular.otf",
    "src/main/resources/fonts/noto/NotoNaskhArabic-Regular.ttf",
    "src/main/resources/fonts/noto/NotoSansHebrew-Regular.ttf"
};
public void createPdf(String src, String[] fonts, String dest) throws IOException {
    ConverterProperties properties = new ConverterProperties();
    FontProvider fontProvider = new DefaultFontProvider(false, false, false);
    for (String font : fonts) {
        FontProgram fontProgram = FontProgramFactory.createFont(font);
        fontProvider.addFont(fontProgram);
    }
    properties.setFontProvider(fontProvider);
    HtmlConverter.convertToPdf(new File(src), new File(dest), properties);
}

We can now compare the screen shot of the PDF we created using pdfCalligraph and the appropriate fonts as shown in figure 6.22, with the HTML page rendered in the browser as shown in figure 6.19.

Figure 6.22: Internationalization (PDF)

Note that we explicitly excluded the Standard Type 1 fonts, the built-in fonts, and the system fonts. We gave complete priority to the Noto fonts. Granted, we didn't use the exact same fonts as were used by the browser, but at least all the characters are there, and the ligatures are made correctly. If we want a better match, we'll need to search for the fonts used by the browser, and add the paths to the corresponding font programs to the font provider.

Summary

In this chapter, we've experimented with different types of fonts. We've learned that only a limited set of fonts is supported by default, but also that we can add support for almost any font we like, provided that we have access to the corresponding font program.

We've also discovered that iText supports writing systems that are different from the Western left-to-right writing system, and that there's support for ligatures (Arabic, Indic,...), but only if we include the pdfCalligraph add-on.

In the next (and final) chapter, we'll look at some frequently asked questions.