Using fonts in PDF and iText

Tags: fonts, tutorial, iText 5

Question:

I can't create a PDF with Cyrillic characters using iText.

I've tried this with Czech characters:

    document.add(new Paragraph("Všechno v pořádku?"));

But certain characters aren't displayed in the PDF:

Všechno v poádku?

As you can see, the 'ř' character is missing from the text. I encounter the same issue when trying to use Cyrillic characters, in which case no characters are displayed at all!

This post is inspired by several StackOverflow questions.

Answer:

I can see several different kinds of problems in that one line of code. Let's start with a simple example containing a few simple phrases in French (F01_Unembedded.java):

    Document document = new Document();
    PdfWriter.getInstance(document, new FileOutputStream(dest));
    document.open();
    document.add(new Paragraph("Vous êtes d'où?"));
    document.add(new Paragraph("À tout à l'heure. À bientôt."));
    document.add(new Paragraph("Je me présente."));
    document.add(new Paragraph("C'est un étudiant."));
    document.add(new Paragraph("Ça va?"));
    document.add(new Paragraph("Il est ingénieur. Elle est médecin."));
    document.add(new Paragraph("C'est une fenêtre."));
    document.add(new Paragraph("Répétez, s'il vous plaît."));
    document.close();

Although the output looks perfect, there are quite a few issues with the code, which I'll explain later.

Using special characters in French

I used Google Translate to translate those sentences into Czech (F02_Unembedded.java):

    Document document = new Document();
    PdfWriter.getInstance(document, new FileOutputStream(dest));
    document.open();
    document.add(new Paragraph("Odkud jste?"));
    document.add(new Paragraph("Uvidíme se za chvilku. Měj se."));
    document.add(new Paragraph("Dovolte, abych se představil."));
    document.add(new Paragraph("To je studentka."));
    document.add(new Paragraph("Všechno v pořádku?"));
    document.add(new Paragraph("On je inženýr. Ona je lékař."));
    document.add(new Paragraph("Toto je okno."));
    document.add(new Paragraph("Zopakujte to prosím."));
    document.close();

If you compare the code with the screenshot below, you'll notice that several characters are missing:

Wrong Czech example: missing characters

How do we solve this problem?

ASCII and special characters

I'd like to talk about something I don't like seeing in code: non-ASCII characters! For example, I don't recommend putting "Vous êtes d'où?" in your code. It's better to use "Vous \u00eates d'o\u00f9?". It's dangerous to assume that the encoding of your source code will stay the same when it is stored, transmitted, and so on. If your source code is converted to ASCII by accident, you'll lose every character with a value higher than 127!

The ASCII encoding contains every character necessary to write English text (values 32-126) along with control characters (0-31 and 127). Accented characters are provided by other standards, such as ISO 8859-1 (also known as Latin-1), Unicode, etc.
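To make the risk described above concrete, here is a minimal sketch that simulates an accidental conversion to ASCII by encoding a string as US-ASCII and decoding it again. The class and method names (AsciiLossDemo, roundTripThroughAscii) are hypothetical and only serve this illustration; the sketch uses nothing but the standard JDK.

```java
import java.nio.charset.StandardCharsets;

class AsciiLossDemo {

    // Encodes the text as US-ASCII bytes and decodes it again,
    // mimicking an accidental conversion of a source file to ASCII.
    static String roundTripThroughAscii(String text) {
        byte[] asciiBytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(asciiBytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Written with escapes so that this file itself stays pure ASCII.
        String original = "Vous \u00eates d'o\u00f9?";
        // Every character above 127 comes back as a question mark.
        System.out.println(roundTripThroughAscii(original));
    }
}
```

Running this should print `Vous ?tes d'o??`: both accented characters are replaced, and nothing in the result tells you which characters were lost.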

When I write Java code, I always convert hard-coded strings containing special characters to Unicode escape notation using the following snippet (F99_ConvertToUnicodeNotation):

    String s = "Vous êtes d'où?";
    System.out.print("\"");
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c > 31 && c < 127)
            System.out.print(String.valueOf(c));
        else
            System.out.print(String.format("\\u%04x", (int)c));
    }
    System.out.println("\"");

The result is the following:

"Vous \u00eates d'o\u00f9?"

This might be overkill for Western languages, but it's most certainly a good idea when dealing with Cyrillic, Japanese, Chinese, Korean, and similar characters.

Fonts and glyphs

The encoding of the characters isn't the main issue with this code sample; otherwise we would have had issues with the French text as well. The reason for the missing Czech characters is simple: we're using a font that doesn't contain glyphs corresponding to those characters. Because we didn't specify a font, iText chose the default font: Helvetica, a "Standard Type 1" font, which doesn't contain any glyphs for non-Western languages.

Embedded fonts

Even worse, we're assuming that every reader or system reading our files will have access to every font. If we use a font without embedding it in the document, there's always a chance that users won't be able to read our document because the font file is missing on their computer.

Fonts and encoding

And finally, we have to keep the encoding of the font in mind. Czech is a Central European language, so we can use the Windows-1250 encoding ("Cp1250").
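Whether a code page can represent a given text can be checked before any PDF is created. The following is a minimal sketch that uses the JDK's own charsets as a stand-in for the font encoding; it doesn't call iText at all. The names EncodingCoverage and unmappable are hypothetical, and the sketch assumes that your JDK ships the "windows-1250" and "windows-1252" charsets (common, but not guaranteed by the Java specification).

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class EncodingCoverage {

    // Returns the characters of the text that the named charset cannot encode.
    static String unmappable(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        StringBuilder missing = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!encoder.canEncode(c)) {
                missing.append(c);
            }
        }
        return missing.toString();
    }

    public static void main(String[] args) {
        String czech = "V\u0161echno v po\u0159\u00e1dku?";
        // Report each character the Western European code page lacks,
        // printed in escape notation so the console encoding doesn't matter.
        for (char c : unmappable(czech, "windows-1252").toCharArray()) {
            System.out.printf("missing in windows-1252: \\u%04x%n", (int) c);
        }
        // The Central European code page should cover the whole sentence.
        System.out.println("missing in windows-1250: "
                + unmappable(czech, "windows-1250").length());
    }
}
```

For the sentence from the question, the only character reported for windows-1252 should be U+0159 (the 'ř'), while windows-1250 reports nothing missing. A check like this only tells you about the encoding; the font file still has to contain the glyphs.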

Let's combine every solution into one code sample (F03_Embedded):

    public static final String FONT = "resources/fonts/FreeSans.ttf";

    Document document = new Document();
    PdfWriter.getInstance(document, new FileOutputStream(dest));
    document.open();
    Font font = FontFactory.getFont(FONT, "Cp1250", BaseFont.EMBEDDED);
    document.add(new Paragraph("Odkud jste?", font));
    document.add(new Paragraph("Uvid\u00edme se za chvilku. M\u011bj se.", font));
    document.add(new Paragraph("Dovolte, abych se p\u0159edstavil.", font));
    document.add(new Paragraph("To je studentka.", font));
    document.add(new Paragraph("V\u0161echno v po\u0159\u00e1dku?", font));
    document.add(new Paragraph("On je in\u017een\u00fdr. Ona je l\u00e9ka\u0159.", font));
    document.add(new Paragraph("Toto je okno.", font));
    document.add(new Paragraph("Zopakujte to pros\u00edm.", font));
    document.close();

Instead of Helvetica, we're using FreeSans, a free font that ships with many Linux distributions. We create a Font object by using the getFont() method of the FontFactory class with the following parameters:

  • the path to the font file

  • the encoding ("Cp1250")

  • a boolean value indicating if the font should be embedded

Note that we've replaced the non-ASCII characters with their Unicode escape notation.

As you can see, the result is correct. Every character / glyph is present:

Correct example in Czech

And what about the Cyrillic characters?

It's frustrating to see how many people copy and paste code samples without understanding what the sample does. For instance, they replace the Czech text with Russian text without changing the encoding (F04_Russian):

    Document document = new Document();
    PdfWriter.getInstance(document, new FileOutputStream(dest));
    document.open();
    Font font = FontFactory.getFont(FONT, "Cp1250", BaseFont.EMBEDDED);
    document.add(new Paragraph("\u041e\u0442\u043a\u0443\u0434\u0430 \u0442\u044b?", font));
    document.add(new Paragraph("\u0423\u0432\u0438\u0434\u0438\u043c\u0441\u044f \u0432 \u043d\u0435\u043c\u043d\u043e\u0433\u043e. \u0423\u0432\u0438\u0434\u0438\u043c\u0441\u044f.", font));
    document.add(new Paragraph("\u041f\u043e\u0437\u0432\u043e\u043b\u044c\u0442\u0435 \u043c\u043d\u0435 \u043f\u0440\u0435\u0434\u0441\u0442\u0430\u0432\u0438\u0442\u044c\u0441\u044f.", font));
    document.add(new Paragraph("\u042d\u0442\u043e \u0441\u0442\u0443\u0434\u0435\u043d\u0442.", font));
    document.add(new Paragraph("\u0425\u043e\u0440\u043e\u0448\u043e?", font));
    document.add(new Paragraph("\u041e\u043d \u0438\u043d\u0436\u0435\u043d\u0435\u0440. \u041e\u043d\u0430 \u0434\u043e\u043a\u0442\u043e\u0440.", font));
    document.add(new Paragraph("\u042d\u0442\u043e \u043e\u043a\u043d\u043e.", font));
    document.add(new Paragraph("\u041f\u043e\u0432\u0442\u043e\u0440\u0438\u0442\u0435, \u043f\u043e\u0436\u0430\u043b\u0443\u0439\u0441\u0442\u0430.", font));
    document.close();

It is obvious that this won't work:

Russian encoding: incorrect

For Eastern European languages, and more specifically for the Cyrillic alphabet, we'll need the Windows-1251 encoding (F05_Russian_correct_encoding):

    Font font = FontFactory.getFont(FONT, "Cp1251", BaseFont.EMBEDDED);

And now we can see the Russian text:

Russian encoding: correct

The problem with encoding...

If you want to mix different languages in one document (e.g. French, Russian, Japanese, ...), you'll need to define a font with a different encoding for each of them (F06_Different_encoding.pdf):

    Document document = new Document();
    PdfWriter.getInstance(document, new FileOutputStream(dest));
    document.open();
    BaseFont bf1 = BaseFont.createFont(FONT, BaseFont.WINANSI, BaseFont.EMBEDDED);
    Font french = new Font(bf1, 12);
    BaseFont bf2 = BaseFont.createFont(FONT, BaseFont.CP1250, BaseFont.EMBEDDED);
    Font czech = new Font(bf2, 12);
    BaseFont bf3 = BaseFont.createFont(FONT, "Cp1251", BaseFont.EMBEDDED);
    Font russian = new Font(bf3, 12);
    document.add(new Paragraph("Vous \u00eates d'o\u00f9?", french));
    document.add(new Paragraph("\u00c0 tout \u00e0 l'heure. \u00c0 bient\u00f4t.", french));
    document.add(new Paragraph("Je me pr\u00e9sente.", french));
    document.add(new Paragraph("C'est un \u00e9tudiant.", french));
    document.add(new Paragraph("\u00c7a va?", french));
    document.add(new Paragraph("Il est ing\u00e9nieur. Elle est m\u00e9decin.", french));
    document.add(new Paragraph("C'est une fen\u00eatre.", french));
    document.add(new Paragraph("R\u00e9p\u00e9tez, s'il vous pla\u00eet.", french));
    document.add(new Paragraph("Odkud jste?", czech));
    document.add(new Paragraph("Uvid\u00edme se za chvilku. M\u011bj se.", czech));
    document.add(new Paragraph("Dovolte, abych se p\u0159edstavil.", czech));
    document.add(new Paragraph("To je studentka.", czech));
    document.add(new Paragraph("V\u0161echno v po\u0159\u00e1dku?", czech));
    document.add(new Paragraph("On je in\u017een\u00fdr. Ona je l\u00e9ka\u0159.", czech));
    document.add(new Paragraph("Toto je okno.", czech));
    document.add(new Paragraph("Zopakujte to pros\u00edm.", czech));
    document.add(new Paragraph("\u041e\u0442\u043a\u0443\u0434\u0430 \u0442\u044b?", russian));
    document.add(new Paragraph("\u0423\u0432\u0438\u0434\u0438\u043c\u0441\u044f \u0432 \u043d\u0435\u043c\u043d\u043e\u0433\u043e. \u0423\u0432\u0438\u0434\u0438\u043c\u0441\u044f.", russian));
    document.add(new Paragraph("\u041f\u043e\u0437\u0432\u043e\u043b\u044c\u0442\u0435 \u043c\u043d\u0435 \u043f\u0440\u0435\u0434\u0441\u0442\u0430\u0432\u0438\u0442\u044c\u0441\u044f.", russian));
    document.add(new Paragraph("\u042d\u0442\u043e \u0441\u0442\u0443\u0434\u0435\u043d\u0442.", russian));
    document.add(new Paragraph("\u0425\u043e\u0440\u043e\u0448\u043e?", russian));
    document.add(new Paragraph("\u041e\u043d \u0438\u043d\u0436\u0435\u043d\u0435\u0440. \u041e\u043d\u0430 \u0434\u043e\u043a\u0442\u043e\u0440.", russian));
    document.add(new Paragraph("\u042d\u0442\u043e \u043e\u043a\u043d\u043e.", russian));
    document.add(new Paragraph("\u041f\u043e\u0432\u0442\u043e\u0440\u0438\u0442\u0435, \u043f\u043e\u0436\u0430\u043b\u0443\u0439\u0441\u0442\u0430.", russian));
    document.close();

Note that this time we used a BaseFont object to create a Font object. This is equivalent to using the FontFactory method as we've done before.

If you examine the fonts inside the generated PDF, you'll find one font for each of the encodings: three different "embedded subsets" of the FreeSans font.

There are a few inconveniences to this approach. The fonts are used as simple fonts: a simple font can't define more than 256 characters. Obviously, this is not enough for languages such as Chinese. Another problem with custom encodings concerns the universal accessibility of the document (PDF/UA). The trend in PDF is to use Unicode and composite fonts.
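To see why one 8-bit encoding per language becomes awkward, consider a small helper that picks an encoding for each piece of text. This is only a sketch under stated assumptions: EncodingPicker and pick are hypothetical names, the candidate list is limited to the three code pages used above, the JDK charsets are used as a stand-in for the font encodings, and the returned strings are meant to match the encoding names iText expects ("Cp1252" for WinAnsi, "Cp1250", "Cp1251", and "Identity-H").

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

class EncodingPicker {

    // Candidate 8-bit encodings: iText-style name -> JDK charset name.
    private static final Map<String, String> CANDIDATES = new LinkedHashMap<>();
    static {
        CANDIDATES.put("Cp1252", "windows-1252"); // Western European (WinAnsi)
        CANDIDATES.put("Cp1250", "windows-1250"); // Central European
        CANDIDATES.put("Cp1251", "windows-1251"); // Cyrillic
    }

    // Returns the first candidate able to encode the whole text,
    // or "Identity-H" when no single candidate code page is enough.
    static String pick(String text) {
        for (Map.Entry<String, String> e : CANDIDATES.entrySet()) {
            if (Charset.forName(e.getValue()).newEncoder().canEncode(text)) {
                return e.getKey();
            }
        }
        return "Identity-H";
    }

    public static void main(String[] args) {
        String french = "C'est une fen\u00eatre.";
        String czech = "Dovolte, abych se p\u0159edstavil.";
        String russian = "\u042d\u0442\u043e \u043e\u043a\u043d\u043e.";
        System.out.println(pick(french));
        System.out.println(pick(czech));
        System.out.println(pick(russian));
        // A single paragraph mixing all three fits none of the code pages.
        System.out.println(pick(french + " " + czech + " " + russian));
    }
}
```

With these inputs the sketch should print Cp1252, Cp1250, Cp1251, and finally Identity-H. As soon as a single paragraph mixes the three languages, none of the code pages is sufficient, which is one more argument for Unicode.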

Unicode

Instead of using the WinAnsi, Windows-1250, Windows-1251, ... encodings, we can use Unicode (F07_Unicode):

    Document document = new Document();
    PdfWriter.getInstance(document, new FileOutputStream(dest));
    document.open();
    Font font = FontFactory.getFont(FONT, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
    document.add(new Paragraph("Vous \u00eates d'o\u00f9?", font));
    document.add(new Paragraph("\u00c0 tout \u00e0 l'heure. \u00c0 bient\u00f4t.", font));
    document.add(new Paragraph("Je me pr\u00e9sente.", font));
    document.add(new Paragraph("C'est un \u00e9tudiant.", font));
    document.add(new Paragraph("\u00c7a va?", font));
    document.add(new Paragraph("Il est ing\u00e9nieur. Elle est m\u00e9decin.", font));
    document.add(new Paragraph("C'est une fen\u00eatre.", font));
    document.add(new Paragraph("R\u00e9p\u00e9tez, s'il vous pla\u00eet.", font));
    document.add(new Paragraph("Odkud jste?", font));
    document.add(new Paragraph("Uvid\u00edme se za chvilku. M\u011bj se.", font));
    document.add(new Paragraph("Dovolte, abych se p\u0159edstavil.", font));
    document.add(new Paragraph("To je studentka.", font));
    document.add(new Paragraph("V\u0161echno v po\u0159\u00e1dku?", font));
    document.add(new Paragraph("On je in\u017een\u00fdr. Ona je l\u00e9ka\u0159.", font));
    document.add(new Paragraph("Toto je okno.", font));
    document.add(new Paragraph("Zopakujte to pros\u00edm.", font));
    document.add(new Paragraph("\u041e\u0442\u043a\u0443\u0434\u0430 \u0442\u044b?", font));
    document.add(new Paragraph("\u0423\u0432\u0438\u0434\u0438\u043c\u0441\u044f \u0432 \u043d\u0435\u043c\u043d\u043e\u0433\u043e. \u0423\u0432\u0438\u0434\u0438\u043c\u0441\u044f.", font));
    document.add(new Paragraph("\u041f\u043e\u0437\u0432\u043e\u043b\u044c\u0442\u0435 \u043c\u043d\u0435 \u043f\u0440\u0435\u0434\u0441\u0442\u0430\u0432\u0438\u0442\u044c\u0441\u044f.", font));
    document.add(new Paragraph("\u042d\u0442\u043e \u0441\u0442\u0443\u0434\u0435\u043d\u0442.", font));
    document.add(new Paragraph("\u0425\u043e\u0440\u043e\u0448\u043e?", font));
    document.add(new Paragraph("\u041e\u043d \u0438\u043d\u0436\u0435\u043d\u0435\u0440. \u041e\u043d\u0430 \u0434\u043e\u043a\u0442\u043e\u0440.", font));
    document.add(new Paragraph("\u042d\u0442\u043e \u043e\u043a\u043d\u043e.", font));
    document.add(new Paragraph("\u041f\u043e\u0432\u0442\u043e\u0440\u0438\u0442\u0435, \u043f\u043e\u0436\u0430\u043b\u0443\u0439\u0441\u0442\u0430.", font));
    document.close();

In this sample, we use the same font (FreeSans.ttf), but we create the Font object using BaseFont.IDENTITY_H as the encoding parameter. This time, there's only one font in our PDF. The embedded font is also a composite font, which can contain 65535 characters, far more than the 256 characters a simple font can contain.

A quick "Did you know?"

The BaseFont.NOT_EMBEDDED parameter is ignored when combined with the BaseFont.IDENTITY_H parameter. This is shown in the last example of this post (F08_Unicode):

    Font font = FontFactory.getFont(
        FONT, BaseFont.IDENTITY_H, BaseFont.NOT_EMBEDDED);

The font has been embedded, despite the BaseFont.NOT_EMBEDDED parameter. In this case, iText chooses to ignore this parameter because a PDF created with the Identity-H encoding without embedding the font violates the PDF specification:

Section 9.7.5.2:

The Identity-H and Identity-V CMaps shall not be used with a non-embedded font.

These simple examples explain, in a nutshell, how to avoid some of the most common mistakes when using special fonts in the iText PDF creation process.