How to export Vietnamese text to PDF using iText?

Tags: languages, text extraction, parsing HTML, iText 5

I'm facing a problem when trying to export a Vietnamese document as PDF using iText. I put the Vietnamese words in an .xml file like this:

<td fontfamily="Helvetica" fontstyle="0" fontsize="9" align="0" colspan="48" lineoccupied="1">
    T\u1ED5 ch\u1EE9c tham gia
I convert this into Unicode and export the String to PDF using UTF-8 encoding, but the program fails to display the Vietnamese characters '\u1ED5' and '\u1EE9', and the output becomes "T chc tham gia".

Posted on StackOverflow on Feb 28, 2014 by NTLC

There are several XML Worker examples involving Asian languages on the official iText web site. They parse an XHTML file containing Chinese characters, but it should be easy to adapt them to Vietnamese examples.

You can find the HTML files we're going to parse here:

Both files contain the following text:

長空 (Broken Sword), 秦王殘劍 (Flying Snow), 飛雪 (Moon), 如月 (the King), and 秦王 (Sky).

In the first case, a font is defined using CSS:

<span style="font-size:12.0pt; font-family:MS Mincho">長空</span>

In the second case, no specific font is defined:

<body><p>長空 (Broken Sword), 秦王殘劍 (Flying Snow), 飛雪 (Moon), 如月 (the King), and 秦王 (Sky).</p></body>

These files contain UTF-8 characters, so we're going to parse them like this:

XMLWorkerHelper.getInstance().parseXHtml(writer, document,
            new FileInputStream(HTML), Charset.forName("UTF-8"));
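Passing the charset explicitly matters because the same bytes decode to different characters under a different charset. The following is a minimal sketch that uses only the JDK and doesn't touch iText; the class name CharsetDemo and its roundTrip method are made up for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDemo {
    // Encode the text as UTF-8, then decode those bytes with the given charset.
    static String roundTrip(String text, Charset decodeWith) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, decodeWith);
    }

    public static void main(String[] args) {
        String text = "T\u1ED5 ch\u1EE9c tham gia";
        // Decoding with the charset that was used for encoding restores the text.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text));
        // Decoding with another charset turns each multi-byte character into several wrong ones.
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1).equals(text));
    }
}
```

If the Vietnamese characters are already wrong before any PDF is created, the input was probably read with the wrong charset. In that case no font will fix the output.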

The first thing you need is a font that supports Vietnamese characters. That's something iText can't help you with. In your file, you've defined Helvetica, but that's a standard Type 1 font that iText never embeds and that doesn't know how to draw Vietnamese glyphs such as '\u1ED5' and '\u1EE9'. That's never going to work.
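To see why exactly those two characters go missing, here is a rough sketch that uses only the JDK. It assumes Helvetica is used with the WinAnsi (Cp1252) encoding, which is the usual default for the standard Type 1 fonts, and it only imitates the effect; it isn't what iText does internally. The class WinAnsiCheck and the method keepEncodable are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class WinAnsiCheck {
    // Keep only the characters that Cp1252 (WinAnsi) is able to encode.
    static String keepEncodable(String text) {
        CharsetEncoder encoder = Charset.forName("windows-1252").newEncoder();
        StringBuilder kept = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (encoder.canEncode(c)) {
                kept.append(c);
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        // U+1ED5 and U+1EE9 have no slot in Cp1252, so they are dropped.
        System.out.println(keepEncodable("T\u1ED5 ch\u1EE9c tham gia"));
    }
}
```

The string that's left over matches the "T chc tham gia" from the question. This suggests that the text itself arrives intact and that the font and its encoding are the problem.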

The first example D07_ParseHtmlAsian will automatically search for a font named MS Mincho. If it finds that font (for instance because you have msmincho.ttc in your Windows fonts directory), the font will show up in your PDF. See hero.pdf. If it doesn't find a font with that name, then the glyphs won't be visible, because you didn't provide any font program for those glyphs.

The second example, D07bis_ParseHtmlAsian, offers a workaround in case you don't have MS Mincho anywhere. In that case, you have to use an XMLWorkerFontProvider and register a font that can be used instead of MS Mincho. For instance, we use a font stored in the file cfmingeb.ttf and assign it the alias MS Mincho:

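// Don't scan the system font directories; only use fonts that are registered explicitly.
XMLWorkerFontProvider fontProvider = new XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS);
// Register the font file under the alias that the HTML asks for.
fontProvider.register("resources/fonts/cfmingeb.ttf", "MS Mincho");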

The resulting file asian.pdf is slightly different from what we expect, but now we can at least see the Chinese glyphs.

In the third example, D07tris_ParseHtmlAsian, the HTML file doesn't tell us anything about the font that needs to be used. We'll define the font using CSS like this:

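CSSResolver cssResolver = new StyleAttrCSSResolver();
// Wrap the CSS snippet that sets the font family for the whole body.
CssFile cssFile = XMLWorkerHelper.getCSS(
    new ByteArrayInputStream("body {font-family:tsc fming s tt}".getBytes()));
// Add the CSS to the resolver so that it is applied during parsing.
cssResolver.addCss(cssFile);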

Now, all the text in the body will use the font TSC FMing S TT (stored in the file cfmingeb.ttf). You can see the difference in the resulting PDF asian2.pdf.