Chapter 5: Custom tag workers and CSS appliers

In this chapter, we'll change two of the most important internal mechanisms of the pdfHTML add-on.

  • We'll override the default functionality that matches HTML tags with iText objects, more specifically the DefaultTagWorkerFactory mechanism, and

  • We'll override the default functionality that matches CSS styles to iText styles, more specifically the DefaultCssApplierFactory mechanism.

Some of the examples will be rather artificial, but by examining them, you'll get a better insight into the inner workings of pdfHTML.

Changing the behavior of a tag

Up until now, we've always implicitly used the DefaultTagWorkerFactory class. This is a class that implements the ITagWorkerFactory interface. There is a single method in this factory interface: getTagWorker(IElementNode tag, ProcessorContext context).

  • The IElementNode object can give you information about the tag that is being processed (e.g. the name of the tag);

  • The ProcessorContext can give you access to the PdfDocument and several other converter properties.

This getTagWorker() method returns an ITagWorker instance.

The DefaultTagWorkerFactory implements the getTagWorker() method, and uses the DefaultTagWorkerMapping object to map the name of a tag to an ITagWorker instance. This mapping is stored in a TagProcessorMapping instance.

These are some examples of the default tag worker mapping:

  • The <p>-tag is mapped to the PTagWorker class,

  • The <span>-tag is mapped to the SpanTagWorker class,

  • The <a>-tag is mapped to the ATagWorker class, which extends the SpanTagWorker class,

  • The <b>-tag and <i>-tag are also mapped to the SpanTagWorker class. Those tags are considered to be special types of <span> tags.

  • And so on.

PTagWorker, SpanTagWorker, ATagWorker, and so on all implement the ITagWorker interface. More specifically, they implement the following four methods:

  1. processContent() – processes whatever text content is present inside the open and close tag of the element,

  2. processTagChild() – processes the other tags nested inside the open and close tag of the element,

  3. processEnd() – contains code that is executed after everything else is processed, and

  4. getElementResult() – can be used to retrieve the final result, an IPropertyContainer instance.

These tag worker classes are responsible for creating and populating iText objects. For instance: the PTagWorker class has a Paragraph object as a member-variable. When a <p> tag is encountered, the following steps take place:

  1. An instance of the Paragraph member-variable is created in the constructor of the PTagWorker object,

  2. Content is gathered in the processContent() and processTagChild() methods,

  3. The Paragraph is finalized in the processEnd() method, and

  4. The finalized Paragraph is returned by the getElementResult() method.
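
The four steps above can be sketched in plain Java. This is only a simplified model to make the order of the calls visible: SketchTagWorker, SketchParagraphWorker, and LifecycleDemo are hypothetical names, the method signatures are reduced to plain strings, and none of these classes are part of pdfHTML.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, reduced counterpart of ITagWorker (no ProcessorContext, no iText objects).
interface SketchTagWorker {
    boolean processContent(String content);
    boolean processTagChild(SketchTagWorker child);
    void processEnd();
    String getElementResult();
}

// Plays the role of PTagWorker: gathers content and exposes a finished "paragraph".
class SketchParagraphWorker implements SketchTagWorker {
    // Step 1: the container is created together with the worker.
    private final List<String> pieces = new ArrayList<>();
    private String result;

    // Step 2a: text found between the open and close tag.
    public boolean processContent(String content) {
        pieces.add(content.trim());
        return true;
    }

    // Step 2b: the result of a nested tag, taken from its own worker.
    public boolean processTagChild(SketchTagWorker child) {
        pieces.add(child.getElementResult());
        return true;
    }

    // Step 3: finalize the container once everything has been gathered.
    public void processEnd() {
        result = String.join(" ", pieces);
    }

    // Step 4: hand over the finished result.
    public String getElementResult() {
        return result;
    }
}

class LifecycleDemo {
    static String run() {
        // A nested element is completely processed before its parent receives it.
        SketchTagWorker nested = new SketchParagraphWorker();
        nested.processContent("Colossal");
        nested.processEnd();

        SketchTagWorker paragraph = new SketchParagraphWorker();
        paragraph.processContent("Now showing: ");
        paragraph.processTagChild(nested);
        paragraph.processEnd();
        return paragraph.getElementResult();
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "Now showing: Colossal"
    }
}
```

Note that in this sketch the result is only available after processEnd() has been called, which mirrors the role of step 3 in the list above.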

In the C05E01_ATagAsSpan example, we will take an example from chapter 2 (the 2_inline_css.html HTML file), but we'll change the tag worker factory in such a way that the <a>-tag is treated as a <span> tag. Figure 5.1 shows the resulting PDF.

Figure 5.1: A PDF with a link that isn't a link anymore

This document looks exactly like the result we had in chapter 2, but when we try clicking the IMDB link in the actual PDF, nothing happens. The link in the HTML isn't a link in the PDF anymore because we changed the default tag worker factory with the setTagWorkerFactory() method of the ConverterProperties:

public void createPdf(String src, String dest) throws IOException {
    ConverterProperties converterProperties = new ConverterProperties();
    converterProperties.setTagWorkerFactory(
        new DefaultTagWorkerFactory() {
            @Override
            public ITagWorker getCustomTagWorker(
                IElementNode tag, ProcessorContext context) {
                if ("a".equalsIgnoreCase(tag.name()) ) {
                    return new SpanTagWorker(tag, context);
                }
                return null;
            }
        } );
    HtmlConverter.convertToPdf(new File(src), new File(dest), converterProperties);
}

We could, of course, create a completely new implementation of the ITagWorkerFactory interface, but that would be a tremendous amount of work. We want to benefit from as much existing pdfHTML functionality as possible, so that tags such as <div>, <h1>, and so on, are rendered correctly.

We can do so by reusing the functionality that is already present in the DefaultTagWorkerFactory. The DefaultTagWorkerFactory has a method named getCustomTagWorker() that always returns null. This method is always the first method that is invoked by the DefaultTagWorkerFactory's implementation of the getTagWorker() method.

  • If the getCustomTagWorker() method returns null (which is always the case unless you override the method), the default tag worker mapping is used. For instance: if an <a>-tag is encountered, an ATagWorker instance is returned.

  • If the getCustomTagWorker() method doesn't return null (which can be the case in our overridden version of the DefaultTagWorkerFactory), the default mapping is ignored. For instance: if an <a>-tag is encountered in our example, a SpanTagWorker instance is returned instead of the ATagWorker you'd expect.
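
The lookup order described in these two bullets can be modeled in a few lines of plain Java. This is a minimal sketch, assuming that workers can be represented by their class names as strings; SketchDefaultFactory, ATagAsSpanFactory, and FallbackDemo are hypothetical names and don't reproduce the actual pdfHTML source code.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Simplified model of a factory with a default mapping and an extension hook.
class SketchDefaultFactory {
    private final Map<String, String> defaultMapping = new HashMap<>();

    SketchDefaultFactory() {
        // A few entries taken from the default mapping listed earlier in this chapter.
        defaultMapping.put("p", "PTagWorker");
        defaultMapping.put("span", "SpanTagWorker");
        defaultMapping.put("a", "ATagWorker");
    }

    // Extension hook: returns null unless a subclass overrides it.
    protected String getCustomTagWorker(String tagName) {
        return null;
    }

    // The hook is consulted first; the default mapping is used only if it returns null.
    public final String getTagWorker(String tagName) {
        String custom = getCustomTagWorker(tagName);
        if (custom != null) {
            return custom;
        }
        return defaultMapping.get(tagName.toLowerCase(Locale.ROOT));
    }
}

// Counterpart of the anonymous subclass in the C05E01_ATagAsSpan example.
class ATagAsSpanFactory extends SketchDefaultFactory {
    @Override
    protected String getCustomTagWorker(String tagName) {
        return "a".equalsIgnoreCase(tagName) ? "SpanTagWorker" : null;
    }
}

class FallbackDemo {
    public static void main(String[] args) {
        System.out.println(new SketchDefaultFactory().getTagWorker("a")); // prints "ATagWorker"
        System.out.println(new ATagAsSpanFactory().getTagWorker("a"));    // prints "SpanTagWorker"
        System.out.println(new ATagAsSpanFactory().getTagWorker("p"));    // prints "PTagWorker"
    }
}
```

Because the subclass returns null for every tag except <a>, all other tags keep their default behavior. That is the reason we extend the default factory instead of implementing the factory interface from scratch.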

We didn't override any CSS appliers (yet), which explains why the word IMDB is underlined and rendered in blue, but since we treat the <a>-tag as if it were a <span>-tag from a functional and structural point of view, no link is added. This example was written to explain the inner workings of pdfHTML. What we have done in this first example could easily be perceived as the introduction of a bug.

This doesn't mean that there aren't any useful use cases. We could for instance extend the DefaultTagWorkerFactory to support custom tags.

Introducing custom tags

Suppose that we want to send an invitation letter to different people. We create this invitation letter in HTML (see invitation.html), but we have introduced two custom tags that don't exist in HTML: <name> and <date>:

<html>
    <head>
        <title>Invitation to SXSW 2018</title>
    </head>
    <body>
        <u><b>Re: Invitation</b></u>
        <br>
        <p>Dear <name>SXSW visitor</name>,
        we hope you had a great SXSW film festival experience last year.
        And we would like to invite you to the next edition of SXSW Film
        that takes place from March 9 until March 17, 2018.</p>
        <p>Sincerely,<br>
        The SXSW crew<br>
        <date>August 4, 2017</date></p>
    </body>
</html>

We convert this HTML file to PDF in the C05E02_Invitation example using the createPdf() method:

public void createPdf(String src, String dest) throws IOException {
    HtmlConverter.convertToPdf(new File(src), new File(dest));
}

If we examine figure 5.2, we see "SXSW visitor" and "August 4, 2017" as regular text. There is no apparent difference with the rest of the content. The unknown tags are treated as ordinary <span> tags.

Figure 5.2: Custom tags are treated as <span>-tags

In the main() method of the C05E03_Invitations example, we take the same HTML file, but we use it to create three different PDF files:

String[] names = {"Bruno Lowagie", "Ingeborg Willaert", "John Doe"};
int counter = 1;
for (String name : names) {
    app.createPdf(name, SRC, String.format(DEST, counter++));
}

In the createPdf() method of this adapted example, we now create a variation on the DefaultTagWorkerFactory that returns a SpanTagWorker for the tags <name> and <date>. We override the processContent() method of these SpanTagWorker instances so that the actual content (the content parameter) is ignored. Instead, a name or today's date is passed as content.

public void createPdf(String name, String src, String dest) throws IOException {
    SimpleDateFormat sdf = new SimpleDateFormat("MMMM d, yyyy", Locale.ENGLISH);
    ConverterProperties converterProperties = new ConverterProperties();
    converterProperties.setTagWorkerFactory(
        new DefaultTagWorkerFactory() {
            @Override
            public ITagWorker getCustomTagWorker(
                IElementNode tag, ProcessorContext context) {
                if ("name".equalsIgnoreCase(tag.name()) ) {
                    return new SpanTagWorker(tag, context) {
                        @Override
                        public boolean processContent(
                            String content, ProcessorContext context) {
                            return super.processContent(name, context);
                        }
                    };
                }
                else if ("date".equalsIgnoreCase(tag.name()) ) {
                    return new SpanTagWorker(tag, context) {
                        @Override
                        public boolean processContent(
                            String content, ProcessorContext context) {
                            return super.processContent(
                                sdf.format(new Date()), context);
                        }
                    };
                }
                return null;
            }
        } );
    HtmlConverter.convertToPdf(new File(src), new File(dest), converterProperties);
}

This code will cause the default content ("SXSW visitor" and "August 4, 2017") to be replaced by specific names and today's date. See figure 5.3 for the resulting PDFs.

Figure 5.3: Custom tags are used to insert custom data
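
To check the substitution logic without running a conversion, we can isolate it in plain Java. The sketch below reuses the date pattern and locale from the createPdf() method; the InvitationSketch class and its contentFor() helper are hypothetical and only mimic what the overridden processContent() methods do with the original content of a tag.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

class InvitationSketch {
    // Same pattern and locale as in the createPdf() method of the example.
    static String formatDate(Date date) {
        return new SimpleDateFormat("MMMM d, yyyy", Locale.ENGLISH).format(date);
    }

    // Mimics the overridden processContent(): for <name> and <date>,
    // the text found in the HTML is discarded and replaced.
    static String contentFor(String tagName, String originalContent, String name, Date today) {
        if ("name".equalsIgnoreCase(tagName)) {
            return name;
        }
        if ("date".equalsIgnoreCase(tagName)) {
            return formatDate(today);
        }
        return originalContent; // any other tag keeps its own content
    }

    public static void main(String[] args) {
        // A fixed date keeps the output reproducible; the example itself uses new Date().
        Date fixed = new GregorianCalendar(2017, Calendar.AUGUST, 4).getTime();
        System.out.println(contentFor("name", "SXSW visitor", "John Doe", fixed));
        System.out.println(contentFor("date", "placeholder date", "John Doe", fixed));
    }
}
```

Passing Locale.ENGLISH explicitly matters here: without it, the month name would follow the default locale of the machine that runs the conversion.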

In some cases, it won't be possible to reuse one of the existing pdfHTML ITagWorker implementations. We'll have to create our own tag worker.

Creating your own ITagWorker implementation

Suppose that we want to create a PDF document that shows a QR code, for instance based on the following qrcode.html HTML file:

<html>
<head>
    <meta charset="UTF-8">
    <title>QRCode Example</title>
    <link rel="stylesheet" type="text/css" href="css/qrcode.css"/>
</head>
<body>
    <h1>SXXpress codes</h1>
    <p>Bruno Lowagie has a South by Express pass for the following movies:</p>
    <p>Colossal</p>
    <qr charset="Cp437" errorcorrection="Q">
    Film: Colossal; Date: Friday, March 10; Time: 6:15 PM; Place: Alamo Lamar D
    </qr>
    <p>Mr. Roosevelt</p>
    <qr charset="Cp437" errorcorrection="L">
    Film: Mr. Roosevelt; Date: Sunday, March 12; Time: 2:15 PM; Place: Paramount Theatre
    </qr>
</body>
</html>

There is no such thing as a <qr>-tag in HTML, so when we open this HTML file in a browser, we only see the text. However, we can use this HTML file to create a PDF document with QR codes instead of the text. See figure 5.4.

Figure 5.4: Creating QRCodes from HTML

The text in the HTML file has a red border because that's how we defined the style of the <qr>-tag in the qrcode.css CSS file:

qr {
    border:solid 1px red;
    height:200px;
    width:200px;
}

Now let's examine the C05E04_QRCode example to see how the PDF with the QR Code bar codes was created.

This time, we create a QRCodeTagWorkerFactory that extends the DefaultTagWorkerFactory class:

class QRCodeTagWorkerFactory extends DefaultTagWorkerFactory {
    @Override
    public ITagWorker getCustomTagWorker(IElementNode tag, ProcessorContext context) {
        if(tag.name().equals("qr")){
            return new QRCodeTagWorker(tag, context);
        }
        return null;
    }
}

If a <qr>-tag is encountered, we map it to a QRCodeTagWorker. This QRCodeTagWorker needs to implement the ITagWorker interface:

  1. static class QRCodeTagWorker implements ITagWorker {
  2.     private static String[] allowedErrorCorrection =
  3.         {"L","M","Q","H"};
  4.     private static String[] allowedCharset =
  5.         {"Cp437","Shift_JIS","ISO-8859-1","ISO-8859-16"};
  6.     private BarcodeQRCode qrCode;
  7.     private Image qrCodeAsImage;
  8.  
  9.     public QRCodeTagWorker(IElementNode element, ProcessorContext context){
  10.         Map<EncodeHintType, Object> hints = new HashMap<>();
  11.         String charset = element.getAttribute("charset");
  12.         if(checkCharacterSet(charset)){
  13.             hints.put(EncodeHintType.CHARACTER_SET, charset);
  14.         }
  15.         String errorCorrection = element.getAttribute("errorcorrection");
  16.         if(checkErrorCorrectionAllowed(errorCorrection)){
  17.             ErrorCorrectionLevel errorCorrectionLevel =
  18.                 getErrorCorrectionLevel(errorCorrection);
  19.             hints.put(EncodeHintType.ERROR_CORRECTION, errorCorrectionLevel);
  20.         }
  21.         qrCode = new BarcodeQRCode("placeholder",hints);
  22.     }
  23.  
  24.     @Override
  25.     public boolean processContent(String content, ProcessorContext context) {
  26.         qrCode.setCode(content);
  27.         return true;
  28.     }
  29.  
  30.     @Override
  31.     public boolean processTagChild(
  32.         ITagWorker childTagWorker, ProcessorContext context) {
  33.         return false;
  34.     }
  35.  
  36.     @Override
  37.     public void processEnd(IElementNode element, ProcessorContext context) {
  38.         qrCodeAsImage = new Image(qrCode.createFormXObject(context.getPdfDocument()));
  39.     }
  40.  
  41.     @Override
  42.     public IPropertyContainer getElementResult() {
  43.         return qrCodeAsImage;
  44.     }
  45.  
  46.     private static boolean checkErrorCorrectionAllowed(String toCheck){
  47.         for(int i = 0; i<allowedErrorCorrection.length;i++){
  48.             if(allowedErrorCorrection[i].equalsIgnoreCase(toCheck)){ // also false for a missing attribute
  49.                 return true;
  50.             }
  51.         }
  52.         return false;
  53.     }
  54.  
  55.     private static boolean checkCharacterSet(String toCheck){
  56.         for(int i = 0; i<allowedCharset.length;i++){
  57.             if(allowedCharset[i].equals(toCheck)){ // also false for a missing attribute
  58.                 return true;
  59.             }
  60.         }
  61.         return false;
  62.     }
  63.  
  64.     private static ErrorCorrectionLevel getErrorCorrectionLevel(String level){
  65.         switch(level.toUpperCase()) { // accept the same lowercase values as line 48
  66.             case "L":
  67.                 return ErrorCorrectionLevel.L;
  68.             case "M":
  69.                 return ErrorCorrectionLevel.M;
  70.             case "Q":
  71.                 return ErrorCorrectionLevel.Q;
  72.             case "H":
  73.                 return ErrorCorrectionLevel.H;
  74.         }
  75.         return null;
  76.     }
  77. }

The static String arrays in lines 2 to 5 contain the allowed values for the hints that can be passed to the constructor of the BarcodeQRCode class, which is an iText class shipped with the barcodes jar. If you look at the qrcode.html HTML file, you can see that we pass these values using the attributes charset and errorcorrection.

In lines 6 and 7, we see two member-variables: one is a BarcodeQRCode instance; the other is an Image instance. The BarcodeQRCode object is a low-level object that knows how to draw a QR code, but eventually, we'll need a result that is an instance of the IPropertyContainer interface. To get such a result, we'll wrap the bar code inside an Image object. Rest assured: this operation won't change the QR code into a raster image; it won't reduce the resolution and legibility of the QR code. The Image class is perfectly capable of storing vector images such as bar codes created with iText's barcode functionality.

Lines 9 to 22 show the constructor of our custom QR code tag worker. In this constructor, we first create a Map with "hints". We retrieve the value of these hints from the attributes of the <qr>-tag. We retrieve the charset (lines 11-14) and we retrieve the errorcorrection (lines 15-20). We only accept allowed values, which we check with the checkCharacterSet() method (lines 55-62) and the checkErrorCorrectionAllowed() method (lines 46-53). We convert the errorcorrection attribute to an ErrorCorrectionLevel using the getErrorCorrectionLevel() method (lines 64-76). Once we've processed the attributes, we create a BarcodeQRCode instance. Since we don't know the content of the barcode yet, we use "placeholder" as value for the code (line 21).
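
The attribute handling of the constructor can also be looked at in isolation. The following sketch is a plain-Java approximation in which the keys "CHARACTER_SET" and "ERROR_CORRECTION" are ordinary strings standing in for the EncodeHintType constants; QrHintSketch and buildHints() are hypothetical names. We assume that a missing or unsupported attribute should simply result in no hint.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class QrHintSketch {
    // The allowed values are the ones from the QRCodeTagWorker listing.
    private static final List<String> ALLOWED_ERROR_CORRECTION =
        Arrays.asList("L", "M", "Q", "H");
    private static final List<String> ALLOWED_CHARSET =
        Arrays.asList("Cp437", "Shift_JIS", "ISO-8859-1", "ISO-8859-16");

    // Builds the hints from raw attribute values; null stands for a missing attribute.
    static Map<String, String> buildHints(String charset, String errorCorrection) {
        Map<String, String> hints = new HashMap<>();
        if (charset != null && ALLOWED_CHARSET.contains(charset)) {
            hints.put("CHARACTER_SET", charset);
        }
        if (errorCorrection != null) {
            // Normalize once, so that the check and the stored value agree.
            String level = errorCorrection.toUpperCase(Locale.ROOT);
            if (ALLOWED_ERROR_CORRECTION.contains(level)) {
                hints.put("ERROR_CORRECTION", level);
            }
        }
        return hints;
    }

    public static void main(String[] args) {
        System.out.println(buildHints("Cp437", "q").size());  // prints 2
        System.out.println(buildHints(null, "X").isEmpty());  // prints true
    }
}
```

Normalizing the error correction level in a single place avoids a situation in which a value passes the check in one spelling but is looked up in another.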

We replace this "placeholder" with the actual content in the processContent() method (lines 24-28). We don't expect the <qr>-tag to have any nested tags, hence the processTagChild() method can simply return false (lines 30-34). At the end of the process, in the processEnd() method, we wrap the qrCode object in an Image object (lines 36-39). As opposed to the BarcodeQRCode object, the Image class implements the IPropertyContainer interface. We return this IPropertyContainer object in the getElementResult() method (lines 41-44).

We have successfully implemented the four methods of the ITagWorker interface, and we can now use this QRCodeTagWorker in our custom QRCodeTagWorkerFactory.

We use this custom QRCodeTagWorkerFactory as one of the ConverterProperties in the createPdf() method:

public void createPdf(String src, String dest) throws IOException {
    ConverterProperties properties = new ConverterProperties();
    properties
        .setCssApplierFactory(new QRCodeTagCssApplierFactory())
        .setTagWorkerFactory(new QRCodeTagWorkerFactory());
    HtmlConverter.convertToPdf(new File(src), new File(dest), properties);
}

As you can tell from this snippet, we use the setTagWorkerFactory() method to extend the tag processing mechanism, but we also use the setCssApplierFactory() method to extend the CSS functionality.

This is necessary in this example, because we defined a border color, a width, and a height for the <qr>-tag using CSS. However, since <qr> is a custom tag, iText doesn't know which implementation of the ICssApplier to use when such a tag is encountered. We can easily fix this by overriding the getCustomCssApplier() method of the DefaultCssApplierFactory class:

class QRCodeTagCssApplierFactory extends DefaultCssApplierFactory {
    @Override
    public ICssApplier getCustomCssApplier(IElementNode tag) {
        if (tag.name().equals("qr")) {
            return new BlockCssApplier();
        }
        return null;
    }
}

In this code snippet, we tell iText that whenever a <qr>-tag is encountered, we need to treat this tag as a block element. We simply reuse the BlockCssApplier that is used for block elements such as <div>, <p>, <blockquote>, and so on. You can inspect the source code of the DefaultTagCssApplierMapping for a complete overview of the default ICssApplier mapping.

In the next couple of examples, we'll create custom CSS appliers.

Creating custom CSS appliers

In the previous example, we extended the DefaultCssApplierFactory to add a BlockCssApplier for the <qr>-tag. In the next example, we'll create our own ICssApplier implementation.

In this rather artificial example, we'll take the sxsw.html HTML file from chapter 3, and we'll override the CSS applier for all <div>-elements. We'll ignore all the styles for these elements, except the background. If a background color is defined for a <div>-element, we'll replace the actual value with a gray value (#dddddd). See figure 5.5.

Figure 5.5: Existing backgrounds for <div>-elements turn gray

The ICssApplier interface has a single method, named apply(). In the C05E05_GrayBackground example, we implement this method like this:

class GrayBackgroundBlockCssApplier implements ICssApplier {
    @Override
    public void apply(ProcessorContext context,
        IStylesContainer stylesContainer, ITagWorker tagWorker){
        Map<String, String> cssProps = stylesContainer.getStyles();
        IPropertyContainer container = tagWorker.getElementResult();
        if (container != null && cssProps.containsKey(CssConstants.BACKGROUND_COLOR)) {
            cssProps.put(CssConstants.BACKGROUND_COLOR, "#dddddd");
            BackgroundApplierUtil.applyBackground(cssProps, context, container);
        }
    }
}

If there is a background color, we replace that color with #dddddd, and we apply the background to the container with the BackgroundApplierUtil class. You can still see elements with a colored background in the resulting PDF. That's because those backgrounds were defined in the context of an <h2> or an <li> element, and our custom ICssApplier is only used for <div> elements.

Be very careful when you implement this kind of functionality. In this case, we are totally ignoring any other CSS properties that might have been applicable. It's much better to extend an existing ICssApplier implementation.
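
To make this risk concrete, consider a simplified model in which the styles of an element are a plain map of strings. The GrayBackgroundSketch class below is a hypothetical stand-in, not the iText applier; it only reports which of the declared properties would still have an effect if an applier looked at the background color and nothing else.

```java
import java.util.HashMap;
import java.util.Map;

class GrayBackgroundSketch {
    // Returns the properties that an applier acting only on the background color would apply.
    static Map<String, String> effectiveStyles(Map<String, String> declared) {
        Map<String, String> applied = new HashMap<>();
        if (declared.containsKey("background-color")) {
            applied.put("background-color", "#dddddd"); // the declared value is replaced
        }
        // Every other declared property is silently dropped.
        return applied;
    }

    // Sample styles for a <div>; the values are made up for this illustration.
    static Map<String, String> sample() {
        Map<String, String> declared = new HashMap<>();
        declared.put("background-color", "red");
        declared.put("color", "white");
        declared.put("font-size", "12pt");
        return declared;
    }

    public static void main(String[] args) {
        System.out.println(effectiveStyles(sample())); // prints "{background-color=#dddddd}"
    }
}
```

Of the three declared properties, only one survives. An applier that extends an existing implementation and calls the inherited behavior would keep the other two.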

In the next example, we are going to use some "Dutch CSS" to define colors. Since there is no such thing as "Dutch CSS", we're going to extend the DefaultCssApplierFactory to support this fictitious CSS functionality.

Implementing your own custom CSS

The dutch_css.html HTML file is a variation on the files we used in chapter 2, but there is something peculiar about it. It uses a Dutch version of CSS:

<html>
    <head>
        <title>Colossal</title>
        <meta name="description" content="Gloria is an out-of-work party girl..." />
    </head>
    <body>
        <img src="img/colossal.jpg" style="width: 120px;float: right" />
        <h1 style="achtergrond: rood; kleur: wit;">Colossal (2016)</h1>
        <div style="font-style: italic; kleur: blauw;">Directed by Nacho Vigalondo</div>
        <div style="kleur: groen;">
        Gloria is an out-of-work party girl forced to leave her life in New York City,
        and move  back home. When reports surface that a giant creature is
        destroying Seoul, she gradually comes to the realization that she is
        somehow connected to this phenomenon.
        </div>
        <div style="font-size: 0.8em">Read more about this movie on
        <a href="www.imdb.com/title/tt4680182">IMDB</a></div>
    </body>
</html>

Do you see how we defined the colors in this HTML file? We used achtergrond instead of background, kleur instead of color, and we used the colors wit, rood, groen, and blauw. Obviously, this is not going to work in a browser, but we can make it work when we convert the HTML to PDF. See figure 5.6.

Figure 5.6: Making Dutch CSS work when converting HTML to PDF

To achieve this, we mapped Dutch names of colors to English color names in the C05E06_DutchCss example:

public static final Map<String, String> KLEUR = new HashMap<String, String>();
static {
    KLEUR.put("wit", "white");
    KLEUR.put("zwart", "black");
    KLEUR.put("rood", "red");
    KLEUR.put("groen", "green");
    KLEUR.put("blauw", "blue");
}

We also extended the BlockCssApplier class:

  1. class DutchColorCssApplier extends BlockCssApplier {
  2.     @Override
  3.     public void apply(ProcessorContext context,
  4.         IStylesContainer stylesContainer, ITagWorker tagWorker){
  5.         Map<String, String> cssStyles = stylesContainer.getStyles();
  6.         if(cssStyles.containsKey("kleur")){
  7.             cssStyles.put(CssConstants.COLOR,
  8.                 KLEUR.get(cssStyles.get("kleur")));
  9.             stylesContainer.setStyles(cssStyles);
  10.         }
  11.         if(cssStyles.containsKey("achtergrond")){
  12.             cssStyles.put(CssConstants.BACKGROUND_COLOR,
  13.                 KLEUR.get(cssStyles.get("achtergrond")));
  14.             stylesContainer.setStyles(cssStyles);
  15.         }
  16.         super.apply(context, stylesContainer,tagWorker);
  17.     }
  18. }

Whenever we encounter the Dutch CSS property kleur (line 6), we translate its value to English (line 8), and set this value as the color property (line 7). Whenever we encounter the Dutch CSS property achtergrond (line 11), we translate its value to English (line 13), and we set this value as the background-color property (line 12). For all the other CSS properties, we rely on their implementation in the BlockCssApplier class (line 16).
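
The translation step can be tried out on its own with plain maps. The sketch below is a hypothetical variation, not the code of the example: DutchCssSketch, PROPERTY, and translate() are names introduced for this illustration, and, unlike the listing, the sketch skips color names that aren't in the KLEUR map instead of storing a null value.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class DutchCssSketch {
    // Dutch color names and their English counterparts, as in the example.
    static final Map<String, String> KLEUR = new LinkedHashMap<>();
    // Dutch property names and the CSS properties they stand for.
    static final Map<String, String> PROPERTY = new LinkedHashMap<>();
    static {
        KLEUR.put("wit", "white");
        KLEUR.put("zwart", "black");
        KLEUR.put("rood", "red");
        KLEUR.put("groen", "green");
        KLEUR.put("blauw", "blue");
        PROPERTY.put("kleur", "color");
        PROPERTY.put("achtergrond", "background-color");
    }

    // Returns a copy of the styles with the translated properties added.
    static Map<String, String> translate(Map<String, String> styles) {
        Map<String, String> result = new LinkedHashMap<>(styles);
        for (Map.Entry<String, String> property : PROPERTY.entrySet()) {
            String dutchValue = styles.get(property.getKey());
            if (dutchValue == null) {
                continue; // the Dutch property isn't present
            }
            String english = KLEUR.get(dutchValue.trim().toLowerCase(Locale.ROOT));
            if (english != null) {
                result.put(property.getValue(), english);
            }
            // Unknown color names are skipped, so no null value ends up in the styles.
        }
        return result;
    }

    // The inline styles of the <h1> element in dutch_css.html.
    static Map<String, String> heading() {
        Map<String, String> styles = new LinkedHashMap<>();
        styles.put("achtergrond", "rood");
        styles.put("kleur", "wit");
        return styles;
    }

    public static void main(String[] args) {
        System.out.println(translate(heading()));
    }
}
```

In this model, the translated map is what would be handed to the inherited apply() method, which corresponds to line 16 of the listing.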

We'll use this DutchColorCssApplier for every <h1>- and <div>-tag:

public void createPdf(String src, String dest) throws IOException {
    ConverterProperties converterProperties = new ConverterProperties();
    converterProperties.setCssApplierFactory(new DefaultCssApplierFactory() {
        ICssApplier dutchCssColor = new DutchColorCssApplier();
        @Override
        public ICssApplier getCustomCssApplier(IElementNode tag) {
            if(tag.name().equals(TagConstants.H1)
                || tag.name().equals(TagConstants.DIV)){
                return dutchCssColor;
            }
            return null;
        }
    });
    HtmlConverter.convertToPdf(new File(src), new File(dest), converterProperties);
}

I leave it to your imagination to decide in which use cases this would actually be useful, but I hope that these examples provide some insight into the inner workings of pdfHTML, and at the same time prove that the pdfHTML add-on is highly extensible. Using the mechanisms described in this chapter, you can adapt pdfHTML to your own needs and requirements.

Summary

In this chapter, we changed the core functionality of the pdfHTML add-on by changing the way tags and CSS are interpreted. We (deliberately) introduced a bug that made an <a>-tag behave as if it were an ordinary <span>-tag. We introduced custom tags that served as placeholders for names, dates, and even QR Codes. We downgraded the CSS properties for <div>-tags so that every CSS property except the background would be ignored. Finally, we introduced Dutch CSS to define colors, and we made this CSS work in the HTML to PDF conversion process.

In the next chapter, we'll discuss a topic that is long overdue in this tutorial: fonts. So far, we've only used fonts such as Helvetica and FreeSans, but which other fonts can we use? We'll discover this in the next chapter.