Can I generate a PDF from a URL instead of from a file on disk?

Tags: pdfHTML, URL, HTML to PDF

This question was asked on Stack Overflow on Aug 14, '17 by Srinivas Ch

You can generate a PDF from any HTML InputStream. In most of the examples, we used a FileInputStream, but in chapter 4, we created reports that existed only in memory as a byte[]. In that case, we used a ByteArrayInputStream. We can also use an InputStream that was created from a URL object.

Suppose that we use this URL:

public static final String ADDRESS = "https://stackoverflow.com/help/on-topic";

If we open this URL in a browser, we see the following page:

The Stack Overflow page in the browser

In the C07E04_CreateFromURL example, we use ADDRESS to create a Java URL object:

new C07E04_CreateFromURL().createPdf(new URL(ADDRESS), DEST);

We use the following createPdf() method:

public void createPdf(URL url, String dest) throws IOException {
    HtmlConverter.convertToPdf(url.openStream(), new FileOutputStream(dest));
}

The openStream() method gives us an InputStream that iText uses to read the HTML. Obviously, this only works on a machine that has access to the internet.
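Because the conversion depends on a network connection, it can help to download the HTML first, so that a slow or unreachable server fails fast and separately from the conversion itself. The following is a minimal sketch using only the standard library; the class name `UrlHtmlFetcher` and the timeout values are assumptions made for illustration, not part of the original example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

class UrlHtmlFetcher {
    // Hypothetical limits; tune them for your environment.
    static final int CONNECT_TIMEOUT_MS = 5_000;
    static final int READ_TIMEOUT_MS = 10_000;

    /** Downloads everything behind the URL into memory. */
    static byte[] fetch(URL url) throws IOException {
        URLConnection connection = url.openConnection();
        connection.setConnectTimeout(CONNECT_TIMEOUT_MS);
        connection.setReadTimeout(READ_TIMEOUT_MS);
        // try-with-resources closes the network stream even if reading fails.
        try (InputStream in = connection.getInputStream();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }

    /** Wraps the downloaded bytes in an in-memory InputStream. */
    static InputStream open(URL url) throws IOException {
        return new ByteArrayInputStream(fetch(url));
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: UrlHtmlFetcher <url>");
            return;
        }
        byte[] html = fetch(new URL(args[0]));
        System.out.println("Downloaded " + html.length + " bytes");
    }
}
```

The stream returned by `open(url)` could then be passed to HtmlConverter.convertToPdf() in place of url.openStream(). Note that this only covers the HTML document itself; images and stylesheets referenced by the page are still fetched during conversion.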

For pages with lots of pictures, it can take a while for iText to download all the resources, but this help page from Stack Overflow should load quickly, and the result will look like this:

The Stack Overflow page rendered to A4 pages in PDF
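The resources mentioned above (images, stylesheets) are usually referenced with relative addresses, and a bare InputStream carries no information about where the HTML came from. pdfHTML lets you supply a base URI through ConverterProperties.setBaseUri(); whether a given page needs it is something to verify for your own case. The sketch below uses only java.net.URI to show how relative references resolve against the page address. The class name `ResourceResolver` and the sample references are hypothetical and not taken from the actual page.

```java
import java.net.URI;

class ResourceResolver {
    /** Resolves a reference found in an href or src attribute against the page address. */
    static URI resolve(String pageAddress, String reference) {
        return URI.create(pageAddress).resolve(reference);
    }

    public static void main(String[] args) {
        String page = "https://stackoverflow.com/help/on-topic";
        // Hypothetical references of the kind found in HTML attributes.
        String[] references = {
            "/content/site.css",        // root-relative
            "img/logo.png",             // relative to the page's directory
            "//cdn.example.com/app.js"  // scheme-relative
        };
        for (String reference : references) {
            System.out.println(reference + " -> " + resolve(page, reference));
        }
    }
}
```

Without a base address, none of these three references could be turned into something downloadable, which is one possible reason for missing images or styles in a converted page.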

An A4 page may not be the ideal page size for a web page: in this result, the complete sidebar is missing. Let's adapt the example and introduce a media query.

The createPdf() method of the C07E05_CreateFromURL2.java example looks like this:

public void createPdf(URL url, String dest) throws IOException {
    PdfWriter writer = new PdfWriter(dest);
    PdfDocument pdf = new PdfDocument(writer);
    PageSize pageSize = new PageSize(850, 1700);
    pdf.setDefaultPageSize(pageSize);
    ConverterProperties properties = new ConverterProperties();
    MediaDeviceDescription mediaDeviceDescription =
        new MediaDeviceDescription(MediaType.SCREEN);
    mediaDeviceDescription.setWidth(pageSize.getWidth());
    properties.setMediaDeviceDescription(mediaDeviceDescription);
    HtmlConverter.convertToPdf(url.openStream(), pdf, properties);
}

We use a custom page size of 850 by 1700 user units, and we use the SCREEN media type, as we did in chapter 2. Now the content fits the page, and we get a much better result:
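To get a feeling for how large such a page is, recall that a PDF user unit corresponds to 1/72 inch by default, and that A4 measures 595 by 842 user units. The following small sketch converts both page sizes to inches and millimeters; the class name `PageUnits` is an assumption made for illustration.

```java
class PageUnits {
    static final double USER_UNITS_PER_INCH = 72.0; // default PDF user unit: 1/72 inch
    static final double MM_PER_INCH = 25.4;

    static double toInches(double userUnits) {
        return userUnits / USER_UNITS_PER_INCH;
    }

    static double toMillimeters(double userUnits) {
        return toInches(userUnits) * MM_PER_INCH;
    }

    static String describe(String name, double width, double height) {
        return String.format("%s: %.1f x %.1f in, %.0f x %.0f mm",
                name, toInches(width), toInches(height),
                toMillimeters(width), toMillimeters(height));
    }

    public static void main(String[] args) {
        System.out.println(describe("Custom (850 x 1700)", 850, 1700));
        System.out.println(describe("A4 (595 x 842)", 595, 842));
    }
}
```

The custom page is roughly 300 mm wide, compared to 210 mm for A4, which goes some way toward explaining why a layout with a sidebar now has room to fit.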

The Stack Overflow page rendered to custom-sized pages in PDF

Sure, there are still some imperfections. For instance, the items in the header bar are shown as a list instead of as items in a menu bar. We plan to solve these issues in future versions of pdfHTML.

We could also have used the media type PRINT instead of SCREEN. See the C07E06_CreateFromURL3 example:

public void createPdf(URL url, String dest) throws IOException {
    ConverterProperties properties = new ConverterProperties();
    MediaDeviceDescription mediaDeviceDescription =
        new MediaDeviceDescription(MediaType.PRINT);
    properties.setMediaDeviceDescription(mediaDeviceDescription);
    HtmlConverter.convertToPdf(url.openStream(), new FileOutputStream(dest), properties);
}

Because of the print.css used by Stack Overflow, we now get a couple of bare-bones pages in which the sidebar is deliberately omitted. Maybe that's exactly what we want:

The Stack Overflow page rendered to A4 pages in PDF with the PRINT media type

Important: pdfHTML is a work in progress. If you have ever printed a web page to paper from a browser, you will have noticed that the results aren't always as good as you'd want them to be. The same is true when using pdfHTML as a URL2PDF tool. Most HTML pages aren't meant to be printed, but we're making a continuous effort to improve the conversion process in pdfHTML.