What is the difference between getPageLabels and getPageLabelFormats?

Tags: page labels, PdfReader, iText 5

I have a program that calls PdfPageLabels.getPageLabels() and PdfPageLabels.getPageLabelFormats() on the same PdfReader object on successive lines of my code:

PdfPageLabels.PdfPageLabelFormat[] pplf = 
    PdfPageLabels.getPageLabelFormats(reader);
String[] labs = PdfPageLabels.getPageLabels(reader);
I would have expected the two calls always to return arrays of the same length, since they are supposed to describe the same labels. This is true most of the time, but occasionally it is not.

I have an example: a 150 MB PDF file that appears to have 4670 labels according to getPageLabels(), but only 1 according to getPageLabelFormats(). So my question is: under what circumstances can the two calls return arrays of different lengths?

Posted on StackOverflow on Dec 3, 2015 by paulb

The difference between the two methods is simple:

  • getPageLabels() returns the label of every page in an array. If your PDF has 4670 pages, you will get an array with 4670 String values.

  • getPageLabelFormats() returns an array with the formats that are used in the document. It doesn't return String values, but PdfPageLabelFormat instances. In many cases, only one page label format is used throughout the document.

For example:

You have a document with an intro of five pages, numbered i, ii, iii, iv and v. Then you have a hundred pages, numbered 1 to 100.

In this case, getPageLabels() should return an array with 105 String values. The getPageLabelFormats() method, however, will return only two PdfPageLabelFormat values, because the document uses only two page label formats:

  • one saying that lowercase roman numerals, starting with i, begin at the first physical page.

  • one saying that arabic numerals, starting with 1, begin at the sixth physical page.

Only the start of each format is needed: physical pages 2 to 5 have the same format as physical page 1, and physical pages 7 to 105 have the same format as physical page 6.
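
To make the relationship concrete, here is a minimal sketch in plain Java that expands a short list of formats into one label per page. It does not use iText: the class PageLabelSketch, its nested Format class and the expand() method are hypothetical names invented for this illustration, and the sketch only handles lowercase roman and arabic styles without prefixes. It is not how iText implements these methods, just a model of why two formats can yield 105 labels.

```java
import java.util.ArrayList;
import java.util.List;

class PageLabelSketch {

    /** Hypothetical stand-in for a page label format: where a range starts and how it counts. */
    static final class Format {
        final int physicalPage;  // 1-based physical page on which this format starts
        final boolean roman;     // true: lowercase roman numerals, false: arabic numerals
        final int logicalStart;  // first number used on that physical page

        Format(int physicalPage, boolean roman, int logicalStart) {
            this.physicalPage = physicalPage;
            this.roman = roman;
            this.logicalStart = logicalStart;
        }
    }

    /** Converts a positive number to lowercase roman numerals. */
    static String toLowerRoman(int n) {
        int[] values = {1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1};
        String[] symbols = {"m", "cm", "d", "cd", "c", "xc", "l", "xl", "x", "ix", "v", "iv", "i"};
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            while (n >= values[i]) {
                sb.append(symbols[i]);
                n -= values[i];
            }
        }
        return sb.toString();
    }

    /** Expands formats (sorted by physicalPage, first one starting at page 1) into one label per page. */
    static String[] expand(List<Format> formats, int pageCount) {
        String[] labels = new String[pageCount];
        for (int f = 0; f < formats.size(); f++) {
            Format fmt = formats.get(f);
            // A format applies until the page before the next format starts, or to the last page.
            int end = (f + 1 < formats.size()) ? formats.get(f + 1).physicalPage - 1 : pageCount;
            for (int page = fmt.physicalPage; page <= end; page++) {
                int number = fmt.logicalStart + (page - fmt.physicalPage);
                labels[page - 1] = fmt.roman ? toLowerRoman(number) : Integer.toString(number);
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        List<Format> formats = new ArrayList<>();
        formats.add(new Format(1, true, 1));   // pages 1-5: i, ii, iii, iv, v
        formats.add(new Format(6, false, 1));  // pages 6-105: 1 to 100
        String[] labels = expand(formats, 105);
        System.out.println(formats.size() + " formats, " + labels.length + " labels");
        System.out.println(labels[0] + " " + labels[4] + " " + labels[5] + " " + labels[104]);
    }
}
```

In this model, the list of formats plays the role of the array returned by getPageLabelFormats(), and the expanded array plays the role of the one returned by getPageLabels(). A document with 4670 pages and a single numbering scheme would have one format and 4670 labels, which matches the numbers in the question.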