How to define the data encoding when submitting a PDF form using AcroForm technology?

Tags: formssubmit formencodingiText 7

When I create a PDF form (for instance using Acrobat) that contains text fields in AcroForm format (PDF dictionaries, no XFA), and I submit the data to a server, how can I specify/retrieve the encoding that will be used?

For instance. When I submit the Chinese glyphs '测试' (test), I receive the following headers and content on the server-side:

accept: application/x-ms-application, image/jpeg, application/xaml+xml, image/gif, image/pjpeg, application/x-ms-xbap, application/, application/, application/msword, */*
content-type: application/x-www-form-urlencoded
content-length: 23
acrobat-version: 10.1.4
user-agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; MDDC; .NET4.0C; AskTbCLA/
accept-encoding: gzip, deflate
connection: Keep-Alive
There's no reference to an encoding, except x-www-form-urlencoded. The two glyphs are represented as four bytes: B2 E2 CA D4. After some investigation, I know that B2E2 is the GBK value for the first glyph, and CAD4 the GBK value for the second glyph, but I can't derive this from the request header.

Where can I find information about the data encoding used when submitting PDF? Is it always GBK? Can I change the data encoding by setting a specific key in a dictionary in the PDF? For instance: can I make sure the PDF always sends Unicode characters instead of GBK?

Note that I've already experimented by changing the default font (and encoding) of the text field. I've also searched ISO-32000-1 for encodings in fields, but all I found was a way to define non-Latin characters for check boxes, and some info about the encoding of an FDF file. None of which answered my questions.

Posted on StackOverflow on Sep 26, 2012 by Bruno Lowagie

I didn't find anything in ISO-32000-1, but studying the Acrobat JavaScript reference, I found the cCharset parameter that is available for the submitForm() method. That parameter defines:

The encoding for the values submitted. String values are utf-8, utf-16, Shift-JIS, BigFive, GBK, and UHC. If not passed, the current Acrobat behavior applies. For XML-based formats, utf-8 is used. For other formats, Acrobat tries to find the best host encoding for the values being submitted. XFDF submission ignores this value and always uses utf-8.

In other words: in my case GBK was used because it fits best to submit Chinese characters. However, one could force UTF-8 by using the submitForm() JavaScript method using the appropriate value.

Based on this question, I have asked the ISO committee to fix this problem in ISO-32000-2. As a result, an extra possible entry was added to the table entitled Additional entries specific to a submit-form action in section

CharSet string: (Optional; inheritable) Possible values include: utf-8, utf-16, Shift-JIS, BigFive, GBK, or UHC.

Starting with PDF 2.0, this problem will no longer exist.

Click this link if you want to see how to answer this question in iText 5.