form submission UTF-16/UTF-32 from Jungshik Shin on 2003-11-10 (www-international@w3.org from October to December 2003)

From: Jungshik Shin <jshin@i18nl10n.com>
Date: Tue, 11 Nov 2003 02:29:22 +0900 (KST)
To: www-international@w3.org
Cc: w3c-i18n-ig@w3.org
Message-ID: <Pine.LNX.4.58.0311110111280.2917@jshin.net>

Hi,

While trying to fix Mozilla bug 224820
(http://bugzilla.mozilla.org/show_bug.cgi?id=224820), I found that
both Opera and MS IE send POST data from UTF-16LE/UTF-16BE pages in
url-escaped UTF-8 (when Content-Type is application/x-www-form-urlencoded).
To see that in action, go to
http://members.lycos.nl/slu/test2.php?enc=UTF-16LE, type in some
characters and submit. In the case of MS IE, at first I thought it did
that because I hadn't turned off the default option 'always send URLs
in UTF-8'. That is, I thought that the option would control not just the
way URLs (with path elements for GET) are sent but also how POST data is
submitted. However, even after I turned that off and restarted IE,
IE still sends POST data (from the form in a UTF-16LE page) in UTF-8.
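
For illustration, the kind of request body described above (url-escaped
UTF-8, regardless of the page encoding) can be reproduced with a short
Python sketch. The field name `q` and the sample value are assumptions;
the test page's real field names are not given in this message.

```python
from urllib.parse import urlencode

# Hypothetical field name and value, chosen only for illustration.
fields = {"q": "A 한"}

# UTF-8 bytes, url-escaped, with '+' standing in for the space.
body = urlencode(fields, encoding="utf-8")
print(body)
```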

I was also pleasantly surprised to find that PHP5 can handle it properly
(previously PHP's I18N support was not so good). That part is certainly
good news.

On the other hand, it doesn't honor the 'charset' parameter specified
in the Content-Type header. So, even when I submitted url-encoded UTF-16LE
strings with the following header (with my patched version of Mozilla),
it didn't interpret them as UTF-16LE.

Content-Type: application/x-www-form-urlencoded; charset=UTF-16LE
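
A receiver that did honor the charset parameter might look roughly like
the following Python sketch. It assumes that every byte of the body is
percent-escaped (field names included), and the helper names are
hypothetical. It does not describe how PHP5 or any browser behaves.

```python
from urllib.parse import unquote_to_bytes

def charset_from_content_type(header, default="utf-8"):
    """Return the charset parameter of a Content-Type value, if any."""
    for param in header.split(";")[1:]:
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset":
            return value.strip().strip('"')
    return default

def decode_form_body(body, content_type):
    """Undo the percent-escaping, then decode with the declared charset."""
    charset = charset_from_content_type(content_type)
    fields = {}
    for pair in body.split("&"):
        name, _, value = pair.partition("=")
        key = unquote_to_bytes(name).decode(charset)
        fields[key] = unquote_to_bytes(value).decode(charset)
    return fields

header = "application/x-www-form-urlencoded; charset=UTF-16LE"
# 'q' = 71 00, 'A' = 41 00, U+D55C = 5C D5 in UTF-16LE
print(decode_form_body("%71%00=%41%00%5C%D5", header))
```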

I'm all for UTF-8, but some people would like to send POST data in
url-encoded UTF-16 for some reason (the most frequently cited one
being the space advantage when dealing with CJK data). However, the
form submission section in HTML 4.x clearly was written with only
ASCII-preserving byte-oriented character encodings in mind, so
a couple of issues have to be settled. For instance, '+' is supposed
to replace space, but how should '+' be represented if the character
encoding is UTF-16LE? Latin letters in ASCII represent themselves
in byte-oriented encodings, but obviously doing so in UTF-16(LE|BE)
(i.e. with 0x00 0x41 for 'A' in UTF-16BE) doesn't work. So, the only
reliable way seems to be to url-encode every byte of every character
(i.e. for U+0041 'A' in UTF-16BE, use %00%41).
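
The "escape every byte" approach above can be sketched in Python as
follows. The function name `escape_every_byte` is made up for this
example, and the comparison with `urllib.parse.quote` only shows what a
conventional escaper, which leaves ASCII-range bytes alone, would produce.

```python
from urllib.parse import quote

def escape_every_byte(text, encoding):
    """Percent-escape every byte of text in the given encoding."""
    return "".join(f"%{byte:02X}" for byte in text.encode(encoding))

# A conventional escaper keeps the 0x41 byte as a literal 'A', giving a
# mix of escaped NULs and raw letters for UTF-16 input.
mixed = quote("A".encode("utf-16-be"))

full_be = escape_every_byte("A", "utf-16-be")
full_le = escape_every_byte("a+b c", "utf-16-le")
print(mixed, full_be, full_le)
```

With every byte escaped, '+' (%2B%00 in UTF-16LE) and space (%20%00) stay
distinct without any special-casing. The cost is six characters on the
wire per UTF-16 code unit, against nine for a BMP CJK character in
url-escaped UTF-8.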

I'd like to hear what others think about this issue.

Jungshik

Received on Monday, 10 November 2003 12:29:25 UTC