- From: Jungshik Shin <jshin@i18nl10n.com>
- Date: Tue, 11 Nov 2003 02:29:22 +0900 (KST)
- To: www-international@w3.org
- Cc: w3c-i18n-ig@w3.org
Hi,

While trying to fix Mozilla bug 224820 (http://bugzilla.mozilla.org/show_bug.cgi?id=224820), I found that both Opera and MS IE send POST data from UTF-16LE/UTF-16BE pages as url-escaped UTF-8 (when the Content-Type is application/x-www-form-urlencoded). To see that in action, go to http://members.lycos.nl/slu/test2.php?enc=UTF-16LE, type in some characters and submit.

In the case of MS IE, at first I thought it did that because I hadn't turned off the default option 'always send URLs in UTF-8'. That is, I thought that the option would control not just the way URLs (with path elements for GET) are sent but also how POST data is submitted. However, even after I turned that off and restarted IE, IE still sends POST data (from the form in a UTF-16LE page) in UTF-8.

I was also pleasantly surprised to find that PHP5 can handle it properly (previously PHP's I18N was not so good). That part is certainly good news. On the other hand, it doesn't honor the 'charset' parameter specified in the Content-Type header. So, even when I submitted url-encoded UTF-16LE strings with the following header (with my patched version of Mozilla), it didn't interpret them as UTF-16LE.

Content-Type: application/x-www-form-urlencoded; charset=UTF-16LE

I'm all for UTF-8, but some people would like to send POST data in url-encoded UTF-16 for some reason (the most frequently cited one being the space advantage when dealing with CJK data). However, the form submission section in HTML 4.x was clearly written with only ASCII-preserving byte-oriented character encodings in mind, so a couple of issues have to be settled. For instance, '+' is supposed to replace space, but how should '+' be represented if the character encoding is UTF-16LE? Latin letters in ASCII represent themselves in byte-oriented encodings, but obviously doing so in UTF-16(LE|BE) (i.e. with 0x00 0x41 for 'A' in UTF-16BE) doesn't work. So, the only reliable way seems to be to url-encode every character (i.e. for U+0041 'A', use %00%41).
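To make the "url-encode every character" idea concrete, here is a minimal Python sketch. The helper names percent_encode_all and percent_decode are hypothetical (they are not from Mozilla, IE, Opera or PHP), and the sketch assumes that every byte of the encoded text is escaped, so neither a literal '+' nor a literal ASCII letter ever appears on the wire.

```python
from urllib.parse import unquote_to_bytes


def percent_encode_all(text: str, encoding: str) -> str:
    """Escape every byte of `text` in `encoding` as %XX (hypothetical helper)."""
    return "".join(f"%{byte:02X}" for byte in text.encode(encoding))


def percent_decode(escaped: str, encoding: str) -> str:
    """Reverse percent_encode_all: unescape to bytes, then decode."""
    return unquote_to_bytes(escaped).decode(encoding)


# 'A' (U+0041): the byte order decides which escape comes first.
print(percent_encode_all("A", "utf-16-be"))  # %00%41
print(percent_encode_all("A", "utf-16-le"))  # %41%00

# '+' and space are escaped like anything else, so the '+'-means-space
# convention never comes into play.
print(percent_encode_all("+", "utf-16-le"))  # %2B%00
print(percent_encode_all(" ", "utf-16-le"))  # %20%00

# A Hangul syllable (U+AC00): two escaped bytes in UTF-16, three in UTF-8.
print(percent_encode_all("\uAC00", "utf-16-be"))  # %AC%00
print(percent_encode_all("\uAC00", "utf-8"))      # %EA%B0%80
```

With everything escaped, a receiver that honored the charset parameter could recover the text with percent_decode(data, "utf-16-le"). Note that the escaped form of an ASCII letter grows from one character to six under this scheme, while the Hangul syllable shrinks from nine to six relative to escaped UTF-8.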
I'd like to hear what others think about this issue.

Jungshik
Received on Monday, 10 November 2003 12:29:25 UTC