- From: by way of Martin Duerst <Paul.Deuter@plumtree.com>
- Date: Wed, 16 May 2001 10:27:42 +0900
- To: www-international@w3.org
To Whom This May Concern,

I am writing this email to suggest the need for an explicit encoding format to specify a Unicode codepoint in an HTTP stream.

Some recent proposals suggest that to encode a character as Unicode, one should first convert it to UTF-8 and then format each octet as %HH and send it out. My experience with query strings, cookies, and form data is that user agents do not encode in UTF-8 before formatting octets as %HH. Rather, I have found that the %HH format is context sensitive and is an agreement between the sender and the receiver. Only when a page is specifically sent down to a user agent in UTF-8 will the user agent return data in the %HH format in UTF-8. Since most HTML pages are still in character sets other than UTF-8, the use of the %HH format to mean UTF-8 is quite rare.

Since the %HH format already seems to be widely used to encode "any" character set, I cannot see how any software will be able to change over and start interpreting %HH as UTF-8. Rather, it seems to me that what is needed is a new HTTP encoding that explicitly indicates a Unicode codepoint, analogous to the &#xHHHH; format that was invented for this very purpose in HTML.

In my investigations, I have already seen that some user agents will encode Unicode using the %uHHHH format. I have also seen that some servers already interpret %uHHHH as Unicode.

Since the %uHHHH format is not currently an allowed sequence in HTTP, I believe that it could be adopted as an extension to the current HTTP specification. (This belief is partially bolstered by the knowledge that some servers already interpret %uHHHH as a Unicode codepoint.)

But whatever format is chosen, I think we need an explicit one. I do not see %HH ever being reinterpreted as UTF-8, because too much existing web software would break.

Regards,
Paul Deuter
Plumtree Software
paul.deuter@plumtree.com
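
To illustrate the ambiguity described above, here is a minimal Python sketch. It shows that the same character yields different %HH sequences depending on the charset the sender assumes, and it sketches one possible reading of the %uHHHH form. The function names encode_percent_u and decode_percent_u, the legacy_charset parameter, and the restriction to codepoints up to U+FFFF are assumptions made for illustration only; %uHHHH is not part of any HTTP or URI specification.

```python
import re
from urllib.parse import quote, unquote

# The same character produces different %HH octets depending on the
# charset the sender happens to use, so %HH alone is ambiguous.
latin1_form = quote("é", safe="", encoding="iso-8859-1")  # one octet
utf8_form = quote("é", safe="", encoding="utf-8")         # two octets


def encode_percent_u(text: str) -> str:
    """Hypothetical encoder: ASCII uses ordinary %HH escaping,
    other codepoints up to U+FFFF become %uHHHH."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(quote(ch, safe=""))
        elif cp <= 0xFFFF:
            out.append(f"%u{cp:04X}")
        else:
            # Assumption: this sketch does not define a form beyond the BMP.
            raise ValueError(f"U+{cp:04X} is outside the range of this sketch")
    return "".join(out)


# Matches either one %uHHHH escape or a run of ordinary %HH escapes.
_ESCAPE = re.compile(r"%u([0-9A-Fa-f]{4})|((?:%[0-9A-Fa-f]{2})+)")


def decode_percent_u(text: str, legacy_charset: str = "iso-8859-1") -> str:
    """Hypothetical decoder: %uHHHH is always a Unicode codepoint;
    plain %HH runs are decoded with the agreed legacy charset."""
    def replace(match: re.Match) -> str:
        if match.group(1) is not None:
            return chr(int(match.group(1), 16))
        return unquote(match.group(2), encoding=legacy_charset)

    return _ESCAPE.sub(replace, text)
```

With this sketch, a receiver still needs out-of-band agreement on the charset for plain %HH, but a %uHHHH escape decodes the same way regardless of that agreement, which is the property the proposal is after.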
Received on Tuesday, 15 May 2001 21:48:54 UTC