Unicode in HTTP streams

To Whom This May Concern,
I am writing this email to suggest the need for an explicit encoding format
to specify a Unicode codepoint in an HTTP stream.

Some recent proposals suggest that to encode a character as Unicode, one should
first convert it to UTF-8, format each octet as %HH, and send that out.  My
experience with query strings, cookies, and form data is that user agents do
not encode in UTF-8 before formatting octets as %HH.  Rather, I have found
that the %HH format is context sensitive: it is an agreement between the
sender and the receiver.  Only when a page is specifically sent to a user
agent in UTF-8 will the user agent return data in the %HH format as UTF-8.
Since most HTML pages are still in character sets other than UTF-8, this means
that use of the %HH format to mean UTF-8 is quite rare.
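To illustrate the ambiguity (this sketch is my own, not part of any proposal;
the function name and the choice of character are arbitrary), the same
character yields different %HH sequences depending on which charset the sender
and receiver have agreed on:

  # Python sketch: percent-encode one character under two charsets.
  def percent_encode(text, charset):
      # Encode the text in the given charset, then format each octet as %HH.
      return "".join("%%%02X" % b for b in text.encode(charset))

  ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
  print(percent_encode(ch, "utf-8"))       # %C3%A9  (two UTF-8 octets)
  print(percent_encode(ch, "iso-8859-1"))  # %E9     (one Latin-1 octet)

Nothing in the %HH sequence itself says which of these encodings was used.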

Since the %HH format seems already to be widely used to encode "any"
character set, I cannot see how any software will be able to change over and
start interpreting %HH as UTF-8.

Rather, it seems to me that what is needed is a new HTTP encoding that
explicitly indicates a Unicode codepoint, analogous to the &#xHHHH; format
that was invented for this very purpose in HTML.  In my investigations, I
have already seen that some user agents will encode Unicode using the %uHHHH
format.  I have also seen that some servers already interpret %uHHHH as
Unicode.  Since the %uHHHH format is not currently an allowed sequence in
HTTP, I believe that it could be adopted as an extension to the current
HTTP specification.  (This belief is partially bolstered by the knowledge
that some servers already interpret %uHHHH as a Unicode codepoint.)
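As a rough sketch of the kind of behavior I have in mind (my own illustration,
assuming four hex digits naming a codepoint in the Basic Multilingual Plane,
not a description of any particular user agent or server):

  # Python sketch: produce and interpret %uHHHH sequences.
  import re

  def encode_uhhhh(text):
      # Format each codepoint as %uHHHH (assumes BMP codepoints only).
      return "".join("%%u%04X" % ord(c) for c in text)

  def decode_uhhhh(s):
      # Replace each %uHHHH sequence with the codepoint it names.
      return re.sub(r"%u([0-9A-Fa-f]{4})",
                    lambda m: chr(int(m.group(1), 16)), s)

  print(encode_uhhhh("\u00e9"))          # %u00E9
  print(decode_uhhhh("%u00E9%u3042"))    # the two named codepoints

Because the sequence names the codepoint directly, it needs no out-of-band
agreement about which charset the octets are in.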

But whatever format is chosen, I think we need one, and I don't see %HH ever
being redefined to mean UTF-8, because too much existing web software would
break.

Regards,
Paul Deuter
Plumtree Software
paul.deuter@plumtree.com

Received on Tuesday, 15 May 2001 21:48:54 UTC