Unicode in HTTP streams from by way of Martin Duerst on 2001-05-16 (www-international@w3.org from April to June 2001)

From: by way of Martin Duerst <Paul.Deuter@plumtree.com>
Date: Wed, 16 May 2001 10:27:42 +0900
To: www-international@w3.org
Message-Id: <4.2.0.58.J.20010516102734.0377b100@sh.w3.mag.keio.ac.jp>

To Whom This May Concern,
I am writing this email to suggest the need for an explicit encoding format
to specify a Unicode codepoint in an HTTP stream.

Some recent proposals suggest that to encode a character as Unicode, first
convert to UTF-8 and then format each octet as %HH and send it out.  My
experience with query strings, cookies, and form data is that user agents do
not encode first in UTF-8 before formatting octets as %HH.  Rather, I have
found that the %HH format is context sensitive and amounts to an agreement
between the sender and the receiver.  Only when a page is specifically sent
to a user agent in UTF-8 will the user agent return %HH data in UTF-8.
Since most HTML pages are still in character sets other than UTF-8, use of
the %HH format to mean UTF-8 is quite rare.
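
As a minimal illustration of the context sensitivity described above, the
sketch below percent-encodes the same character under two different page
character sets.  The helper name percent_encode and the sample character are
assumptions chosen for illustration; the sketch models what a user agent
does rather than reproducing any particular browser.

```python
from urllib.parse import quote


def percent_encode(text: str, page_charset: str) -> str:
    """Percent-encode text the way a user agent would if it serialized
    form data in the character set of the page it was served
    (hypothetical helper, for illustration only)."""
    # safe="" forces every reserved character to be escaped as %HH.
    return quote(text, safe="", encoding=page_charset)


# The same character yields different octets, and so different %HH
# sequences, depending on the charset the sender assumes.
print(percent_encode("é", "iso-8859-1"))  # %E9
print(percent_encode("é", "utf-8"))       # %C3%A9
```

Nothing in the %HH sequence itself records which character set was used, so
the receiver has to know it by other means.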

Since the %HH format seems already to be widely used to encode "any"
character set, I cannot see how any software will be able to change over and
start interpreting %HH as UTF-8.
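
The receiving side of the same problem can be sketched as follows.  This
assumes a server that has to guess the character set of incoming %HH data;
the sample strings are made up for illustration.

```python
from urllib.parse import unquote

# Octets sent by a user agent that encoded "é" from an ISO-8859-1 page.
legacy = "%E9"

# Read with the charset the sender actually used, the character survives.
print(unquote(legacy, encoding="iso-8859-1"))  # é

# Read strictly as UTF-8, the lone octet 0xE9 is not a valid sequence.
try:
    unquote(legacy, encoding="utf-8", errors="strict")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)  # False

# The reverse mismatch does not fail; it silently produces the wrong text.
mojibake = unquote("%C3%A9", encoding="iso-8859-1")
print(mojibake)  # Ã©
```

A receiver that switched to a UTF-8 reading of %HH would therefore either
reject or corrupt data from senders that still use another character set.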

Rather, it seems to me that what is needed is a new HTTP encoding that
explicitly indicates a Unicode codepoint, analogous to the &#xHHHH; format
that was invented for this very purpose for HTML.  In my investigations, I
have already seen that some user agents will encode Unicode using the %uHHHH
format.  I have also seen that some servers already interpret %uHHHH as
Unicode.  Since the %uHHHH format is not currently an allowed sequence in
HTTP, I believe that it could be adopted as an extension to the current
HTTP specification.  (This belief is partially bolstered by the knowledge
that some servers already interpret %uHHHH as a Unicode codepoint.)
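
A rough sketch of what a %uHHHH encoder and decoder might look like is given
below.  It is not taken from any specification or existing implementation:
the function names encode_u and decode_u are hypothetical, the choice to
leave only letters, digits, and "-_.~" unescaped is an assumption, and
characters above U+FFFF are rejected because four hex digits cannot
represent them.

```python
import re
import string

# Characters passed through unescaped (an assumption for this sketch).
UNRESERVED = set(string.ascii_letters + string.digits + "-_.~")

# Matches one %uHHHH escape with exactly four hex digits.
_U_ESCAPE = re.compile(r"%u([0-9A-Fa-f]{4})")


def encode_u(text: str) -> str:
    """Escape every character outside UNRESERVED as %uHHHH, where HHHH
    is the Unicode codepoint in hex (hypothetical format sketch)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ch in UNRESERVED:
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append(f"%u{cp:04X}")
        else:
            # Four hex digits cannot hold codepoints above U+FFFF.
            raise ValueError(f"U+{cp:X} does not fit in %uHHHH")
    return "".join(out)


def decode_u(text: str) -> str:
    """Replace each %uHHHH escape with the codepoint it names."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


print(encode_u("café"))       # caf%u00E9
print(decode_u("caf%u00E9"))  # café
```

Because the escape names a codepoint rather than octets, the result does not
depend on the character set of the page that produced it.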

But whatever format is chosen, I think we need an explicit format, and I
don't see %HH ever being reinterpreted as UTF-8, because too much existing
web software would break.

Regards,
Paul Deuter
Plumtree Software
paul.deuter@plumtree.com

Received on Tuesday, 15 May 2001 21:48:54 UTC