
RE: Unicode in HTTP streams

From: Paul Deuter <Paul.Deuter@plumtree.com>
Date: Tue, 15 May 2001 21:04:52 -0700
Message-ID: <7CF2C62D332FD4118A4800B0D02165B901B05EDA@plumexchange1.plumtree.com>
To: "'Albert-Lunde@northwestern.edu'" <Albert-Lunde@northwestern.edu>, www-international@w3.org
My proposal concerns all parts of an HTTP stream - although I would not
expect any software to use the HTTP encoding inside HTML (since HTML already
has a way to encode Unicode characters).

See below for a snippet of a sniffer trace on a very simple test using ASP.
You will see that the %HH format can appear in a cookie (i.e. an HTTP
header) or a query string (i.e. the part after the "?").

HTTP is being used for a lot of different purposes these days and it is not
uncommon for a gateway to intercept an HTTP stream and extract or stuff data into
it.  If the gateway keeps its own data in Unicode (as all good software
should) - then the gateway will naturally want a way to add the data to the
stream without having to determine what the character set of the stream is
and also to avoid the character set conversion (to avoid potential loss of
data).

Cookies are a good example.  The user agent usually does not interpret them
but rather just sends them back in the same format.  In this case, it is
very handy to be able to insert data into the stream in Unicode regardless
of the character set of the stream.
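
As a rough illustration of what such a gateway might do, here is a minimal
Python sketch that writes non-ASCII characters in the %uHHHH form (as seen in
the Case4 cookie) and reads them back. The function names are hypothetical,
and the choice of one escape per UTF-16 code unit is an assumption rather than
part of any specification. Literal "%" characters in the input are not
escaped here, which a real implementation would need to handle.

```python
import re

# Matches one %uHHHH escape (four hex digits).
_PERCENT_U = re.compile(r"%u([0-9A-Fa-f]{4})")


def encode_percent_u(text: str) -> str:
    """Escape each non-ASCII character as %uHHHH, one escape per UTF-16 code unit.

    ASCII characters (including a literal '%') are passed through unchanged.
    """
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)
            continue
        units = ch.encode("utf-16-be")  # 2 bytes, or 4 for a surrogate pair
        for i in range(0, len(units), 2):
            out.append("%u{:04X}".format(int.from_bytes(units[i:i + 2], "big")))
    return "".join(out)


def decode_percent_u(value: str) -> str:
    """Replace every %uHHHH escape with the character it names."""
    raw = _PERCENT_U.sub(lambda m: chr(int(m.group(1), 16)), value)
    # Recombine surrogate pairs that were written as two separate escapes.
    return raw.encode("utf-16-be", "surrogatepass").decode("utf-16-be")


print(encode_percent_u("Case4=\u307e"))
print(decode_percent_u("Case4=%u307E") == "Case4=\u307e")
```

Because the escape names a Unicode code point directly, neither function needs
to know the character set of the surrounding stream.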



POST /s_charset.asp?Case8='&Case9=%82%DC&Case10=%u307E&Case11=%E3%81%BE
Accept: application/vnd.ms-powerpoint, application/vnd.ms-excel,
application/msword, image/gif, image/x-xbitmap, image/jpeg, image/pjpeg,
application/x-comet, */*
Referer: http://intl-portal/s_charset2.asp
Content-Type: application/x-www-form-urlencoded
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0)
Host: intl-portal
Content-Length: 13
Connection: Keep-Alive
Cookie: Case4=%u307E; Case5='; Case6=%82%DC; Case7=%E3%81%BE;
Case3=%25u307E; Case2=%2582%25DC; Case1=%3F;

Case12=%82%DC

HTTP/1.1 100 Continue
Server: Microsoft-IIS/5.0
Date: Fri, 04 May 2001 22:51:21 GMT
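
The escaped values in the trace above all appear to derive from a single
character, U+307E (HIRAGANA LETTER MA). The following Python sketch, using only
the standard library, reproduces them under that assumption; the variable
names are illustrative.

```python
from urllib.parse import quote

ch = "\u307e"  # HIRAGANA LETTER MA

# %HH applied to the Shift_JIS bytes of the character (Case6, Case9, Case12).
shift_jis_form = quote(ch, encoding="shift_jis")

# %HH applied to the UTF-8 bytes of the character (Case7, Case11).
utf8_form = quote(ch, encoding="utf-8")

# %uHHHH naming the Unicode code point directly (Case4, Case10).
percent_u_form = "%u{:04X}".format(ord(ch))

# Escaping an already-escaped value turns each "%" into "%25" (Case3, Case2).
double_escaped_u = quote(percent_u_form)
double_escaped_sjis = quote(shift_jis_form)

for label, value in [
    ("Shift_JIS %HH", shift_jis_form),
    ("UTF-8 %HH", utf8_form),
    ("%uHHHH", percent_u_form),
    ("escaped %uHHHH", double_escaped_u),
    ("escaped Shift_JIS %HH", double_escaped_sjis),
]:
    print(f"{label}: {value}")
```

The two %HH forms cannot be told apart without knowing which character set the
sender used, whereas the %uHHHH form is unambiguous on its own.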

Paul Deuter
Internationalization Manager
Plumtree Software

-----Original Message-----
From: Albert-Lunde@northwestern.edu
Sent: Tuesday, May 15, 2001 8:46 PM
To: www-international@w3.org
Subject: Re: Unicode in HTTP streams

> Some recent proposals suggest that to encode a character as Unicode, first
> convert to UTF-8 and then format each octet as %HH and send it out.  My
> experience with query strings, cookies, and form data is that user agents
> do not encode first in UTF-8 before formatting octets as %HH.  Rather I have
> found that the %HH format is context sensitive and is an agreement between
> the sender and the receiver.  Only when a page is specifically sent down
> to a user agent in UTF-8, will the user agent return data in the %HH format
> as UTF-8.  Since most HTML pages are still in character sets other than
> UTF-8, this means that the usage of the %HH format to mean UTF-8 is quite rare.
> Rather it seems to me that what is needed is a new HTTP encoding that
> explicitly indicates a Unicode codepoint analogous to the &#xHHHH; format
> that was invented for this very purpose for HTML.  In my investigations,

Are you talking about the encoding of a URL on the method line
of an HTTP request, the encoding of a request body, or the encoding
of a response body? These aren't always the same thing in theory
or practice. It _sounds_ like you are talking about the encoding
of URLs.

    Albert Lunde          Albert-Lunde@northwestern.edu (new address)
                          Albert-Lunde@nwu.edu (old address)
Received on Wednesday, 16 May 2001 00:06:54 UTC
