RE: Unicode in HTTP streams

My proposal concerns all parts of an HTTP stream - although I would not
expect any software to use the HTTP encoding inside HTML (since HTML already
has a way to encode Unicode characters).

See below for a snippet of a sniffer trace on a very simple test using ASP.
You will see that the %HH format can appear in a cookie (i.e. an HTTP
header) or a query string (i.e. the part after the "?").
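
To make the three forms in the trace easier to compare, here is a small
Python sketch. It assumes the test page was served in Shift_JIS (the trace
does not state the page charset; this is inferred from %82%DC and %u307E
agreeing on the same character under that assumption). The helper name
percent_u is made up for illustration.

```python
from urllib.parse import quote, unquote

CHAR = "\u307e"  # HIRAGANA LETTER MA, the character used in the test cases


def percent_u(text):
    """Format each code point as %uHHHH (hypothetical helper, BMP only)."""
    return "".join("%u{:04X}".format(ord(c)) for c in text)


# Assumption: the page charset is Shift_JIS.
shift_jis_form = quote(CHAR, encoding="shift_jis")  # octets of the page charset
utf8_form = quote(CHAR, encoding="utf-8")           # octets of UTF-8
unicode_form = percent_u(CHAR)                       # the code point itself

print(shift_jis_form, utf8_form, unicode_form)

# The %HH octets only decode correctly if the receiver knows which
# charset the sender used; the same octets read as UTF-8 are garbage.
print(unquote("%82%DC", encoding="shift_jis") == CHAR)  # charset matches
print(unquote("%82%DC", encoding="utf-8") == CHAR)      # charset mismatch
```

The %uHHHH form is the only one of the three that can be read without
knowing the charset of the stream.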

HTTP is being used for a lot of different purposes these days and it is not
uncommon for a gateway to intercept an HTTP stream and extract or stuff data
into it.  If the gateway keeps its own data in Unicode (as all good software
should), then the gateway will naturally want a way to add the data to the
stream without having to determine the character set of the stream, and
without a character set conversion that could lose data.

Cookies are a good example.  The user agent usually does not interpret them
but rather just sends them back in the same format.  In this case, it is
very handy to be able to insert data into the stream in Unicode regardless
of the character set of the stream.
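
As a rough sketch of what such a gateway might do with a cookie value, the
following Python writes non-ASCII characters as %uHHHH and reads back a
value that mixes %uHHHH with ordinary %HH octets. The function names
encode_value and decode_value are hypothetical, the charset argument stands
for whatever the two ends have agreed on for %HH, and characters outside
the BMP are simply rejected here.

```python
import re
import string
from urllib.parse import unquote_to_bytes

# Characters passed through unescaped (an assumption for this sketch).
SAFE = set(string.ascii_letters + string.digits + "-_.~")

# Either one %uHHHH escape or a run of %HH octet escapes.
_ESCAPE = re.compile(r"%u([0-9A-Fa-f]{4})|((?:%[0-9A-Fa-f]{2})+)")


def encode_value(text):
    """Escape text so that non-ASCII characters do not depend on a charset."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ch in SAFE:
            out.append(ch)
        elif cp < 0x80:
            out.append("%{:02X}".format(cp))   # plain ASCII octet
        elif cp <= 0xFFFF:
            out.append("%u{:04X}".format(cp))  # Unicode code point
        else:
            raise ValueError("code points above U+FFFF are not handled")
    return "".join(out)


def decode_value(value, charset):
    """Decode one level of escaping; %HH runs are interpreted in charset."""
    def repl(match):
        if match.group(1) is not None:
            return chr(int(match.group(1), 16))
        octets = unquote_to_bytes(match.group(2))
        return octets.decode(charset, errors="replace")
    return _ESCAPE.sub(repl, value)


print(encode_value("a=\u307e"))
print(decode_value("%25u307E", "shift_jis"))
```

Decoding is done in a single pass so that a doubly escaped value such as
Case3 in the trace comes back as the literal text %u307E rather than being
decoded twice.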

-Paul

SNIFFER SNIPPET:
================

POST /s_charset.asp?Case8='Ü&Case9=%82%DC&Case10=%u307E&Case11=%E3%81%BE HTTP/1.1
Accept: application/vnd.ms-powerpoint, application/vnd.ms-excel,
application/msword, image/gif, image/x-xbitmap, image/jpeg, image/pjpeg,
application/x-comet, */*
Referer: http://intl-portal/s_charset2.asp
Accept-Language: es,ko;q=0.92,zh-tw;q=0.83,it;q=0.75,en-us;q=0.67,fr;q=0.58,de;q=0.50,no;q=0.42,ja;q=0.33,no;q=0.25,ru;q=0.17,pt;q=0.08
Content-Type: application/x-www-form-urlencoded
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0)
Host: intl-portal
Content-Length: 13
Connection: Keep-Alive
Cookie: Case4=%u307E; Case5='Ü; Case6=%82%DC; Case7=%E3%81%BE;
Case3=%25u307E; Case2=%2582%25DC; Case1=%3F;
ASPSESSIONIDQQGGGNHC=JHBFIHCANHIKOCBHNBFMNGOA

Case12=%82%DCHTTP/1.1 100 Continue
Server: Microsoft-IIS/5.0
Date: Fri, 04 May 2001 22:51:21 GMT


Paul Deuter
Internationalization Manager
Plumtree Software
paul.deuter@plumtree.com 
 


-----Original Message-----
From: Albert-Lunde@northwestern.edu
[mailto:Albert-Lunde@northwestern.edu]
Sent: Tuesday, May 15, 2001 8:46 PM
To: www-international@w3.org
Subject: Re: Unicode in HTTP streams


> Some recent proposals suggest that to encode a character as Unicode, first
> convert to UTF-8 and then format each octet as %HH and send it out.  My
> experience with query strings, cookies, and form data is that user agents
> do not encode first in UTF-8 before formatting octets as %HH.  Rather I
> have found that the %HH format is context sensitive and is an agreement
> between the sender and the receiver.  Only when a page is specifically
> sent down to a user agent in UTF-8 will the user agent return data in the
> %HH format in UTF-8.  Since most HTML pages are still in character sets
> other than UTF-8, this means that the usage of the %HH format to mean
> UTF-8 is quite rare.
[...]
> Rather it seems to me that what is needed is a new HTTP encoding that
> explicitly indicates a Unicode codepoint analogous to the &#xHHHH; format
> that was invented for this very purpose for HTML.  In my investigations, I

Are you talking about the encoding of a URL on the method line
of an HTTP request, the encoding of a request body, or the encoding
of a response body? These aren't always the same thing in theory
or practice. It _sounds_ like you are talking about the encoding
of URLs.

--
    Albert Lunde          Albert-Lunde@northwestern.edu (new address)
                          Albert-Lunde@nwu.edu (old address)

Received on Wednesday, 16 May 2001 00:06:54 UTC