- From: Paul Deuter <Paul.Deuter@plumtree.com>
- Date: Tue, 15 May 2001 21:04:52 -0700
- To: "'Albert-Lunde@northwestern.edu'" <Albert-Lunde@northwestern.edu>, www-international@w3.org
My proposal concerns all parts of an HTTP stream - although I would not expect any software to use the HTTP encoding inside HTML (since HTML already has a way to encode Unicode characters).

See below for a snippet of a sniffer trace on a very simple test using ASP. You will see that the %HH format can appear in a cookie (i.e. an HTTP header) or a query string (i.e. the part after the "?").

HTTP is being used for a lot of different purposes these days, and it is not uncommon for a gateway to intercept an HTTP stream and extract or stuff data into it. If the gateway keeps its own data in Unicode (as all good software should), then the gateway will naturally want a way to add the data to the stream without having to determine what the character set of the stream is, and without a character set conversion that could lose data.

Cookies are a good example. The user agent usually does not interpret them but rather just sends them back in the same format. In this case, it is very handy to be able to insert data into the stream in Unicode regardless of the character set of the stream.

-Paul

SNIFFER SNIPPET:
================
POST /s_charset.asp?Case8='Ü&Case9=%82%DC&Case10=%u307E&Case11=%E3%81%BE HTTP/1.1
Accept: application/vnd.ms-powerpoint, application/vnd.ms-excel, application/msword, image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-comet, */*
Referer: http://intl-portal/s_charset2.asp
Accept-Language: es,ko;q=0.92,zh-tw;q=0.83,it;q=0.75,en-us;q=0.67,fr;q=0.58,de;q=0.50,no;q=0.42,ja;q=0.33,no;q=0.25,ru;q=0.17,pt;q=0.08
Content-Type: application/x-www-form-urlencoded
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0)
Host: intl-portal
Content-Length: 13
Connection: Keep-Alive
Cookie: Case4=%u307E; Case5='Ü; Case6=%82%DC; Case7=%E3%81%BE; Case3=%25u307E; Case2=%2582%25DC; Case1=%3F; ASPSESSIONIDQQGGGNHC=JHBFIHCANHIKOCBHNBFMNGOA

Case12=%82%DCHTTP/1.1 100 Continue
Server: Microsoft-IIS/5.0
Date: Fri, 04 May 2001 22:51:21 GMT

Paul Deuter
Internationalization Manager
Plumtree Software
paul.deuter@plumtree.com

-----Original Message-----
From: Albert-Lunde@northwestern.edu [mailto:Albert-Lunde@northwestern.edu]
Sent: Tuesday, May 15, 2001 8:46 PM
To: www-international@w3.org
Subject: Re: Unicode in HTTP streams

> Some recent proposals suggest that to encode a character as Unicode, first
> convert to UTF-8 and then format each octet as %HH and send it out. My
> experience with query strings, cookies, and form data is that user agents do
> not encode first in UTF-8 before formatting octets as %HH. Rather I have
> found that the %HH format is context sensitive and is an agreement between
> the sender and the receiver. Only when a page is specifically sent down to
> a user agent in UTF-8, will the user agent return data in the %HH format in
> UTF-8. Since most HTML pages are still in character sets other than UTF-8,
> this means that the usage of the %HH format to mean UTF-8 is quite rare.
[...]
> Rather it seems to me that what is needed is a new HTTP encoding that
> explicitly indicates a Unicode codepoint analogous to the &#xHHHH; format
> that was invented for this very purpose for HTML. In my investigations, I

Are you talking about the encoding of a URL on the method line of an HTTP request, the encoding of a request body, or the encoding of a response body? These aren't always the same thing in theory or practice.

It _sounds_ like you are talking about the encoding of URLs.

--
Albert Lunde
Albert-Lunde@northwestern.edu (new address)
Albert-Lunde@nwu.edu (old address)
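
For readers following the trace, here is a minimal Python sketch of how three of the escaped values (Case9, Case10, Case11) can be read. It assumes that %82%DC carries Shift_JIS octets and that %E3%81%BE carries UTF-8 octets; the trace itself does not label either charset. The helper names decode_percent and decode_percent_u are hypothetical and are not part of any software mentioned in the thread.

```python
import re
from urllib.parse import unquote_to_bytes


def decode_percent(value, charset):
    """Turn %HH escapes back into octets, then interpret them in `charset`.

    The charset is not carried by the escapes themselves, so the caller
    has to know it out of band (hypothetical helper for illustration).
    """
    return unquote_to_bytes(value).decode(charset)


def decode_percent_u(value):
    """Replace each non-standard %uHHHH escape with the code point it names."""
    return re.sub(
        r"%u([0-9A-Fa-f]{4})",
        lambda match: chr(int(match.group(1), 16)),
        value,
    )


# Assumption: Case9 was produced from Shift_JIS octets.
from_shift_jis = decode_percent("%82%DC", "shift_jis")
# Assumption: Case11 was produced from UTF-8 octets.
from_utf8 = decode_percent("%E3%81%BE", "utf-8")
# Case10 names the code point directly, so no charset is needed.
from_code_point = decode_percent_u("%u307E")

# The same %HH octets read under the wrong charset do not round-trip.
try:
    decode_percent("%82%DC", "utf-8")
    wrong_charset_failed = False
except UnicodeDecodeError:
    wrong_charset_failed = True
```

Under those assumptions all three values decode to the same character, U+307E, while reading the Shift_JIS octets as UTF-8 raises an error. This is the context sensitivity of %HH that the message describes: only the %uHHHH form can be decoded without knowing the charset of the stream.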
Received on Wednesday, 16 May 2001 00:06:54 UTC