- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Mon, 23 Dec 2002 13:57:25 +0100
- To: Ian Hickson <ian@hixie.ch>
- Cc: "www-talk@w3.org" <www-talk@w3.org>
* Ian Hickson wrote:
>> Using Mozilla I find that it encodes it utf-8 urls with a mixture
>> of single byte and double characters.
>
>Yes, it encodes the URI in UTF-8, which is a variable-byte-length
>encoding: characters in the range U+00000000 - U+0000007F are single byte,
>U+00000080 - U+000007FF are double byte, etc, up to U+03FFFFFF -
>U+7FFFFFFF, which have 6 bytes.
That's news to me. My Mozilla does the following: when I type the URI
http://localhost:99/björn into the address bar, the browser requests
GET /bj%F6rn HTTP/1.1
Host: localhost:99
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.2b)
Gecko/20021016
Accept: text/xml,application/xml,application/xhtml+xml,
text/html;q=0.9,text/plain;q=0.8,video/x-mng,image/png,image/jpeg,
image/gif;q=0.2,text/css,*/*;q=0.1
Accept-Language: en-us, en;q=0.50
Accept-Encoding: gzip, deflate, compress;q=0.9
Accept-Charset: ISO-8859-1, utf-8;q=0.66, *;q=0.66
Keep-Alive: 300
Connection: keep-alive
That's ISO-8859-1 or a compatible encoding. Clicking on the link in the
following document
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content='text/html;charset=iso-8859-1'
http-equiv='Content-Type'/>
<title></title>
</head>
<body>
<p><a href='http://localhost:99/björn'>...</a></p>
</body>
</html>
causes exactly the same request. It gets worse for characters outside
Latin-1 such as the euro sign, U+20AC: Mozilla omits the character from
the URI if nothing follows it, e.g.
http://www.example.org/€ => GET /
http://www.example.org/?€ => GET /?
http://www.example.org/€xx => GET /%80xx
%80 is Windows-1252 or a compatible encoding; UTF-8 would have been
%E2%82%AC. What did I miss? Since which version does Mozilla use UTF-8?
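To make the difference between these encodings concrete, here is a small Python sketch that percent-encodes a path after encoding it in a chosen charset. It is only an illustration of the byte values involved, not what any browser runs internally; the helper name escape_path is made up for this example.

```python
from urllib.parse import quote

def escape_path(path: str, charset: str) -> str:
    """Encode `path` in `charset`, then %HH-escape the non-ASCII octets.

    Hypothetical helper for illustration; '/' is left unescaped.
    """
    return quote(path, safe="/", encoding=charset, errors="strict")

# ISO-8859-1 gives the single octet that Mozilla sent above.
print(escape_path("/björn", "iso-8859-1"))
# Windows-1252 maps the euro sign to the single octet 0x80.
print(escape_path("/€xx", "cp1252"))
# UTF-8 needs two octets for "ö" and three for the euro sign.
print(escape_path("/björn", "utf-8"))
print(escape_path("/€", "utf-8"))

# ISO-8859-1 itself has no euro sign, so strict encoding fails.
try:
    escape_path("/€", "iso-8859-1")
except UnicodeEncodeError:
    print("U+20AC is not representable in ISO-8859-1")
```

The four successful calls reproduce the escapes seen in the requests in this message: %F6 and %80 for the legacy encodings, %C3%B6 and %E2%82%AC for UTF-8.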
Internet Explorer 6, from the address bar, does
http://www.example.org/€ => GET /%E2%82%AC
http://www.example.org/björn => GET /bj%C3%B6rn
http://www.example.org/?€ => GET /?€ (Windows-1252 encoded)
The same happens for links in documents encoded in Windows-1252 or a
compatible encoding. The XHTML document mentioned above, when encoded
in UTF-8, causes the URI to use unescaped UTF-8 octets for the query part.
(This behaviour is documented in the Internet Explorer 5 readme.txt: the
path uses UTF-8 and the query is not escaped. The Internet Explorer 5
Beta also used UTF-8 for the query part and escaped the octets properly.)
Opera 6 always uses %hh escaped UTF-8 URIs (the desired behaviour [1]);
Opera 5, IIRC, did not transcode or escape anything.
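As a rough sketch of the %hh escaped UTF-8 approach: represent each non-ASCII character as UTF-8 octets and escape each octet as %HH, leaving ASCII characters alone. The Python below is my own illustration of that idea, with a made-up function name (escape_non_ascii); it is not taken from any browser and does not handle ASCII characters that are themselves disallowed in URIs.

```python
def escape_non_ascii(uri: str) -> str:
    """Replace every non-ASCII character with its %HH-escaped UTF-8 octets.

    Hypothetical helper for illustration; ASCII characters, including
    reserved ones such as '?' and '/', are passed through unchanged.
    """
    out = []
    for ch in uri:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            # One %HH triplet per UTF-8 octet of the character.
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)

print(escape_non_ascii("http://localhost:99/björn"))
print(escape_non_ascii("http://www.example.org/?€"))
```

Applied uniformly to path and query, this would turn the /?€ case into /?%E2%82%AC rather than sending raw or legacy-encoded octets.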
Not sure how those browsers deal with URIs from <form method = GET>.
From my point of view, there is no consistent behaviour among these
browsers. The situation gets worse if those browsers try to decode URIs,
especially for fragment identifiers; see
http://www.w3.org/mid/3d9b65a0.3694262@smtp.bjoern.hoehrmann.de
[1] see appendix B.2.1 of HTML 4.01.
regards.
Received on Monday, 23 December 2002 07:57:07 UTC