- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Mon, 23 Dec 2002 13:57:25 +0100
- To: Ian Hickson <ian@hixie.ch>
- Cc: "www-talk@w3.org" <www-talk@w3.org>
* Ian Hickson wrote:
>> Using Mozilla I find that it encodes it utf-8 urls with a mixture
>> of single byte and double characters.
>
>Yes, it encodes the URI in UTF-8, which is a variable-byte-length
>encoding: characters in the range U+00000000 - U+0000007F are single byte,
>U+00000080 - U+000007FF are double byte, etc, up to U+03FFFFFF -
>U+7FFFFFFF, which have 6 bytes.

That's news to me. My Mozilla does the following. Typing the URI

  http://localhost:99/björn

into the address bar, the browser requests

  GET /bj%F6rn HTTP/1.1
  Host: localhost:99
  User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.2b)
    Gecko/20021016
  Accept: text/xml,application/xml,application/xhtml+xml,
    text/html;q=0.9,text/plain;q=0.8,video/x-mng,image/png,image/jpeg,
    image/gif;q=0.2,text/css,*/*;q=0.1
  Accept-Language: en-us, en;q=0.50
  Accept-Encoding: gzip, deflate, compress;q=0.9
  Accept-Charset: ISO-8859-1, utf-8;q=0.66, *;q=0.66
  Keep-Alive: 300
  Connection: keep-alive

That's ISO-8859-1 or a compatible encoding. Clicking on the link in the
following document

  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
  <html xmlns="http://www.w3.org/1999/xhtml">
  <head>
  <meta content='text/html;charset=iso-8859-1' http-equiv='Content-Type'/>
  <title></title>
  </head>
  <body>
  <p><a href='http://localhost:99/björn'>...</a></p>
  </body>
  </html>

causes exactly the same request. This is worse for non-Latin1
characters such as the euro sign U+20AC: Mozilla omits it from the URI
if there are no following characters, e.g.

  http://www.example.org/€   => GET /
  http://www.example.org/?€  => GET /?
  http://www.example.org/€xx => GET /%80xx

%80 is Windows-1252 or a compatible encoding; UTF-8 would have been
%E2%82%AC. What did I miss? Since which version does Mozilla use UTF-8?
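For comparison, the escapes quoted above can be reproduced with a minimal Python sketch. This is only an illustration of how the chosen charset determines the %HH octets, not a model of what any browser does internally; the helper name escape_path is made up for this example.

```python
from urllib.parse import quote


def escape_path(path: str, charset: str) -> str:
    """Percent-encode a URI path after converting it to octets in `charset`.

    Hypothetical helper for illustration: quote() keeps "/" and unreserved
    ASCII characters as they are and writes every other octet as %HH.
    """
    return quote(path, safe="/", encoding=charset)


if __name__ == "__main__":
    for path in ("/björn", "/€xx"):
        for charset in ("iso-8859-1", "windows-1252", "utf-8"):
            try:
                escaped = escape_path(path, charset)
            except UnicodeEncodeError:
                # U+20AC has no code point in ISO-8859-1, so strict
                # encoding fails instead of silently dropping the character.
                escaped = "(not representable)"
            print(f"{path}  {charset}  {escaped}")
```

With the ISO-8859-1 and Windows-1252 charsets the ö becomes the single octet %F6, with UTF-8 it becomes %C3%B6; the euro sign becomes %80 in Windows-1252 and %E2%82%AC in UTF-8, matching the requests shown above.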
Internet Explorer 6, from the address bar, does

  http://www.example.org/€     => GET /%E2%82%AC
  http://www.example.org/björn => GET /bj%C3%B6rn
  http://www.example.org/?€    => GET /?€ (Windows-1252 encoded)

Same for Windows-1252 or compatible encoded documents. The XHTML
document mentioned above, when encoded in UTF-8, causes the URI to use
unescaped UTF-8 octets for the query part. (This behaviour is
documented in the Internet Explorer 5 readme.txt: the path uses UTF-8,
the query is not escaped. Internet Explorer 5 Beta also used UTF-8 for
the query part and escaped the octets properly.)

Opera 6 always uses %hh escaped UTF-8 URIs (the desired behaviour [1]);
Opera 5, IIRC, did not transcode or escape anything. Not sure how those
browsers deal with URIs from <form method = GET>.

There is, from my point of view, no consistent behaviour among these
browsers. The situation gets worse if those browsers try to decode
URIs, especially for fragment identifiers, see

  http://www.w3.org/mid/3d9b65a0.3694262@smtp.bjoern.hoehrmann.de

[1] see appendix B.2.1 of HTML 4.01.

regards.
Received on Monday, 23 December 2002 07:57:07 UTC