Re: URLs and double byte characters (unicode)

* Ian Hickson wrote:
>> Using Mozilla I find that it encodes it utf-8 urls with a mixture
>> of single byte and double characters.
>
>Yes, it encodes the URI in UTF-8, which is a variable-byte-length
>encoding: characters in the range U+00000000 - U+0000007F are single byte,
>U+00000080 - U+000007FF are double byte, etc, up to U+03FFFFFF -
>U+7FFFFFFF, which have 6 bytes.

That's news to me. My Mozilla does the following: when I type the URI
http://localhost:99/björn into the address bar, the browser requests

  GET /bj%F6rn HTTP/1.1
  Host: localhost:99
  User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.2b)
    Gecko/20021016
  Accept: text/xml,application/xml,application/xhtml+xml,
    text/html;q=0.9,text/plain;q=0.8,video/x-mng,image/png,image/jpeg,
    image/gif;q=0.2,text/css,*/*;q=0.1
  Accept-Language: en-us, en;q=0.50
  Accept-Encoding: gzip, deflate, compress;q=0.9
  Accept-Charset: ISO-8859-1, utf-8;q=0.66, *;q=0.66
  Keep-Alive: 300
  Connection: keep-alive

That's ISO-8859-1 or a compatible encoding. Clicking on the link in the
following document

  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
  
  <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
      <meta content='text/html;charset=iso-8859-1'
            http-equiv='Content-Type'/>
      <title></title>
    </head>
    <body>
      <p><a href='http://localhost:99/björn'>...</a></p>
    </body>
  </html>
  
causes exactly the same request. It is worse for characters outside
Latin-1 such as the euro sign U+20AC: Mozilla omits the character from
the URI if no characters follow it, e.g.

  http://www.example.org/€   => GET /
  http://www.example.org/?€  => GET /?
  http://www.example.org/€xx => GET /%80xx

%80 is Windows-1252 or a compatible encoding; UTF-8 would have been
%E2%82%AC. What did I miss? Since which version does Mozilla use UTF-8?
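
To make the byte-level difference concrete, here is a minimal Python
sketch that percent-escapes the same paths under each charset mentioned
above. The helper name escape_path is hypothetical, and the sketch only
reproduces the octets seen in the requests; it does not model how any
browser chooses its charset.

```python
from urllib.parse import quote


def escape_path(path: str, charset: str) -> str:
    """Encode `path` in `charset`, then %HH-escape every byte that is
    not an unreserved ASCII character (the slash is kept as-is)."""
    return quote(path, safe="/", encoding=charset)


# What Mozilla sent for /björn: one octet per character (ISO-8859-1).
latin1_bjorn = escape_path("/björn", "iso-8859-1")

# What Mozilla sent for /€xx: 0x80 is the euro sign in Windows-1252.
cp1252_euro = escape_path("/€xx", "cp1252")

# What a UTF-8 based escaping would give instead.
utf8_bjorn = escape_path("/björn", "utf-8")
utf8_euro = escape_path("/€", "utf-8")

for label, value in [
    ("ISO-8859-1  /björn", latin1_bjorn),
    ("Windows-1252 /€xx ", cp1252_euro),
    ("UTF-8       /björn", utf8_bjorn),
    ("UTF-8       /€    ", utf8_euro),
]:
    print(label, "=>", value)
```

The UTF-8 results match the paths Internet Explorer 6 sends in the
examples further down.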

Internet Explorer 6, from the address bar, does

  http://www.example.org/€     => GET /%E2%82%AC
  http://www.example.org/björn => GET /bj%C3%B6rn
  http://www.example.org/?€    => GET /?€ (Windows-1252 encoded)

The same happens for links in documents encoded as Windows-1252 or a
compatible encoding. With the XHTML document above encoded as UTF-8,
the query part of the URI is sent as unescaped UTF-8 octets.

(This behaviour is documented in the Internet Explorer 5 readme.txt:
the path uses UTF-8 and the query is not escaped. Internet Explorer 5
Beta also used UTF-8 for the query part and escaped the octets
properly.)

Opera 6 always uses %hh-escaped UTF-8 URIs (the desired behaviour [1]);
Opera 5, IIRC, did not transcode or escape anything.
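
As I read appendix B.2.1, the recommendation is to represent each
non-ASCII character in UTF-8 and to escape every resulting byte as %HH.
A minimal Python sketch of that rule follows. It assumes that ASCII
characters, including any %hh escapes already present, are passed
through unchanged; escape_non_ascii is a hypothetical name, not any
browser's API.

```python
def escape_non_ascii(uri: str) -> str:
    """Replace each non-ASCII character with the %HH escapes of its
    UTF-8 bytes; leave every ASCII character untouched."""
    out = []
    for ch in uri:
        if ord(ch) < 128:
            # ASCII (including '?', '/', and existing '%hh') stays as-is.
            out.append(ch)
        else:
            # One %HH triplet per UTF-8 byte of the character.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)


print(escape_non_ascii("http://www.example.org/björn"))
print(escape_non_ascii("http://www.example.org/?€"))
```

Applied uniformly to path and query, this gives the same escaping for
both parts, unlike the Internet Explorer 6 requests shown above.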

Not sure how those browsers deal with URIs from <form method = GET>.

From my point of view there is no consistent behaviour among these
browsers. The situation gets worse if those browsers try to decode URIs,
especially for fragment identifiers; see

  http://www.w3.org/mid/3d9b65a0.3694262@smtp.bjoern.hoehrmann.de

[1] see appendix B.2.1 of HTML 4.01.

regards.

Received on Monday, 23 December 2002 07:57:07 UTC