Re: URLs and double byte characters (unicode)

* Ian Hickson wrote:
>> Using Mozilla I find that it encodes it utf-8 urls with a mixture
>> of single byte and double characters.
>Yes, it encodes the URI in UTF-8, which is a variable-byte-length
>encoding: characters in the range U+00000000 - U+0000007F are single byte,
>U+00000080 - U+000007FF are double byte, etc, up to U+03FFFFFF -
>U+7FFFFFFF, which have 6 bytes.

That's news to me. My Mozilla does the following: when I type the URI
http://localhost:99/björn into the address bar, the browser requests

  GET /bj%F6rn HTTP/1.1
  Host: localhost:99
  User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.2b)
  Accept: text/xml,application/xml,application/xhtml+xml,
  Accept-Language: en-us, en;q=0.50
  Accept-Encoding: gzip, deflate, compress;q=0.9
  Accept-Charset: ISO-8859-1, utf-8;q=0.66, *;q=0.66
  Keep-Alive: 300
  Connection: keep-alive

That's ISO-8859-1 or a compatible encoding. Clicking on the link in the
following document

  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  <html xmlns="">
      <meta content='text/html;charset=iso-8859-1'
      <p><a href='http://localhost:99/björn'>...</a></p>
causes exactly the same request. This is worse for non-Latin1
characters such as the euro sign U+20AC: Mozilla omits it from
the URI if there are no following characters, e.g.

  => GET /
  => GET /?€xx
  => GET /%80xx

%80 is Windows-1252 or a compatible encoding; UTF-8 would have been
%E2%82%AC. What did I miss? Since which version does Mozilla use UTF-8?
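
To make the difference concrete, here is a small Python sketch (not part
of the tests above; the helper name escape_path is made up for
illustration) that percent-encodes the same paths after converting them
to each of the encodings in question:

```python
from urllib.parse import quote

def escape_path(path: str, encoding: str) -> str:
    """Convert the path to the given charset, then %HH-escape the octets."""
    # quote() leaves "/" and unreserved ASCII characters alone and escapes
    # every other octet of the encoded string as %HH.
    return quote(path, safe="/", encoding=encoding)

# "ö" is one octet in ISO-8859-1 and Windows-1252, two octets in UTF-8.
for enc in ("iso-8859-1", "windows-1252", "utf-8"):
    print(enc, escape_path("/björn", enc))

# The euro sign does not exist in ISO-8859-1, so only the other two are shown.
print("windows-1252", escape_path("/€xx", "windows-1252"))
print("utf-8", escape_path("/€xx", "utf-8"))
```

The ISO-8859-1 and Windows-1252 results for /björn are identical, which
is why the request alone cannot tell the two apart; only the euro sign
gives it away.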

Internet Explorer 6, from the address bar, does

  => GET /%E2%82%AC
  örn => GET /bj%C3%B6rn
  => GET /?€ (Windows-1252 encoded)

Same for Windows-1252 or compatible encoded documents. The mentioned
XHTML document in UTF-8 causes the URI to use unescaped UTF-8 octets
for the query part.

(This behaviour is documented in the Internet Explorer 5 readme.txt: the
path uses UTF-8 and the query is not escaped. Internet Explorer 5 Beta
also used UTF-8 for the query part and escaped the octets properly.)

Opera 6 always uses %hh escaped UTF-8 URIs (the desired behaviour [1]),
Opera 5 IIRC did not transcode or escape anything.
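
For reference, the escaping that [1] recommends can be sketched in a few
lines of Python. This is only my reading of the recommendation, and the
function name escape_non_ascii is hypothetical: each non-ASCII character
is converted to its UTF-8 octets, each octet is written as %HH, and
everything that is already ASCII (including existing %HH escapes) is
left untouched.

```python
def escape_non_ascii(uri: str) -> str:
    """Replace each non-ASCII character by its %HH-escaped UTF-8 octets."""
    out = []
    for ch in uri:
        if ord(ch) < 0x80:
            # ASCII, including "%", "?" and "#", is passed through as-is.
            out.append(ch)
        else:
            # One %HH triplet per UTF-8 octet of the character.
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)

print(escape_non_ascii("http://localhost:99/björn"))
print(escape_non_ascii("http://localhost:99/?€"))
```

Unlike what I see in the browsers above, this treats the path and the
query part the same way.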

Not sure how those browsers deal with URIs from <form method = GET>.

From my point of view, there is no consistent behaviour among these
browsers. The situation gets worse if those browsers try to decode URIs,
especially for fragment identifiers, see

[1] see appendix B.2.1 of HTML 4.01.


Received on Monday, 23 December 2002 07:57:07 UTC