Re: URLs and double byte characters (unicode) from Ian Hickson on 2002-12-23 (www-talk@w3.org from November to December 2002)

From: Ian Hickson <ian@hixie.ch>
Date: Mon, 23 Dec 2002 10:47:04 +0000 (GMT)
To: André-John Mas <ajmas@sympatico.ca>
Cc: "www-talk@w3.org" <www-talk@w3.org>
Message-ID: <Pine.LNX.4.21.0212230942110.22245-100000@dhalsim.dreamhost.com>

On Sun, 22 Dec 2002, André-John Mas wrote:
> 
> I have tried searching for documentation on URLs and double-byte
> characters, even searched this mailing-list, but could find
> nothing concrete.

As far as I know, there is no specification which explicitly allows (or
defines) codepoints outside the US-ASCII range in URIs. However, UAs
appear to have adopted a convention of encoding URIs using UTF-8.

> I have looked for some algorithms, but while they worked in the
> majority of cases failed in a few special cases:
> 
>    - %20%3A%22
>      -- is this a space followed by one double byte character, or
>      two single byte characters?

Assuming that is UTF-8:

        0x20       0x3A       0x22
  0b00100000 0b00111010 0b00100010

Since all three have their most significant bit set to 0, they are all
single byte characters, namely U+0020, U+003A, and U+0022. In US-ASCII,
those are a space, a colon, and a double quote character respectively.
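
The bit test above can be sketched in a few lines. This is only an
illustration, assuming Python; the helper names `escaped_bytes` and
`is_single_byte` are made up for this example and do not come from any
specification.

```python
import re

def escaped_bytes(text: str) -> bytes:
    """Return the byte values of all %XX escapes in text, in order."""
    return bytes(int(pair, 16) for pair in re.findall(r"%([0-9A-Fa-f]{2})", text))

def is_single_byte(byte: int) -> bool:
    """True when the most significant bit is 0, so the byte is a complete
    UTF-8 character on its own."""
    return (byte & 0x80) == 0

# Print each escaped byte in hex and binary, with the result of the bit test.
for byte in escaped_bytes("%20%3A%22"):
    print(f"0x{byte:02X} 0b{byte:08b} single-byte={is_single_byte(byte)}")
```

Any byte for which the test is false would be part of a multi-byte
sequence instead.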

>    - %3A%20%22
>      -- single byte character, space, single byte character OR
>      double byte character, single byte character OR single
>      byte character, double byte character?

Same characters, in a different order.

> Using Mozilla I find that it encodes URLs in UTF-8 with a mixture
> of single byte and double byte characters.

Yes, it encodes the URI in UTF-8, which is a variable-byte-length
encoding: characters in the range U+00000000 - U+0000007F are single byte,
U+00000080 - U+000007FF are double byte, etc, up to U+04000000 -
U+7FFFFFFF, which have 6 bytes.

For more information on the UTF-8 encoding algorithm, see ISO/IEC 10646-1
Annex R (Amendment 2).
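
As a rough sketch of those ranges, assuming Python and the 1-to-6 byte
scheme described above (the function name `utf8_length` is hypothetical):

```python
def utf8_length(codepoint: int) -> int:
    """Number of bytes needed for a codepoint under the 1-to-6 byte
    UTF-8 scheme; each entry is the largest codepoint for that length."""
    upper_limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    if codepoint < 0:
        raise ValueError("codepoint must be non-negative")
    for length, limit in enumerate(upper_limits, start=1):
        if codepoint <= limit:
            return length
    raise ValueError("codepoint is above U+7FFFFFFF")

# Walk the boundary of each range and show the byte count.
for cp in (0x20, 0x80, 0x800, 0x10000, 0x200000, 0x4000000):
    print(f"U+{cp:08X}: {utf8_length(cp)} byte(s)")
```

Note that Python's own UTF-8 encoder only accepts codepoints up to
U+10FFFF, so the 5 and 6 byte cases can only be computed this way, not
produced with `str.encode`.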

> For example, a space will be represented as %20, any reserved ASCII
> character will use a single byte %xx value, but anything in chinese
> will be defined using a double byte %xx%yy value.

Actually most CJK characters take three bytes in a UTF-8 encoding.
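
A quick way to check this for a given character, assuming Python and its
standard `urllib.parse.quote` (which percent-encodes using UTF-8 by
default); the two sample characters are my own choice:

```python
from urllib.parse import quote

# U+4E2D (a common Han character) and U+3042 (hiragana "a").
for ch in ("中", "あ"):
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: {len(encoded)} bytes -> {quote(ch)}")
```

Both fall in the U+0800 - U+FFFF range, hence three %xx escapes each.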

> This makes it very difficult to parse a URL.

Assuming your system has UTF-8 APIs, as most systems do now, then it is as
easy as converting the escapes into bytes, then treating the string as
UTF-8 using the native API.
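
Those two steps might look like the following. This is a minimal sketch,
assuming Python's standard `urllib.parse.unquote_to_bytes` and assuming
the sender really did use UTF-8; the function name `decode_escaped` is
made up for the example.

```python
from urllib.parse import unquote_to_bytes

def decode_escaped(component: str) -> str:
    """Decode a percent-escaped URI component, assuming UTF-8."""
    raw = unquote_to_bytes(component)  # step 1: turn %XX escapes into bytes
    # Step 2: let the native decoder work out the character boundaries.
    # This raises UnicodeDecodeError if the bytes are not valid UTF-8.
    return raw.decode("utf-8")

print(decode_escaped("%E4%B8%AD%20%3A%22"))
```

The decoder decides where each character starts and ends, so the caller
never has to guess whether a given %xx is a whole character or part of
one. If the bytes turn out not to be valid UTF-8, the exception is a
sign that the sender used some other encoding.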

> An RFC would be nice, so at least I know I am dealing with the same
> solution with all modern web browsers.

I believe (although I may be wrong) that it is currently undefined, but
that typically UAs use UTF-8. MSDN may document IE's behaviour. I suggest
experimenting.

-- 
Ian Hickson                                      )\._.,--....,'``.    fL
"meow"                                          /,   _.. \   _\  ;`._ ,.
http://index.hixie.ch/                         `._.-(,_..'--(,_..'`-.;.'

Received on Monday, 23 December 2002 05:47:06 UTC