- From: Ian Hickson <ian@hixie.ch>
- Date: Mon, 23 Dec 2002 10:47:04 +0000 (GMT)
- To: André-John Mas <ajmas@sympatico.ca>
- Cc: "www-talk@w3.org" <www-talk@w3.org>
On Sun, 22 Dec 2002, André-John Mas wrote:
>
> I have tried searching for documentation on URLs and double-byte
> characters, even searched this mailing-list, but could find
> nothing concrete.
As far as I know, there is no specification which explicitly allows (or
defines) codepoints outside the US-ASCII range in URIs. However, UAs appear
to have adopted a convention of encoding URIs using UTF-8.
> I have looked for some algorithms, but while they worked in the
> majority of cases failed in a few special cases:
>
> - %20%3A%22
> -- is this a space followed by one double byte character, or
> two single byte characters?
Assuming that is UTF-8:
0x20 0x3A 0x22
0b00100000 0b00111010 0b00100010
Since all three have their most significant bit set to 0, they are all
single byte characters, namely U+0020, U+003A, and U+0022. That, in
US-ASCII, is a space, a colon, and a double quote character respectively.
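To make the bit test concrete, here is a minimal sketch in Python (my choice of language; the name is_single_byte is made up for illustration). It checks the most significant bit of each of the three bytes and then decodes them as UTF-8:

```python
# The three bytes behind the escapes %20 %3A %22.
data = bytes([0x20, 0x3A, 0x22])

def is_single_byte(b: int) -> bool:
    # In UTF-8, a byte whose most significant bit is 0 stands for one
    # character in the range U+0000 to U+007F all by itself.
    return (b & 0x80) == 0

for b in data:
    # Show the hex value, the bit pattern, and the result of the test.
    print(f"0x{b:02X} 0b{b:08b} single byte: {is_single_byte(b)}")

# Decoding the bytes as UTF-8 gives a space, a colon, and a double quote.
decoded = data.decode("utf-8")
```

The same test applies to %3A%20%22: every byte has its high bit clear, so each one is a character on its own and there is no ambiguity.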
> - %3A%20%22
> -- single byte character, space, single byte character OR
> double byte character, single byte character OR single
> byte character, double byte character?
Same characters, in a different order.
> Using Mozilla I find that it encodes UTF-8 URLs with a mixture
> of single byte and double byte characters.
Yes, it encodes the URI in UTF-8, which is a variable-byte-length
encoding: characters in the range U+00000000 - U+0000007F are single byte,
U+00000080 - U+000007FF are double byte, etc, up to U+04000000 -
U+7FFFFFFF, which have 6 bytes.
For more information on the UTF-8 encoding algorithm, see ISO/IEC 10646-1
Annex R (Amendment 2).
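As a rough illustration of those ranges, the sketch below maps a code point to the number of bytes its UTF-8 form needs. The function name utf8_length is hypothetical, and the table assumes the full six-byte form of UTF-8 described above, which goes beyond what Python's own encoder will produce:

```python
def utf8_length(cp: int) -> int:
    """Return how many bytes code point cp needs in UTF-8 (1 to 6)."""
    # Upper bound of each length class, assuming the six-byte definition:
    # 1 byte up to U+007F, 2 up to U+07FF, 3 up to U+FFFF,
    # 4 up to U+001FFFFF, 5 up to U+03FFFFFF, 6 up to U+7FFFFFFF.
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    if cp < 0:
        raise ValueError("code point must be non-negative")
    for length, limit in enumerate(limits, start=1):
        if cp <= limit:
            return length
    raise ValueError("code point out of range")

# Cross-check against the native encoder for a few code points it accepts.
for cp in (0x20, 0xE9, 0x4E2D, 0x1F600):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
```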
> For example, a space will be represented as %20, any reserved ASCII
> character will use a single byte %xx value, but anything in chinese
> will be defined using a double byte %xx%yy value.
Actually most CJK characters take three bytes in a UTF-8 encoding.
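For instance, taking U+4E2D as an arbitrary example of a CJK ideograph (my pick, not one from the original question), a short Python sketch shows the three bytes and the escapes they produce:

```python
from urllib.parse import quote

ch = "\u4e2d"  # U+4E2D, a common CJK ideograph, chosen as an example

# The UTF-8 form is three bytes long: 0xE4 0xB8 0xAD.
encoded = ch.encode("utf-8")

# quote() percent-encodes using UTF-8 by default, giving three escapes.
escaped = quote(ch)
print(len(encoded), escaped)
```

So a single such character shows up in the URI as three %xx escapes rather than two.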
> This makes it very difficult to parse a URL.
Assuming your system has UTF-8 APIs, as most systems do now, then it is as
easy as converting the escapes into bytes, then treating the string as
UTF-8 using the native API.
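A minimal sketch of that two-step approach in Python follows. The helper names unescape_to_bytes and decode_url_component are made up for illustration, and the code assumes the escapes really are UTF-8, which, as noted, is a convention rather than something a specification guarantees:

```python
HEX_DIGITS = "0123456789abcdefABCDEF"

def unescape_to_bytes(s: str) -> bytes:
    """Step 1: turn each %xx escape into the single byte it names."""
    out = bytearray()
    i = 0
    while i < len(s):
        pair = s[i + 1:i + 3]
        if s[i] == "%" and len(pair) == 2 and all(c in HEX_DIGITS for c in pair):
            out.append(int(pair, 16))
            i += 3
        else:
            # Anything that is not a well-formed escape is passed through.
            out.extend(s[i].encode("utf-8"))
            i += 1
    return bytes(out)

def decode_url_component(s: str) -> str:
    """Step 2: treat the resulting bytes as UTF-8 using the native API."""
    # Raises UnicodeDecodeError if the bytes are not valid UTF-8.
    return unescape_to_bytes(s).decode("utf-8")

print(repr(decode_url_component("%20%3A%22")))
print(repr(decode_url_component("%3A%20%22")))
```

Note that the parser never has to guess where one character ends and the next begins: the escapes are undone one byte at a time, and the UTF-8 decoder works out the character boundaries from the bytes themselves. In practice Python's urllib.parse.unquote already does both steps.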
> An RFC would be nice, so at least I know I am dealing with the same
> solution with all modern web browsers.
I believe (although I may be wrong) that it is currently undefined, but
that typically UAs use UTF-8. MSDN may document IE's behaviour. I suggest
experimenting.
--
Ian Hickson )\._.,--....,'``. fL
"meow" /, _.. \ _\ ;`._ ,.
http://index.hixie.ch/ `._.-(,_..'--(,_..'`-.;.'
Received on Monday, 23 December 2002 05:47:06 UTC