- From: Ian Hickson <ian@hixie.ch>
- Date: Mon, 23 Dec 2002 10:47:04 +0000 (GMT)
- To: André-John Mas <ajmas@sympatico.ca>
- Cc: "www-talk@w3.org" <www-talk@w3.org>
On Sun, 22 Dec 2002, André-John Mas wrote:
>
> I have tried searching for documentation on URLs and double-byte
> characters, even searched this mailing-list, but could find
> nothing concrete.

As far as I know, there is no specification which explicitly allows (or
defines) codepoints outside the US-ASCII range. However, UAs appear to
have adopted a convention of encoding URIs using UTF-8.

> I have looked for some algorithms, but while they worked in the
> majority of cases, they failed in a few special cases:
>
>  - %20%3A%22
>    -- is this a space followed by one double byte character, or
>       two single byte characters?

Assuming that is UTF-8:

   0x20        0x3A        0x22
   0b00100000  0b00111010  0b00100010

Since all three have their most significant bit set to 0, they are all
single byte characters, namely U+0020, U+003A, and U+0022. That, in
US-ASCII, is a space, a colon, and a double quote character respectively.

>  - %3A%20%22
>    -- single byte character, space, single byte character OR
>       double byte character, single byte character OR single
>       byte character, double byte character?

Same characters, in a different order.

> Using Mozilla I find that it encodes URLs in UTF-8 with a mixture
> of single byte and double byte characters.

Yes, it encodes the URI in UTF-8, which is a variable-byte-length
encoding: characters in the range U+00000000 - U+0000007F are single
byte, U+00000080 - U+000007FF are double byte, etc, up to
U+04000000 - U+7FFFFFFF, which have 6 bytes.

For more information on the UTF-8 encoding algorithm, see ISO/IEC 10646-1
Annex R (Amendment 2).

> For example, a space will be represented as %20, any reserved ASCII
> character will use a single byte %xx value, but anything in Chinese
> will be defined using a double byte %xx%yy value.

Actually most CJK characters take three bytes in a UTF-8 encoding.

> This makes it very difficult to parse a URL.

Assuming your system has UTF-8 APIs, as most systems do now, then it is
as easy as converting the escapes into bytes, then treating the string as
UTF-8 using the native API.

> An RFC would be nice, so at least I know I am dealing with the same
> solution with all modern web browsers.

I believe (although I may be wrong) that it is currently undefined, but
that typically UAs use UTF-8. MSDN may document IE's behaviour. I suggest
experimenting.

-- 
Ian Hickson                                      )\._.,--....,'``.    fL
"meow"                                          /,   _.. \   _\  ;`._ ,.
http://index.hixie.ch/                         `._.-(,_..'--(,_..'`-.;.'
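
As a rough illustration of the approach described in this message (turn
the %xx escapes into bytes, then decode those bytes as UTF-8), here is a
minimal Python sketch. The function names decode_escaped and utf8_length
are hypothetical and chosen only for this example; nothing here is taken
from any browser's implementation, and it assumes the escaped string
really was produced from UTF-8. The length table follows the original
1 to 6 byte scheme mentioned above; current UTF-8 stops at U+10FFFF, so
only the 1 to 4 byte forms occur in practice.

```python
from urllib.parse import unquote_to_bytes


def decode_escaped(escaped: str) -> str:
    """Convert %xx escapes to raw bytes, then decode the bytes as UTF-8.

    Assumption: the escapes encode UTF-8. If they do not, decode()
    raises UnicodeDecodeError rather than guessing another encoding.
    """
    raw = unquote_to_bytes(escaped)
    return raw.decode("utf-8")


def utf8_length(code_point: int) -> int:
    """Number of bytes needed for a code point in the original
    1 to 6 byte UTF-8 scheme (illustrative helper)."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    # Each entry is the largest code point that fits in that many bytes.
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    for length, limit in enumerate(limits, start=1):
        if code_point <= limit:
            return length
    raise ValueError("code point out of range")


if __name__ == "__main__":
    # The two cases from the question: three single byte characters each.
    print(repr(decode_escaped("%20%3A%22")))
    print(repr(decode_escaped("%3A%20%22")))
    # A CJK character (U+4E2D) takes three bytes: E4 B8 AD.
    print(decode_escaped("%E4%B8%AD"), utf8_length(0x4E2D))
```

Because every byte of a multi-byte UTF-8 sequence has its most
significant bit set, the decoder never has to guess whether %20 or %3A
is part of a longer character: bytes below 0x80 always stand alone.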
Received on Monday, 23 December 2002 05:47:06 UTC