URLs and double byte characters (unicode) from André-John Mas on 2002-12-22 (www-talk@w3.org from November to December 2002)

From: André-John Mas <ajmas@sympatico.ca>
Date: Sun, 22 Dec 2002 10:12:05 -0500
To: www-talk@w3.org
Message-Id: <BAFBC174-15BF-11D7-8CB5-003065D6B164@sympatico.ca>

Hi,

I have tried searching for documentation on URLs and double-byte
characters, and have even searched this mailing list, but could
find nothing concrete.

For me the issue has arisen because I am writing a servlet that
allows browsing of a virtual directory structure that in certain
cases has entries with Chinese names.

I have looked at some algorithms, but while they worked in the
majority of cases, they failed in a few special cases:

   - %20%3A%22
     -- is this a space followed by one double byte character, or
     two single byte characters?

   - %3A%20%22
     -- single byte character, space, single byte character OR
     double byte character, single byte character OR single
     byte character, double byte character?
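
One way to approach the two cases above, sketched in Python and
assuming the escaped bytes are UTF-8 (the sender's charset is not
actually stated anywhere in the URL): undo the %xx escapes into raw
bytes first, and only then decode the bytes as text. The function
name `describe` is made up for this illustration.

```python
from urllib.parse import unquote_to_bytes


def describe(escaped, encoding="utf-8"):
    """Undo %xx escapes, decode the bytes, and report each character.

    The encoding is an assumption: nothing in the URL itself says
    which charset the escaped bytes are in.
    """
    raw = unquote_to_bytes(escaped)   # "%20%3A%22" -> b' :"'
    text = raw.decode(encoding)       # raises UnicodeDecodeError if invalid
    return [
        "U+%04X (%d byte(s))" % (ord(ch), len(ch.encode(encoding)))
        for ch in text
    ]


print(describe("%20%3A%22"))
print(describe("%3A%20%22"))
```

Under the UTF-8 assumption both sequences come out as three single
byte characters, since 0x20, 0x3A and 0x22 are all below 0x80. Under
a legacy double-byte charset the answer could differ, which is the
ambiguity described above.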

Using Mozilla, I find that it encodes URLs as UTF-8, with a mixture
of single byte and double byte characters. For example, a space will
be represented as %20, any reserved ASCII character will use a
single byte %xx value, but anything in Chinese will be encoded
using a double byte %xx%yy value. This makes it very difficult
to parse a URL. I would say that the problem is with Mozilla,
but for me the real problem is the lack of any documentation
on the issue. An RFC would be nice, so at least I know I am
dealing with the same solution in all modern web browsers.
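
If the bytes really are UTF-8, the mixture of lengths can be parsed,
because the first byte of each character says how many bytes the
character occupies. Note that in UTF-8 most Chinese characters take
three bytes (%xx%yy%zz), not two. The following is only a sketch in
Python under that assumption; the function names
`utf8_sequence_length` and `split_characters` are made up here.

```python
from urllib.parse import quote, unquote_to_bytes


def utf8_sequence_length(lead_byte):
    """Return how many bytes a UTF-8 character starting with lead_byte uses."""
    if lead_byte < 0x80:
        return 1                      # plain ASCII
    if 0xC2 <= lead_byte <= 0xDF:
        return 2
    if 0xE0 <= lead_byte <= 0xEF:
        return 3                      # most Chinese characters land here
    if 0xF0 <= lead_byte <= 0xF4:
        return 4
    raise ValueError("not a valid UTF-8 lead byte: 0x%02X" % lead_byte)


def split_characters(escaped):
    """Undo %xx escapes, then cut the bytes into one string per character."""
    raw = unquote_to_bytes(escaped)
    chars = []
    i = 0
    while i < len(raw):
        n = utf8_sequence_length(raw[i])
        chars.append(raw[i:i + n].decode("utf-8"))
        i += n
    return chars


escaped = quote("中 文")               # '%E4%B8%AD%20%E6%96%87'
print(escaped)
print(split_characters(escaped))      # ['中', ' ', '文']
```

This only works when the charset is known to be UTF-8; with an
unknown charset the lead-byte rule above does not apply.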

regards

Andre

Received on Sunday, 22 December 2002 11:27:16 UTC