Re: URLs and double byte characters (unicode)

* André-John Mas wrote:
>I have tried searching for documentation on URLs and double-byte
>characters, even searched this mailing-list, but could find
>nothing concrete.

http://www.w3.org/International/O-URL-and-ident.html

>For me the issue has arisen because I am writing a servlet that
>allows for the browsing of a virtual directory structure that in
>certain cases has entries with Chinese names.
>
>I have looked for some algorithms, but while they worked in the
>majority of cases failed in a few special cases:
>
>   - %20%3A%22
>     -- is this a space followed by one double byte character, or
>     two single byte characters?
>
>   - %3A%20%22
>     -- single byte character, space, single byte character OR
>     double byte character, single byte character OR single
>     byte character, double byte character?

The TAG seems to agree that only the server knows what %xx-escaped
octets represent; see their recent minutes at

  http://www.w3.org/mid/3DFE544E.3050201@w3.org

If that's true, there are some errors in RFC 2396 that give a different
impression; see

  http://www.w3.org/mid/3e03179c.68187638@smtp.bjoern.hoehrmann.de
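
To illustrate the point that %xx escapes only give you octets, here is a
minimal Python sketch. The helper name decode_path and the sample strings
are my own, chosen for illustration; the charsets are assumptions, since
nothing in the URI itself says which one the server used.

```python
from urllib.parse import unquote_to_bytes

def decode_path(escaped: str, charset: str) -> str:
    """Undo %xx escaping to raw octets, then interpret them in `charset`.

    The charset is an assumption supplied by the caller; the escaped
    string carries no information about it.
    """
    octets = unquote_to_bytes(escaped)
    return octets.decode(charset)

# The same three escaped octets, read under two different charsets.
as_utf8 = decode_path("%E4%B8%AD", "utf-8")      # one character
as_latin1 = decode_path("%E4%B8%AD", "latin-1")  # three characters

# The first example from the question: every octet is below 0x80,
# so under UTF-8 each one is a single ASCII character.
ascii_only = decode_path("%20%3A%22", "utf-8")
```

If the server is known to use UTF-8, the two examples in the question are
not ambiguous: octets below 0x80 never form part of a multi-byte sequence
in UTF-8.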

>Using Mozilla I find that it encodes UTF-8 URLs with a mixture
>of single-byte and double-byte characters. For example, a space will
>be represented as %20, any reserved ASCII character will use a
>single-byte %xx value, but anything in Chinese will be defined
>using a double-byte %xx%yy value. This makes it very difficult
>to parse a URL. I would say that the problem is with Mozilla,
>but for me the real problem is the lack of any documentation
>on the issue.

URIs with non-ASCII characters are invalid, so you are responsible for
%xx-escaping your URI references properly. If you do this, no user agent I
am aware of will touch the URI, and the server can deal with it as it
likes. How to recover from invalid URIs is undefined.
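
As a rough sketch of what escaping properly could look like when generating
links, the following Python uses urllib.parse.quote. The helper name
escape_path and the sample path are hypothetical, and UTF-8 is an assumed
choice; whatever charset you pick, the server has to decode with the same
one.

```python
from urllib.parse import quote, unquote

def escape_path(path: str, charset: str = "utf-8") -> str:
    """Encode `path` in `charset` and %xx-escape every octet that is not
    an unreserved character, leaving '/' intact as the segment separator."""
    return quote(path, safe="/", encoding=charset)

# Hypothetical directory entry containing one Chinese character and a space.
escaped = escape_path("/docs/\u4e2d a")

# The server reverses the step with the same charset.
restored = unquote(escaped, encoding="utf-8")
```

The result contains only ASCII, so a user agent has no reason to re-encode
it before sending the request.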

Received on Monday, 23 December 2002 07:57:08 UTC