Re: Using UTF-8 for non-ASCII Characters in URLs from Martin J. Duerst on 1997-05-02 (uri@w3.org from May 1997)

From: Martin J. Duerst <mduerst@ifi.unizh.ch>
Date: Fri, 2 May 1997 18:19:43 +0200 (MET DST)
To: Larry Masinter <masinter@parc.xerox.com>
cc: "Michael Kung <MKUNG.US.ORACLE.COM>" <MKUNG@us.oracle.com>, uri@bunyip.com
Message-ID: <Pine.SUN.3.96.970502180918.245k-100000@enoshima>

On Tue, 29 Apr 1997, Larry Masinter wrote:

> This isn't just a "small point", it's essential:
> 
> The only way to guarantee "round trip" is to stick to the smallest
> repertoire of characters.

Yes. But it has to be qualified. It is the smallest set of
characters that you think your target audience is safely
able to distinguish and handle.

> Clearly you shouldn't enter "http" as
> wide characters,

That goes without saying, or doesn't it? Or a browser could
convert it to half-width characters (as a courtesy to the user,
not as part of any spec).
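
As an illustration only, such a courtesy conversion could look roughly
like the Python sketch below. The helper name fold_fullwidth and the
example.org URL are made up for this example and do not come from any
spec. The sketch relies on the fact that the full-width ASCII variants
U+FF01 to U+FF5E sit at a fixed offset of 0xFEE0 from their half-width
counterparts U+0021 to U+007E.

```python
# Hypothetical helper (not part of any spec): fold full-width ASCII
# variants (U+FF01..U+FF5E) to half-width ASCII (U+0021..U+007E).
FULLWIDTH_OFFSET = 0xFEE0


def fold_fullwidth(text: str) -> str:
    folded = []
    for ch in text:
        cp = ord(ch)
        if 0xFF01 <= cp <= 0xFF5E:
            # Full-width variant: shift down to the ASCII code point.
            folded.append(chr(cp - FULLWIDTH_OFFSET))
        else:
            # Everything else is passed through unchanged.
            folded.append(ch)
    return "".join(folded)


# "\uff48\uff54\uff54\uff50" is "http" typed in full-width letters.
print(fold_fullwidth("\uff48\uff54\uff54\uff50://example.org/"))
```

Whether a browser should fold only the scheme or more of the URL is a
separate question that this sketch does not try to answer.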

> and if you have 'wide characters' that need
> to be distinguished from ascii characters, you should encode them
> in hex-encoded-UTF8 always.

I think we have to distinguish two cases:

The case that the URL is just used as a carrier for transporting
information from point to point (FORM/QUERY): In this case,
both hex-encoded and 8-bit UTF-8 will work, as the binary
world is never left (but we know there are other problems with
queries; I am working towards a draft about them).
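
To make the point concrete, here is a small sketch in Python, using
the standard urllib.parse module. The sample character U+00FC is an
arbitrary choice for illustration. It shows that the hex-encoded form
and the 8-bit form carry exactly the same UTF-8 bytes.

```python
from urllib.parse import quote, unquote_to_bytes

# U+00FC is an arbitrary non-ASCII sample character (an assumption
# made for this example only).
text = "\u00fc"

raw = text.encode("utf-8")      # 8-bit UTF-8 form: two bytes, C3 BC
escaped = quote(text, safe="")  # hex-encoded form of the same bytes

print(raw)
print(escaped)

# Undoing the %HH escapes gives back the identical byte sequence, so
# as long as only bytes are transported, both forms are equivalent.
assert unquote_to_bytes(escaped) == raw
```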

The case that URLs are passed around, on paper and so on: In this
case, using %HH as a backup mechanism works, but it is no fun.
As there may be target audiences that can very well (actually
too well :-) distinguish between half-width and full-width
variants (e.g. East Asian programmers), it may very well be
possible to issue such URLs for such audiences. That's why
for such cases, I specify neither equivalence nor normalization,
but I strongly discourage the use of such variants because they
cannot be safely distinguished by a wider audience.
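
A short Python sketch of what the absence of equivalence means in
practice (the letter "A" is just a sample): the half-width and the
full-width letter are different characters with different UTF-8
bytes, so they also give different %HH forms, and nothing on the
wire treats them as the same.

```python
from urllib.parse import quote

half = "A"        # U+0041, half-width (plain ASCII)
full = "\uff21"   # U+FF21, FULLWIDTH LATIN CAPITAL LETTER A

# The ASCII letter needs no escaping; the full-width letter becomes
# three escaped UTF-8 bytes.
print(quote(half))
print(quote(full))

# With no equivalence or normalization defined, the two stay distinct.
assert quote(half) != quote(full)
```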

Regards,	Martin.

Received on Friday, 2 May 1997 12:20:23 UTC