Re: Using UTF-8 for non-ASCII Characters in URLs

Larry Masinter (masinter@parc.xerox.com)
Wed, 30 Apr 1997 01:00:27 PDT


Message-ID: <3366FC1B.EA8@parc.xerox.com>
Date: Wed, 30 Apr 1997 01:00:27 PDT
From: Larry Masinter <masinter@parc.xerox.com>
To: Dan Oscarsson <Dan.Oscarsson@trab.se>
CC: uri@bunyip.com
Subject: Re: Using UTF-8 for non-ASCII Characters in URLs

Dan,

> This is not right. A directory listing service generates a html document
> that is sent back to the web browser. All URLs within a html document
> should use the same character set as the document uses. That is, 
> if the document uses iso 8859-1, the URLs will be in iso 8859-1, and
> if the document is in UTF-8, the URLs will be in UTF-8.

Dan, for each item in a directory listing, there are two entries.

<A HREF="this-is-the-URL">this-is-what-the-user-sees</A>

The URL in the 'this-is-the-URL' part should use hex-encoded-UTF8,
no matter what the user sees.
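
As a rough illustration of the two parts of such an entry, here is a minimal
Python sketch. The function name listing_entry and the sample file name are
hypothetical, and it assumes the server already has the file name as a Unicode
string. The HREF part is hex-encoded UTF-8 and so contains only ASCII, while
the visible part is left in whatever character set the document uses.

```python
from html import escape
from urllib.parse import quote


def listing_entry(filename: str) -> str:
    """Build one directory-listing anchor for a file name (hypothetical helper)."""
    # quote() encodes the string as UTF-8 by default and hex-escapes every
    # byte that is not an unreserved ASCII character.
    href = quote(filename, safe="")
    # The visible text is only HTML-escaped; it is later serialized in the
    # document's own character set.
    return f'<A HREF="{href}">{escape(filename)}</A>'


print(listing_entry("日本語.txt"))
```

The URL part stays the same regardless of which character set the listing
itself is served in.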

I'll try to make clear that the recommendation for how URLs should
be processed really only applies to the URLs and not to anything
else that isn't a URL.

> If the browser knows how to handle the character set of the html document,
> it also should know how to translate the embedded URLs into UTF-8 when
> the user follows a link.

I think you've missed the whole point. A browser that knows only
ISO-8859-1 and KOI-8 can continue to process directory
listings from servers that have files whose file names
are in Japanese.
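
To make the point concrete, here is a small Python sketch under stated
assumptions: the listing is served in EUC-JP, the browser decodes everything
as ISO-8859-1, and the helper names (serve_listing, browser_follow) are made
up for illustration. The visible text comes out garbled, but the URL is plain
ASCII, so the link is extracted intact and the server can recover the file
name from it.

```python
import re
from urllib.parse import quote, unquote


def serve_listing(filename: str, charset: str = "euc_jp") -> bytes:
    """Server side: hex-encoded UTF-8 in the URL, document charset for the text."""
    href = quote(filename, safe="")
    return f'<A HREF="{href}">{filename}</A>'.encode(charset)


def browser_follow(page: bytes) -> str:
    """Browser side: knows only ISO-8859-1, yet still finds the URL."""
    text = page.decode("iso-8859-1")  # visible text becomes mojibake
    match = re.search(r'HREF="([^"]*)"', text)
    if match is None:
        raise ValueError("no link found")
    return match.group(1)


page = serve_listing("日本語.txt")
url = browser_follow(page)
# The server maps the requested URL back to the file name via UTF-8.
print(url, unquote(url) == "日本語.txt")
```

The browser never needs to interpret the Japanese characters; it only hands
back the ASCII URL it was given.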

> In general, URLs used without a context that defines the characters used,
> should be encoded using UTF-8. URLs used within a context where the
> meaning of the characters is defined should use the character encoding
> of the context.

I suppose you're entitled to the opinion that that's how they "should"
be encoded, but this is a different recommendation from those being
promoted by others on this mailing list.

If you want to make a counter-proposal, you're free to do so, but
I don't think you have described anything that is actually workable.

Larry