Re: Using UTF-8 for non-ASCII Characters in URLs

Larry Masinter (masinter@parc.xerox.com)
Wed, 30 Apr 1997 14:31:30 PDT


Message-ID: <3367BA32.6588@parc.xerox.com>
Date: Wed, 30 Apr 1997 14:31:30 PDT
From: Larry Masinter <masinter@parc.xerox.com>
To: Francois Yergeau <yergeau@alis.com>
CC: uri@bunyip.com
Subject: Re: Using UTF-8 for non-ASCII Characters in URLs

Francois,

I suggested:
><A HREF="this-is-the-URL">this-is-what-the-user-sees</A>
>
>The URL in the 'this-is-the-URL' part should use hex-encoded-UTF8,
>no matter what the user sees.

and you responded:

"That would break with current practice.  Please see
<http://www.alis.com/~yergeau/url-00.html>, section 4 for a discussion
of this issue."

However, I'm not aware of any current practice that does what section 4
suggests, namely:

"This shows the path to be followed with non-ASCII URLs embedded in a
text file: simply encode the characters of the URL in the same way as
the other characters of the document, i.e. using the CCS of the
document. If a character in the URL is not part of the repertoire of
this CCS, use URL-encoding of the UTF-8 representation to preserve that
character's identity."
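
To make the quoted rule concrete, here is a minimal Python sketch of
how I read it. The function name embed_url and the sample URL are
hypothetical, chosen only for illustration; neither appears in the
draft. A character stays as-is if the document's CCS can represent it,
and is otherwise replaced by the URL-encoding of its UTF-8 bytes.

```python
from urllib.parse import quote

def embed_url(url: str, charset: str) -> str:
    """Sketch of the quoted section 4 rule (hypothetical helper).

    Characters the document charset can represent are kept unchanged;
    any other character becomes the %HH encoding of its UTF-8 bytes.
    """
    out = []
    for ch in url:
        try:
            ch.encode(charset)           # representable in the document's CCS?
            out.append(ch)
        except UnicodeEncodeError:
            out.append(quote(ch, safe=""))  # quote() uses UTF-8 by default
    return "".join(out)

# U+00E9 is in ISO-8859-1 and is kept; U+65E5 is not and gets hex-encoded.
latin1_form = embed_url("caf\u00e9/\u65e5", "iso-8859-1")
# In a pure ASCII document both characters fall back to hex-encoded UTF-8.
ascii_form = embed_url("caf\u00e9/\u65e5", "ascii")
```

Note that under this reading the same URL takes a different form
depending on the charset of the document that carries it.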

This would require one transcoding mechanism for URLs and a different
one for the rest of the document. Normally, transcoding a Unicode HTML
document into ISO-8859-1 requires converting characters with code
points outside the range 0-255 into numeric character references;
however, you are suggesting that such characters in URLs be turned into
hex-encoded UTF-8 instead. Right?
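
As a rough illustration of the two mechanisms, the Python sketch below
transcodes the same out-of-repertoire character both ways. The helper
names transcode_text and transcode_url are assumptions made for this
example, not anything defined in the draft or in HTML itself.

```python
from urllib.parse import quote

def transcode_text(text: str, charset: str = "iso-8859-1") -> bytes:
    # Ordinary document text: characters the charset cannot represent
    # become numeric character references (&#NNNN;).
    return text.encode(charset, errors="xmlcharrefreplace")

def transcode_url(url: str) -> str:
    # URL text: non-ASCII characters become %HH escapes of their UTF-8
    # bytes; "/" and ":" are left alone in this simplified sketch.
    return quote(url, safe="/:")

as_text = transcode_text("\u65e5")   # U+65E5 as it would appear in running text
as_url = transcode_url("/\u65e5")    # the same character inside a URL path
```

The same character ends up as a decimal character reference in running
text but as three %HH escapes inside the URL, which is why the
transcoder would need to know which parts of the document are URLs.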

Could you clarify which current practice this would "break"?