Re: Using UTF-8 for non-ASCII Characters in URLs from Martin J. Duerst on 1997-05-01 (uri@w3.org from May 1997)

From: Martin J. Duerst <mduerst@ifi.unizh.ch>
Date: Thu, 1 May 1997 14:50:11 +0200 (MET DST)
To: Larry Masinter <masinter@parc.xerox.com>
cc: Francois Yergeau <yergeau@alis.com>, uri@bunyip.com
Message-ID: <Pine.SUN.3.96.970501143843.245M-100000@enoshima>

On Wed, 30 Apr 1997, Larry Masinter wrote:

From Francois' web page:

> "This shows the path to be followed with non-ASCII URLs embedded in a
> text file: simply encode the characters of the URL in the same way as
> the other characters of the document, i.e. using the CCS of the
> document. If a character in the URL is not part of the repertoire of
> this CCS, use URL-encoding of the UTF-8 representation to preserve that
> character's identity."

Larry's comment:

> You would require a different transcoding mechanism for the URL and for
> the rest of the document. Normally, transcoding a Unicode document in
> HTML into ISO-8859-1 requires converting characters outside of 0-255
> into numeric character references; however, you are suggesting turning
> URLs into hex-encoded UTF-8 instead. Right?

Not exactly. Probably Francois' wording above ("is not part of the repertoire
of this CCS") should be a little different, saying something like
"cannot be represented in the document". HTML documents conforming to
RFC2070/Cougar/... can represent the whole repertoire of ISO 10646/Unicode,
so there is no need to translate to %HH. For automatic
transcoding of HTML documents, using &#nnn; is definitely possible,
and easier because it does not require parsing the document. On the
other hand, a more sophisticated tool definitely could, and probably
should, use %HH, because the fact that the characters don't fit into
the underlying CCS is a strong indication that the target readers
may not be able to use the original form in further transcription.
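
To make the two options concrete, here is a minimal Python sketch of what
a transcoder could do when the target CCS is ISO-8859-1. The function
names to_ncr and to_percent_hh and the sample path are hypothetical,
chosen only for illustration. The first function needs no parsing and
could be run over the whole document; the second is meant to be applied
only to a URL that a parsing tool has already located.

```python
from urllib.parse import quote


def to_ncr(text, encoding="iso-8859-1"):
    """Replace characters outside `encoding` with &#nnn; references.

    Works on any text without knowing where the URLs are.
    """
    return text.encode(encoding, errors="xmlcharrefreplace").decode(encoding)


def to_percent_hh(url, encoding="iso-8859-1"):
    """Replace characters outside `encoding` with %HH of their UTF-8 bytes.

    Characters that fit into `encoding` are left untouched.
    """
    out = []
    for ch in url:
        try:
            ch.encode(encoding)
            out.append(ch)
        except UnicodeEncodeError:
            # quote() percent-encodes the UTF-8 bytes by default.
            out.append(quote(ch, safe=""))
    return "".join(out)


# Hypothetical path: U+00E9 fits into ISO-8859-1, U+65E5 and U+672C do not.
path = "caf\u00e9/\u65e5\u672c"
print(to_ncr(path))         # caf\u00e9 kept, then &#26085;&#26412;
print(to_percent_hh(path))  # caf\u00e9 kept, then %E6%97%A5%E6%9C%AC
```

In both cases the character that fits into the target CCS is left alone,
and only the two characters outside it are rewritten.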

Regards,	Martin.

Received on Thursday, 1 May 1997 08:54:04 UTC