Re: Using UTF-8 for non-ASCII Characters in URLs

Martin J. Duerst (mduerst@ifi.unizh.ch)
Fri, 16 May 1997 23:10:23 +0200 (MET DST)


Date: Fri, 16 May 1997 23:10:23 +0200 (MET DST)
From: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
To: Dan Oscarsson <Dan.Oscarsson@trab.se>
cc: masinter@parc.xerox.com, Gary.Adams@east.sun.com, uri@bunyip.com
Subject: Re: Using UTF-8 for non-ASCII Characters in URLs
In-Reply-To: <199705020952.LAA10593@valinor.malmo.trab.se>
Message-ID: <Pine.SUN.3.96.970516224529.6801j-100000@enoshima>

On Fri, 2 May 1997, Dan Oscarsson wrote:

> > > > <A HREF="this-is-the-URL">this-is-what-the-user-sees</A>
> > > > 
> > > > The URL in the 'this-is-the-URL' part should use hex-encoded-UTF8,
> > > > no matter what the user sees.
> > > > 
> > > 
> > > If you use hex-encoding, yes. But NOT if you use the native character set
> > > of the document. In that case, the 'this-is-the-URL' part must
> > > use the same character set as the rest of the html document. Raw UTF-8
> > > may only be used in a UTF-8 encoded html document, not in a iso 8859-1
> > > encoded document.
> > 
> > The document character set for HTML 2.0 and 3.2 was iso 8859-1.
> > The document character set for HTML 4.0 and XML will be iso 10646.
> As iso 8859-1 is a true subset of iso 10646 I assume that html 4.0
> will also handle iso 8859-1 encoded documents, otherwise it will break
> a lot of html pages and software of today.

The document character set for HTML, or XML, is usually not identical
to the encoding ("charset") that is used for transmitting or storing
the document. Please see the reference processing model in RFC 2070
for explanations. Putting raw octets of one encoding inside a text
document that is encoded with some other "charset" is ill-advised in
all cases.

With URLs fully implemented, internationalization included, there
would be three ways of transmitting the characters of a URL
(and they could be mixed, character by character, within the same URL):

1) Character encoded as UTF-8 and then encoded with %HH.

2) Character encoded as numeric character reference: &#nnnn;,
	where nnnn is the decimal number of the character in
	ISO 10646/Unicode. [In XML, and possibly also in future
	versions of HTML, there will be a variant of this,
	namely &#xhhhh;, where hhhh is the hexadecimal representation
	of the same character, in the same standards.]

3) Character encoded in the "charset" of the document.


Some examples:

a) The letter "w": 1) "%77";   2) "&#119;" [or "&#x77;"];   3) "w"

b) The letter u-umlaut: 1) "%C3%BC";   2) "&#252;" [or "&#xFC;"];
	3) if the "charset" is iso-8859-1, then an octet 0xFC, not
		representable here. Alternatively, "&uuml;", available
		only for certain characters.
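
The two examples can be reproduced mechanically. What follows is only
a minimal sketch in Python, not text proposed for any draft; the
function names (percent_utf8, numeric_ref, charset_octets) are
illustrative, and the numeric references are written with the
terminating ";" as in the definition under 2).

```python
def percent_utf8(ch: str) -> str:
    """Form 1: encode the character as UTF-8, then write each octet as %HH."""
    return "".join(f"%{octet:02X}" for octet in ch.encode("utf-8"))


def numeric_ref(ch: str, hexadecimal: bool = False) -> str:
    """Form 2: numeric character reference to the ISO 10646/Unicode number."""
    code_point = ord(ch)
    return f"&#x{code_point:X};" if hexadecimal else f"&#{code_point};"


def charset_octets(ch: str, charset: str) -> bytes:
    """Form 3: the raw octets of the character in the document's "charset"."""
    return ch.encode(charset)


# "w" and u-umlaut (U+00FC), as in examples a) and b).
for ch in ("w", "\u00fc"):
    print(
        percent_utf8(ch),
        numeric_ref(ch),
        numeric_ref(ch, hexadecimal=True),
        charset_octets(ch, "iso-8859-1"),
    )
```

Note that form 1 always percent-encodes here, even for "w"; in
practice one would leave unreserved ASCII characters as they are.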


I guess this could go into Larry's draft more or less directly.
Of course, we can add advice about preferred representations
and deployment (for the moment, %HH is more stable than the others,
except in trivial cases such as the "w" above).

But people will start to type these characters into URLs in HTML as
soon as they see them, e.g. in their file system viewers, and can type
them, e.g. into their browsers. That will happen quite naturally. And
we had better make sure that things work instead of trying to
rule it out. And they definitely work better with 3) than with
raw UTF-8 in a document that is not encoded as UTF-8. This has
been explained by Francois on his page quite some time ago :-).
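
To make the last point concrete, here is a small Python sketch of what
a reader sees when raw UTF-8 octets sit in a document labelled
iso-8859-1, compared with 3). The helper name read_document is an
assumption for illustration only; it simply decodes octets with the
declared "charset".

```python
def read_document(octets: bytes, declared_charset: str) -> str:
    """Decode the stored octets the way the declared "charset" says to."""
    return octets.decode(declared_charset)


u_umlaut = "\u00fc"

# Raw UTF-8 octets placed in a document declared as iso-8859-1:
# the two octets 0xC3 0xBC are read as two separate characters.
misread = read_document(u_umlaut.encode("utf-8"), "iso-8859-1")

# Case 3): the character is stored in the document's own "charset".
native = read_document(u_umlaut.encode("iso-8859-1"), "iso-8859-1")

print(len(misread), [hex(ord(c)) for c in misread])
print(len(native), [hex(ord(c)) for c in native])
```

The first case yields two characters (U+00C3, U+00BC) where one
u-umlaut was meant; the second yields the intended character.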


Regards,	Martin.