Date: Fri, 16 May 1997 23:10:23 +0200 (MET DST)
From: "Martin J. Duerst" <email@example.com>
To: Dan Oscarsson <Dan.Oscarsson@trab.se>
cc: firstname.lastname@example.org, Gary.Adams@east.sun.com, email@example.com
Subject: Re: Using UTF-8 for non-ASCII Characters in URLs
In-Reply-To: <199705020952.LAA10593@valinor.malmo.trab.se>
Message-ID: <Pine.SUN.3.96.970516224529.6801j-100000@enoshima>

On Fri, 2 May 1997, Dan Oscarsson wrote:

> > > > <A HREF="this-is-the-URL">this-is-what-the-user-sees</A>
> > > >
> > > > The URL in the 'this-is-the-URL' part should use hex-encoded-UTF8,
> > > > no matter what the user sees.
> > > >
> > >
> > > If you use hex-encoding, yes. But NOT if you use the native character set
> > > of the document. In that case, the 'this-is-the-URL' part must
> > > use the same character set as the rest of the html document. Raw UTF-8
> > > may only be used in a UTF-8 encoded html document, not in a iso 8859-1
> > > encoded document.
> >
> > The document character set for HTML 2.0 and 3.2 was iso 8859-1.
> > The document character set for HTML 4.0 and XML will be iso 10646.
> As iso 8859-1 is a true subset of iso 10646 I assume that html 4.0
> will also handle iso 8859-1 encoded documents, otherwise it will break
> a lot of html pages and software of today.

The document character set for HTML, or XML, is not usually
identical to the encoding ("charset") that is used for transmitting
or storing the document. Please see the reference processing model
in RFC 2070 for explanations.

The use of raw encoding of some type inside a text document encoded
with some other "charset" is in all cases very ill-advised.

In fully implemented URLs including internationalization, there
would be three possibilities for transmitting the characters of
an URL (each of which could be used alternately for characters
in the same URL):

1) Character encoded as UTF-8 and then encoded with %HH.
2) Character encoded as numeric character reference: &#nnnn;, where
   nnnn is the decimal number of the character in ISO 10646/Unicode.
   [In XML, and possibly also in future versions of HTML, there will
   be a variant of this, namely &#xhhhh;, where hhhh is the hexadecimal
   representation of the same character, in the same standards.]
3) Character encoded in the "charset" of the document.

Some examples:

a) The letter "w": 1) "%77"; 2) "&#119;" [or "&#x77;"]; 3) "w"
b) The letter u-umlaut: 1) "%C3%BC"; 2) "&#252;" [or "&#xFC;"];
   3) if the "charset" is iso-8859-1, then an octet 0xFC, not
   representable here. Alternatively, "&uuml;", available only
   for certain characters.

I guess this could go into Larry's draft more or less directly.
Of course, we can add advice about preferred representations
and deployment (for the moment, %HH is more stable than the
others, except in trivial cases such as the "w" above).

But people will start to type the characters into HTML URLs
when they see them e.g. in their file system viewers and can
type them e.g. into their browsers. That will happen just
naturally. And we had better make sure that things work instead
of trying to rule it out. And they definitely work better with 3)
than with raw UTF-8 in a document that is not encoded as UTF-8.
This has been explained by Francois on his page quite some
time ago :-).

Regards, Martin.
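
As an illustration only, the three representations listed above can be
produced mechanically. The following is a minimal Python sketch; the
function names (percent_encode_utf8, numeric_char_ref, charset_octets)
are made up for this example and do not come from any draft or library.
It merely reproduces the "w" and u-umlaut examples.

```python
def percent_encode_utf8(ch: str) -> str:
    """Form 1: encode the character as UTF-8, then write each octet as %HH."""
    return "".join(f"%{octet:02X}" for octet in ch.encode("utf-8"))


def numeric_char_ref(ch: str, hexadecimal: bool = False) -> str:
    """Form 2: numeric character reference, &#nnnn; or the &#xhhhh; variant."""
    code_point = ord(ch)  # number of the character in ISO 10646/Unicode
    if hexadecimal:
        return f"&#x{code_point:X};"
    return f"&#{code_point};"


def charset_octets(ch: str, charset: str) -> bytes:
    """Form 3: the raw octets of the character in the document's "charset".

    Raises UnicodeEncodeError if the charset cannot represent the character.
    """
    return ch.encode(charset)


if __name__ == "__main__":
    for ch in ("w", "\u00fc"):  # "w" and u-umlaut
        print(
            percent_encode_utf8(ch),
            numeric_char_ref(ch),
            numeric_char_ref(ch, hexadecimal=True),
            charset_octets(ch, "iso-8859-1"),
        )
```

Note that form 3 depends on the "charset" of the document: the same
u-umlaut is the single octet 0xFC in iso-8859-1 but the two octets
0xC3 0xBC in a document encoded as UTF-8.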