Re: Using UTF-8 for non-ASCII Characters in URLs

Sat, 17 May 1997 10:57:34 -0400

At 23:10 97-05-16 +0200, Martin J. Duerst wrote:
>On Fri, 2 May 1997, Dan Oscarsson wrote:
>> > > > <A HREF="this-is-the-URL">this-is-what-the-user-sees</A>
>> > > > 
>> > > > The URL in the 'this-is-the-URL' part should use hex-encoded-UTF8,
>> > > > no matter what the user sees.
>> > > > 
>> > > 
>> > > If you use hex-encoding, yes. But NOT if you use the native
>> > > character set
>> > > of the document. In that case, the 'this-is-the-URL' part must
>> > > use the same character set as the rest of the html document. Raw UTF-8
>> > > may only be used in a UTF-8 encoded html document, not in an iso 8859-1
>> > > encoded document.
>> > 
>> > The document character set for HTML 2.0 and 3.2 was iso 8859-1.
>> > The document character set for HTML 4.0 and XML will be iso 10646.
>> As iso 8859-1 is a true subset of iso 10646 I assume that html 4.0
>> will also handle iso 8859-1 encoded documents, otherwise it will break
>> a lot of html pages and software of today.

[Martin]:
>The document character set for HTML, or XML, is not usually identical
>to the encoding ("charset") that is used for transmitting or storing
>the document.
>Please see the reference processing model in RFC 2070 for explanations.
>The use of raw encoding of some type inside a text document encoded
>with some other "charset" is in all cases very ill-advised.
>In fully implemented URLs including internationalization, there would
>be three possibilities for transmitting the characters of a URL
>(each of which could be used alternately for characters in the same URL):
>1) Character encoded as UTF-8 and then encoded with %HH.
>2) Character encoded as numeric character reference: &#nnnn;,
>	where nnnn is the decimal number of the character in
>	ISO 10646/Unicode. [In XML, and possibly also in future
>	versions of HTML, there will be a variant of this,
>	namely &#xhhhh;, where hhhh is the hexadecimal representation
>	of the same character, in the same standards.]
>3) Character encoded in the "charset" of the document.
>Some examples:
>a) The letter "w": 1) "%77";   2) "&#119;" [or "&#x77;"];   3) "w"
>b) The letter u-umlaut: 1) "%C3%BC";   2) "&#252;" [or "&#xFC;"];
>	3) if the "charset" is iso-8859-1, then an octet 0xFC, not
>		representable here. Alternatively, "&uuml;", available
>		only for certain characters.
>I guess this could go into Larry's draft more or less directly.
>Of course, we can add advice about preferred representations
>and deployment (for the moment, %HH is more stable than the others,
>except in trivial cases such as the "w" above).
>But people will start to type the characters into HTML URLs when
>they see them e.g. in their file system viewers and can type them
>e.g. into their browsers. That will happen just naturally. And
>we had better make sure that things work instead of trying to
>rule it out. And they definitely work better with 3) than with
>raw UTF-8 in a document that is not encoded as UTF-8. This has
>been explained by François on his page quite some time ago :-).
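
For illustration only, here is a minimal Python sketch of the three
representations listed above, and of what goes wrong when raw UTF-8
octets sit in a document labelled iso-8859-1. The function names
(url_forms, misread_raw_utf8) are hypothetical, and iso-8859-1 as the
document "charset" is an assumption taken from the examples; none of
this comes from any draft.

```python
import html
from urllib.parse import unquote


def url_forms(ch, charset="iso-8859-1"):
    """Return the three representations of one character (hypothetical helper)."""
    return {
        # 1) UTF-8 octets, each written as %HH
        "percent": "".join(f"%{b:02X}" for b in ch.encode("utf-8")),
        # 2) numeric character references, decimal and hexadecimal
        "ncr_dec": f"&#{ord(ch)};",
        "ncr_hex": f"&#x{ord(ch):X};",
        # 3) octets in the document's own "charset" (assumed iso-8859-1)
        "raw": ch.encode(charset),
    }


def misread_raw_utf8(ch, doc_charset="iso-8859-1"):
    """Show how raw UTF-8 octets read when decoded with the document's charset."""
    return ch.encode("utf-8").decode(doc_charset)


if __name__ == "__main__":
    for ch in ("w", "\u00fc"):  # "w" and u-umlaut, as in the examples
        print(url_forms(ch))
    # Raw UTF-8 for u-umlaut in an iso-8859-1 document reads as two characters.
    print(misread_raw_utf8("\u00fc"))
    # Forms 1) and 2) decode back to the same character.
    print(unquote("%C3%BC") == html.unescape("&#252;") == "\u00fc")
```

Forms 1) and 2) use only ASCII, so they survive whatever "charset" the
document is stored in; form 3) depends on the document's declared
"charset", which is why raw UTF-8 in an iso-8859-1 document is misread.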

I could not agree more with what Martin says... I'm very pleased...
That describes reality in a very concise way...

Alain LaBonté