- From: Alain LaBonté <alb@riq.qc.ca>
- Date: Sat, 17 May 1997 10:57:34 -0400
- To: "Martin J. Duerst" <mduerst@ifi.unizh.ch>, Dan Oscarsson <Dan.Oscarsson@trab.se>
- Cc: masinter@parc.xerox.com, Gary.Adams@east.sun.com, uri@bunyip.com
At 23:10 97-05-16 +0200, Martin J. Duerst wrote:
>On Fri, 2 May 1997, Dan Oscarsson wrote:
>
>> > > > <A HREF="this-is-the-URL">this-is-what-the-user-sees</A>
>> > > >
>> > > > The URL in the 'this-is-the-URL' part should use hex-encoded-UTF8,
>> > > > no matter what the user sees.
>> > > >
>> > >
>> > > If you use hex-encoding, yes. But NOT if you use the native character set
>> > > of the document. In that case, the 'this-is-the-URL' part must
>> > > use the same character set as the rest of the html document. Raw UTF-8
>> > > may only be used in a UTF-8 encoded html document, not in an iso 8859-1
>> > > encoded document.
>> >
>> > The document character set for HTML 2.0 and 3.2 was iso 8859-1.
>> > The document character set for HTML 4.0 and XML will be iso 10646.
>> As iso 8859-1 is a true subset of iso 10646 I assume that html 4.0
>> will also handle iso 8859-1 encoded documents, otherwise it will break
>> a lot of html pages and software of today.

[Martin]:
>The document character set for HTML, or XML, is not usually identical
>to the encoding ("charset") that is used for transmitting or storing
>the document.
>Please see the reference processing model in RFC 2070 for explanations.
>The use of raw encoding of some type inside a text document encoded
>with some other "charset" is in all cases very ill-advised.
>
>In fully implemented URLs including internationalization, there would
>be three possibilities for transmitting the characters of an URL
>(each of which could be used alternately for characters in the same URL):
>
>1) Character encoded as UTF-8 and then encoded with %HH.
>
>2) Character encoded as numeric character reference: &#nnnn;,
>   where nnnn is the decimal number of the character in
>   ISO 10646/Unicode. [In XML, and possibly also in future
>   versions of HTML, there will be a variant of this,
>   namely &#xhhhh;, where hhhh is the hexadecimal representation
>   of the same character, in the same standards.]
>
>3) Character encoded in the "charset" of the document.
>
>
>Some examples:
>
>a) The letter "w": 1) "%77"; 2) "&#119;" [or "&#x77;"]; 3) "w"
>
>b) The letter u-umlaut: 1) "%C3%BC"; 2) "&#252;" [or "&#xFC;"];
>   3) if the "charset" is iso-8859-1, then an octet 0xFC, not
>   representable here. Alternatively, "&uuml;", available
>   only for certain characters.
>
>
>I guess this could go into Larry's draft more or less directly.
>Of course, we can add advice about preferred representations
>and deployment (for the moment, %HH is more stable than the others,
>except in trivial cases such as the "w" above).
>
>But people will start to type the characters into HTML URLs when
>they see them e.g. in their file system viewers and can type them
>e.g. into their browsers. That will happen just naturally. And
>we had better make sure that things work instead of trying to
>rule it out. And they definitely work better with 3) than with
>raw UTF-8 in a document that is not encoded as UTF-8. This has
>been explained by François on his page quite some time ago :-).

I could not agree more with what Martin says... I'm very pleased...
That describes reality in a very concise way...

Alain LaBonté
Québec
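
[Illustration] The three representations Martin lists (UTF-8 escaped as %HH, numeric character reference in decimal or hexadecimal, and raw octets in the document "charset") can be sketched in Python roughly as follows. This is only a sketch under assumptions: the function name url_char_forms and its charset parameter are made up for illustration and come from no draft or library, and iso-8859-1 is assumed as the default document "charset".

```python
def url_char_forms(ch, charset="iso-8859-1"):
    """Return the representations of one character discussed in the message.

    Hypothetical helper for illustration only.
    """
    # 1) Encode as UTF-8, then escape every octet as %HH.
    percent = "".join("%{:02X}".format(octet) for octet in ch.encode("utf-8"))
    # 2) Numeric character reference: decimal form and the hexadecimal variant.
    ncr_decimal = "&#{};".format(ord(ch))
    ncr_hex = "&#x{:X};".format(ord(ch))
    # 3) Raw octets in the document "charset" (assumed iso-8859-1 by default).
    raw = ch.encode(charset)
    return percent, ncr_decimal, ncr_hex, raw


for letter in ("w", "\u00fc"):  # "w" and u-umlaut, as in the examples
    print(url_char_forms(letter))
```

Note that form 3) raises UnicodeEncodeError when the character has no representation in the chosen "charset", which is one reason forms 1) and 2) remain necessary.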
Received on Saturday, 17 May 1997 11:31:35 UTC