Date: Mon, 24 Feb 1997 17:09:06 +0100 (MET)
From: "Martin J. Duerst" <firstname.lastname@example.org>
To: "Alain LaBont/e'/" <email@example.com>
Cc: Francois Yergeau <firstname.lastname@example.org>,
Subject: Re: URL internationalization!
In-Reply-To: <9702211454.AA12501@socrate.riq.qc.ca>
Message-Id: <Pine.SUN.3.95q.970224164714.245O-100000@enoshima>

On Fri, 21 Feb 1997, Alain LaBont/e'/ wrote:

> @ 23:11 97-02-20 -0500, Francois Yergeau écrit :
>
> [given 8 bit per byte encoding]
>
> >Right. In fact, not only the system MUST NOT crash, but it SHOULD behave
> >the same as if it had received the corresponding %XX.
>
> Génial!

Sorry, but it's not exactly as genial as it looks.

As an example, let's take a resource name with a G with breve
(U+011E). Let's assume that on the server, resource names are encoded
in iso-8859-3. Then the G with breve appears as %AB in a well-formed
URL.

Now suppose somebody puts that URL into an HTML document that is
encoded in iso-8859-3, in 8-bit form (i.e. the URL contains the octet
0xAB for the G with breve character), and that the document is
correctly tagged as iso-8859-3.

Now assume a browser sends a request with

    Accept-Charset: iso-8859-9

The server (or a proxy) translates the whole document from iso-8859-3
to iso-8859-9 to honor the request of the browser. The G with breve
gets changed to 0xD0.

The client receives the 0xD0. If it "behaves the same as if it had
received the corresponding %XX", i.e. %D0, the URL will not work at
all, because the server only knows the resource as %AB.

This is difficult to fix in the short term, but in the long term, once
the convention that URLs use UTF-8 becomes popular, the client
shouldn't "behave the same", but should take the character (namely the
G with breve), encode it as UTF-8 and then with %HH, and then send it
to the server.
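To make the octets in this example concrete, here is a minimal sketch in Python. It is only an illustration: the helper name percent_encode and the variable names are made up for this sketch, and the target charset is taken to be iso-8859-9, which is an assumption chosen because that is where G with breve sits at 0xD0.

```python
from urllib.parse import quote

G_BREVE = "\u011e"  # LATIN CAPITAL LETTER G WITH BREVE


def percent_encode(octets: bytes) -> str:
    """Escape every octet outside the unreserved set as %HH (hypothetical helper)."""
    return quote(octets, safe="")


# What the server expects: resource names are stored in iso-8859-3.
on_server = G_BREVE.encode("iso8859_3")

# A server or proxy transcodes the document, and the 8-bit URL inside it,
# to iso-8859-9 (assumed target charset for this sketch).
after_transcoding = on_server.decode("iso8859_3").encode("iso8859_9")

# Possibility 1: treat the received octet as an octet and escape it.
octet_form = percent_encode(after_transcoding)

# Possibility 2: treat it as a character, encode as UTF-8, then escape.
char = after_transcoding.decode("iso8859_9")
utf8_form = percent_encode(char.encode("utf-8"))

print(percent_encode(on_server), octet_form, utf8_form)
# prints: %AB %D0 %C4%9E
```

The first value is what the server knows, the second is what a client sends if it escapes the transcoded octet directly, and the third is what it sends under the UTF-8 convention. Only the third survives transcoding unchanged, since it depends on the character and not on the charset the document happened to arrive in.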
If we make recommendations as to what to do with an 8-bit encoded URL,
we should definitely mention both possibilities, namely:

- Interpret as octet directly and convert it to %HH
- Interpret as character and convert to UTF-8 and then to %HH

With this, we cover two cases:

- The URL wasn't transcoded (not guaranteed, but quite frequent)
- The server uses UTF-8 to encode characters (will become more and
  more frequent)

The third case, namely that the URL gets transcoded, but the server
doesn't support UTF-8, would be very difficult to cover, and is
unrelated to the proposal of introducing UTF-8 as a recommended
character encoding for URLs.

Regards,	Martin.