- From: Martin J. Duerst <mduerst@ifi.unizh.ch>
- Date: Mon, 24 Feb 1997 17:09:06 +0100 (MET)
- To: "Alain LaBont/e'/" <alb@sct.gouv.qc.ca>
- Cc: Francois Yergeau <yergeau@alis.com>, "Roy T. Fielding" <fielding@kiwi.ICS.UCI.EDU>, URI mailing list <uri@bunyip.com>
On Fri, 21 Feb 1997, Alain LaBont/e'/ wrote:

> @ 23:11 97-02-20 -0500, Francois Yergeau écrit :
>
> [given 8 bit per byte encoding]
>
> >Right. In fact, not only the system MUST NOT crash, but it SHOULD behave
> >the same as if it had received the corresponding %XX.
>
> Génial!

Sorry, but it's not exactly as genial as it looks.

As an example, let's take a resource name with a G with breve (U+011E).
Let's assume that on the server, resource names are encoded in iso-8859-3.
Then the G with breve appears as %AB in a well-formed URL.

Now suppose somebody puts that URL into an HTML document that is encoded
in iso-8859-3, in 8-bit form (i.e. the URL contains the octet 0xAB for the
G with breve character), and that the document is correctly tagged as
iso-8859-3.

Now assume a browser sends a request with

    Accept-Charset: iso-8859-5

The server (or a proxy) translates the whole document from iso-8859-3 to
iso-8859-5 to honor the request of the browser. The G with breve gets
changed to 0xD0. The client receives the 0xD0. If it "behaves the same as
if it had received the corresponding %XX", i.e. %D0, the URL will not work
at all.

This is difficult to fix in the short term. In the long term, however,
once the convention that URLs use UTF-8 becomes popular, the client
shouldn't "behave the same". Instead, it should take the character (namely
the G with breve), encode it as UTF-8, escape the result as %HH, and then
send it to the server.
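The scenario above can be traced octet by octet. What follows is only a minimal sketch in Python, not anything taken from the thread itself. One assumption needs flagging: in Python's codec tables, G with breve sits at 0xD0 in iso-8859-9 (Latin-5), while iso-8859-5 is a Cyrillic set that does not contain the character, so the sketch uses iso-8859-9 as the target charset to reproduce the 0xD0 octet from the example. The variable names are illustrative.

```python
from urllib.parse import quote_from_bytes

G_BREVE = "\u011e"  # LATIN CAPITAL LETTER G WITH BREVE

# The server stores resource names in iso-8859-3, so the well-formed
# URL escapes the iso-8859-3 octet for the character.
server_octets = G_BREVE.encode("iso-8859-3")
well_formed = quote_from_bytes(server_octets)

# A server or proxy transcodes the whole document, URL included.
# ASSUMPTION: iso-8859-9 stands in for the target charset, because it is
# the one in which this character maps to 0xD0.
transcoded_octets = server_octets.decode("iso-8859-3").encode("iso-8859-9")

# A client that "behaves the same as if it had received %XX" escapes the
# octet it actually received, which no longer matches the server's name.
naive = quote_from_bytes(transcoded_octets)

# The long-term alternative: escape the UTF-8 form of the character,
# which does not depend on the charset of the document.
utf8_form = quote_from_bytes(G_BREVE.encode("utf-8"))

print(well_formed, naive, utf8_form)
```

The first two values differ, which is exactly why the request fails; the third is the same no matter how the document was transcoded along the way.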
If we make recommendations as to what to do with an 8-bit encoded URL,
we should definitely mention both possibilities, namely:

- Interpret as octet directly and convert it to %HH
- Interpret as character and convert to UTF-8 and then to %HH

With this, we cover two cases:

- The URL wasn't transcoded (not guaranteed, but quite frequent)
- The server uses UTF-8 to encode characters (will become more and
  more frequent)

The third case, namely that the URL gets transcoded, but the server
doesn't support UTF-8, would be very difficult to cover, and is unrelated
to the proposal of introducing UTF-8 as a recommended character encoding
for URLs.

Regards,	Martin.
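The two possibilities listed in the message can be sketched as a small helper that produces both escaped forms for an 8-bit URL octet sequence. This is an illustration under stated assumptions, not a recommended implementation: the function name `candidate_escapes` and its parameters are hypothetical, and the document's charset label is assumed to be known and correct.

```python
from typing import Tuple
from urllib.parse import quote_from_bytes


def candidate_escapes(raw: bytes, doc_charset: str) -> Tuple[str, str]:
    """Return (octet form, UTF-8 form) for raw 8-bit URL octets.

    Hypothetical helper: `raw` holds the octets as they appear in the
    document, `doc_charset` is the charset the document is tagged with.
    """
    # Possibility 1: treat the octets as-is and escape them as %HH.
    as_octets = quote_from_bytes(raw)
    # Possibility 2: treat the octets as characters in the document's
    # charset, re-encode as UTF-8, then escape as %HH.
    as_utf8 = quote_from_bytes(raw.decode(doc_charset).encode("utf-8"))
    return as_octets, as_utf8


# Untranscoded iso-8859-3 document: the octet form matches a server that
# stores names in iso-8859-3.
print(candidate_escapes(b"\xab", "iso-8859-3"))

# Transcoded document (iso-8859-9 assumed, as the charset where the
# character is 0xD0): only the UTF-8 form is stable, and it helps only
# if the server uses UTF-8.
print(candidate_escapes(b"\xd0", "iso-8859-9"))
```

In both calls the UTF-8 form is identical, while the octet form follows whatever charset the document happened to arrive in. Neither form helps in the third case, where the URL was transcoded and the server does not use UTF-8.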
Received on Monday, 24 February 1997 11:09:19 UTC