Re: URL internationalization!

Martin J. Duerst (mduerst@ifi.unizh.ch)
Mon, 24 Feb 1997 17:09:06 +0100 (MET)


Date: Mon, 24 Feb 1997 17:09:06 +0100 (MET)
From: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
To: "Alain LaBonté" <alb@sct.gouv.qc.ca>
Cc: Francois Yergeau <yergeau@alis.com>,
Subject: Re: URL internationalization!
In-Reply-To: <9702211454.AA12501@socrate.riq.qc.ca>
Message-Id: <Pine.SUN.3.95q.970224164714.245O-100000@enoshima>

On Fri, 21 Feb 1997, Alain LaBonté wrote:

> At 23:11 97-02-20 -0500, Francois Yergeau wrote:
> 
> [given 8 bit per byte encoding]
> 
> >Right.  In fact, not only the system MUST NOT crash, but it SHOULD behave
> >the same as if it had received the corresponding %XX.
> 
> Brilliant!

Sorry, but it's not exactly as brilliant as it looks. As an example,
let's take a resource name with a G with breve (U+011E). Let's
assume that on the server, resource names are encoded in iso-8859-3.
Then the G with breve appears as %AB in a well-formed
URL. Now suppose somebody put that URL into an HTML document
that is encoded in iso-8859-3, in 8-bit form (i.e. the URL contains
the octet 0xAB for the G with breve character), and that that
document is correctly tagged as iso-8859-3.

Now assume a browser sends a request with
	Accept-Charset: iso-8859-9
The server (or a proxy) translates the whole document from
iso-8859-3 to iso-8859-9 to honor the browser's request.
In iso-8859-9 the G with breve is the octet 0xD0, so that is what
the client receives. If it "behaves the same as if it had received the
corresponding %XX", i.e. %D0, the URL will not work at all,
because the server expects %AB.
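
The failure can be reproduced in a few lines. This is only a sketch,
not part of any proposal: it assumes Python's standard codecs and
urllib.parse.quote, and it takes the target charset to be iso-8859-9,
the ISO 8859 part in which G with breve occupies 0xD0. The variable
names are made up for illustration.

```python
from urllib.parse import quote

G_BREVE = "\u011e"  # LATIN CAPITAL LETTER G WITH BREVE

# The octet the server uses for the resource name (iso-8859-3).
server_octet = G_BREVE.encode("iso-8859-3")
well_formed = quote(server_octet)          # the escape the server expects

# A server or proxy transcodes the document for the client.
received_octet = server_octet.decode("iso-8859-3").encode("iso-8859-9")

# The client escapes the octet it received, without interpreting it.
client_escape = quote(received_octet)

print(well_formed, client_escape, well_formed == client_escape)
# prints: %AB %D0 False
```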

This is difficult to fix in the short term, but in the long
term, once the convention that URLs use UTF-8 becomes popular,
the client shouldn't "behave the same", but should take the
character (namely the G with breve), encode it as UTF-8
and then with %HH, and then send it to the server. If we make
recommendations as to what to do with an 8-bit encoded
URL, we should definitely mention both possibilities,
namely:

- Interpret as octet directly and convert it to %HH
- Interpret as character and convert to UTF-8 and then to %HH

With this, we cover two cases:

- The URL wasn't transcoded (not guaranteed, but quite frequent)
- The server uses UTF-8 to encode characters (will become
	more and more frequent)
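
The two interpretations and the cases they cover can be sketched as
follows. The function names are hypothetical, the charsets are the
ones from the example above (with iso-8859-9 assumed as the transcoding
target), and urllib.parse.quote stands in for whatever escaping
routine a client really uses.

```python
from urllib.parse import quote

def escape_as_octet(octet: bytes) -> str:
    """Interpret the octet directly and convert it to %HH."""
    return quote(octet)

def escape_as_character(octet: bytes, document_charset: str) -> str:
    """Interpret the octet as a character in the document's charset,
    convert the character to UTF-8, and then convert that to %HH."""
    return quote(octet.decode(document_charset), encoding="utf-8")

# Case 1: the URL was not transcoded; the server uses iso-8859-3.
print(escape_as_octet(b"\xab"))                    # %AB

# Case 2: the server uses UTF-8. The result is the same whether or
# not the document was transcoded on the way to the client.
print(escape_as_character(b"\xab", "iso-8859-3"))  # %C4%9E
print(escape_as_character(b"\xd0", "iso-8859-9"))  # %C4%9E
```

Neither function yields %AB from the transcoded octet 0xD0, so a
transcoded URL for a server that does not use UTF-8 stays uncovered.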

The third case, namely that the URL gets transcoded, but the
server doesn't support UTF-8, would be very difficult to
cover, and is unrelated to the proposal of introducing
UTF-8 as a recommended character encoding for URLs.

Regards,	Martin.