Re: URL internationalization!

Martin J. Duerst (mduerst@ifi.unizh.ch)
Wed, 26 Feb 1997 15:04:48 +0100 (MET)


Date: Wed, 26 Feb 1997 15:04:48 +0100 (MET)
From: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
To: Jonathan Rosenne <Jonathan_Rosenne@CompuServe.com>
Cc: URI List <uri@bunyip.com>
Subject: Re: URL internationalization!
In-Reply-To: <199702251306_MC2-11B1-87E5@compuserve.com>
Message-Id: <Pine.SUN.3.95q.970226145545.245G-100000@enoshima>

On Tue, 25 Feb 1997, Jonathan Rosenne wrote:

> > As an example,
> > let's take a resource name with a G with breve (U+011E). Let's
> > assume that on the server, resource names are encoded in iso-8859-3.
> > Then the G with breve appears as %AB in a well-formed
> > URL. Now suppose somebody put that URL into an HTML document
> > that is encoded in iso-8859-3, in 8-bit form (i.e. the URL contains
> > the octet 0xAB for the G with breve character), and that that
> > document is correctly tagged as iso-8859-3.
> >
> > Now assume a browser sends a request with
> >       Accept-Charset: iso-8859-9
> > The server (or a proxy) translates the whole document from
> > iso-8859-3 to iso-8859-9 to honor the request of the browser.
> > The G with breve gets changed to 0xD0. The client receives
> > the 0xD0. If it "behaves the same as if it had received the
> > corresponding %XX", i.e. %D0, the URL will not work at all.
> 
> I don't understand. What if the user uses 8859-8, which has no G-breve? I
> mean, what if it says Accept-Charset: iso-8859-8?

That depends on the sophistication of the transcoding
server/proxy. For (i18n) HTML, the obvious solution is to
replace the G-breve with &#286;, the numeric character reference
for U+011E (286 in decimal).
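
A minimal sketch of that replacement, assuming Python's standard codecs
stand in for the transcoding server/proxy. The function name
transcode_html is hypothetical; the xmlcharrefreplace error handler
substitutes a decimal numeric character reference for each character
the target charset cannot represent:

```python
def transcode_html(text: str, target_charset: str) -> bytes:
    """Encode HTML text in target_charset, falling back to numeric
    character references for characters the charset cannot represent."""
    return text.encode(target_charset, errors="xmlcharrefreplace")


g_breve = "\u011e"  # G with breve

# iso-8859-3 has the character, so it becomes a single octet.
print(transcode_html(g_breve, "iso-8859-3"))  # b'\xab'

# iso-8859-8 does not, so a numeric character reference is emitted.
print(transcode_html(g_breve, "iso-8859-8"))  # b'&#286;'
```

Note that this works without knowing whether the character sits inside
a URL or in ordinary text.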

For formats other than HTML, we might be out of luck. The server/
proxy may convert it to a sequence %HH%HH corresponding to G-breve
in UTF-8 if it is sure that the G-breve is in a URL. But it is
much more difficult to decide what could be a URL in an arbitrary
format than to replace all unrepresentable characters by numeric
character references in HTML (which can be done irrespective of
whether the character is in a URL or something else).
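
For the %HH%HH case, a small sketch under the assumption that the
G-breve has already been identified as part of a URL, which is exactly
the hard part. It uses urllib.parse.quote from the Python standard
library, which percent-encodes non-ASCII characters as UTF-8 by
default:

```python
from urllib.parse import quote, unquote

g_breve = "\u011e"  # G with breve

# UTF-8 form: two octets, each written as %HH.
utf8_form = quote(g_breve)
print(utf8_form)  # %C4%9E

# Form based on the server's own encoding in the example (iso-8859-3).
native_form = quote(g_breve, encoding="iso-8859-3")
print(native_form)  # %AB

# Both forms decode back to the same character, but only if the
# receiver knows which encoding was used for the escaped octets.
assert unquote(utf8_form) == g_breve
assert unquote(native_form, encoding="iso-8859-3") == g_breve
```

Whether a server actually resolves the UTF-8 form is outside this
sketch; the server in the example above stores its names in iso-8859-3
and would expect %AB.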

This is an additional reason why we should be careful with
the introduction of natively encoded URLs, and why I am refraining
for the moment from fully proposing it.


Regards,	Martin.