Re: URL internationalization!

Dan Oscarsson (Dan.Oscarsson@trab.se)
Tue, 25 Feb 1997 14:09:38 +0100 (MET)


Date: Tue, 25 Feb 1997 14:09:38 +0100 (MET)
From: Dan Oscarsson <Dan.Oscarsson@trab.se>
Message-Id: <199702251309.OAA06854@valinor.malmo.trab.se>
To: alb@sct.gouv.qc.ca, mduerst@ifi.unizh.ch
Subject: Re: URL internationalization!
Cc: yergeau@alis.com, fielding@kiwi.ICS.UCI.EDU, uri@bunyip.com



There are a few more things to think about regarding 8 bits versus %XX.

> > [given 8 bit per byte encoding]
> > 
> > >Right.  In fact, not only the system MUST NOT crash, but it SHOULD behave
> > >the same as if it had received the corresponding %XX.
> As an example,
> let's take a resource name with a G with breve (U+011E). Let's
> assume that on the server, resource names are encoded in iso-8859-3.
> Then the G with breve appears as %AB in a well-formed
> URL. Now suppose somebody put that URL into an HTML document
> that is encoded in iso-8859-3, in 8-bit form (i.e. the URL contains
> the octet 0xAB for the G with breve character), and that that
> document is correctly tagged as iso-8859-3.
> 
> Now assume a browser sends a request with
> 	Accept-Charset: iso-8859-5
> The server (or a proxy) translates the whole document from
> iso-8859-3 to iso-8859-5 to honor the request of the browser.
> The G with breve gets changed to 0xD0. The client receives
> the 0xD0. If it "behaves the same as if it had received the
> corresponding %XX", i.e. %D0, the URL will not work at all.

As Martin points out, there are a few problems. They exist mostly
because URLs as used today do not define how to encode characters
outside ASCII.

If we define that a URL is always sent in a transport format in which
all characters are encoded using UTF-8, there is no problem.
In an HTML document the URL can be represented using 8-bit octets
encoded in iso 8859-3 as above. When the document is transcoded
(are there any servers that do that?) the URL is changed into the
same URL, but encoded in iso 8859-5 (the G with breve is still a
G with breve). When the browser that requested the iso 8859-5 form
of the HTML document follows a link and sends the URL to a
web server, it will encode the URL using the transport format
(that is, using UTF-8). The server will decode the UTF-8 and convert
it into iso 8859-3, if that is the character set used on the server.
No problem here with 8-bit bytes.
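
The round trip described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not a
proposed implementation: the function names to_transport and
from_transport are made up for this example, and the second charset
is iso-8859-9 (an assumption, chosen because Python's codec tables
place G with breve at 0xD0 there, while its iso-8859-5 table has no
such character).

```python
from urllib.parse import quote, unquote_to_bytes

def to_transport(local_octets: bytes, local_charset: str) -> str:
    """Client side (hypothetical helper): local 8-bit octets -> UTF-8 -> %XX."""
    text = local_octets.decode(local_charset)      # octets -> characters
    return quote(text.encode("utf-8"), safe="/")   # characters -> UTF-8 -> %XX

def from_transport(url_path: str, server_charset: str) -> bytes:
    """Server side (hypothetical helper): %XX -> UTF-8 -> server's own charset."""
    text = unquote_to_bytes(url_path).decode("utf-8")
    return text.encode(server_charset)

# The same character, stored as different octets in two local charsets,
# yields one and the same transport form.
from_latin3 = to_transport(b"/dir/\xab.html", "iso8859_3")
from_latin5 = to_transport(b"/dir/\xd0.html", "iso8859_9")
print(from_latin3, from_latin5)

# The server recovers the octets it uses for its own resource names.
print(from_transport(from_latin3, "iso8859_3"))
```

Because the octets on the wire depend only on the character, not on
the charset of the document the link came from, transcoding the
document no longer breaks the link.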

The difficulty today is that no defined handling of non-ASCII
characters in URLs exists. Either they must be in %XX form, or
transcoding must not occur and octets in URLs must not be changed.
It is this mess we want to remove by defining UTF-8 as the transport
format for URLs. By doing that there is a defined way to use most
characters in the world in a URL. The UTF-8 encoded URL can, by
%XX encoding, be both transported over 7-bit media and printed on paper
so that many people in the world can enter it on a keyboard. And it can
be presented in the local character set to make it user friendly.
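
As a rough sketch of those two presentations, assuming the UTF-8
transport format were adopted: printable_form gives the ASCII-only
%XX form, and display_form gives the readable local form. Both names
are hypothetical, and falling back to %XX when the local character
set lacks a character is one possible policy, not something that has
been agreed on.

```python
from urllib.parse import quote, unquote

def printable_form(text_url: str) -> str:
    """ASCII-only %XX form, usable on 7-bit media or on paper."""
    return quote(text_url, safe="/:", encoding="utf-8")

def display_form(transport_url: str, local_charset: str) -> str:
    """Readable form; keeps %XX if the local charset lacks a character."""
    text = unquote(transport_url, encoding="utf-8", errors="strict")
    try:
        text.encode(local_charset)   # can the local charset show it?
    except UnicodeEncodeError:
        return transport_url         # assumed fallback: leave %XX as is
    return text

ascii_url = printable_form("/dir/\u011e.html")   # U+011E is G with breve
print(ascii_url)
print(display_form(ascii_url, "iso8859_3"))   # charset has the character
print(display_form(ascii_url, "iso8859_5"))   # charset lacks the character
```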

The quicker we can change to a standard way to represent non-ASCII
characters in the transport format of a URL, the quicker the current
problems will go away.

I know Masataka prefers ISO 2022, but from what I have seen, most are
planning to support ISO 10646/Unicode and not ISO 2022. As time goes
on, added information in documents will probably fix the problems
Masataka sees in UCS.

Regards,

        Dan