Re: URL internationalization! from Dan Oscarsson on 1997-02-25 (uri@w3.org from February 1997)

From: Dan Oscarsson <Dan.Oscarsson@trab.se>
Date: Tue, 25 Feb 1997 14:09:38 +0100 (MET)
To: alb@sct.gouv.qc.ca, mduerst@ifi.unizh.ch
Cc: yergeau@alis.com, fielding@kiwi.ICS.UCI.EDU, uri@bunyip.com
Message-Id: <199702251309.OAA06854@valinor.malmo.trab.se>


There are a few more things to think about regarding 8 bits versus %XX.

> > [given 8 bit per byte encoding]
> > 
> > >Right.  In fact, not only the system MUST NOT crash, but it SHOULD behave
> > >the same as if it had received the corresponding %XX.
> As an example,
> let's take a resource name with a G with breve (U+011E). Let's
> assume that on the server, resource names are encoded in iso-8859-3.
> Then the G with breve appears as %AB in a well-formed
> URL. Now suppose somebody put that URL into an HTML document
> that is encoded in iso-8859-3, in 8-bit form (i.e. the URL contains
> the octet 0xAB for the G with breve character), and that that
> document is correctly tagged as iso-8859-3.
> 
> Now assume a browser sends a request with
> 	Accept-Charset: iso-8859-5
> The server (or a proxy) translates the whole document from
> iso-8859-3 to iso-8859-5 to honor the request of the browser.
> The G with breve gets changed to 0xD0. The client receives
> the 0xD0. If it "behaves the same as if it had received the
> corresponding %XX", i.e. %D0, the URL will not work at all.

As Martin points out, there are a few problems. They exist mostly
because the URL as used today does not define how to encode characters
outside ASCII.

If we define that a URL sent in the transport format has all its
characters encoded using UTF-8, there is no problem.
In an HTML document the URL can be represented using 8-bit octets
encoded in iso 8859-3 as above. When the document is transcoded
(are there any servers that do that?) the URL is changed into the
same URL, but encoded in iso 8859-5 (the G with breve is still a
G with breve). When the browser that requested the iso 8859-5 format
of the HTML document follows a link and sends the URL to a
web server, it will encode the URL using the transport format
(that is, using UTF-8). The server will decode the UTF-8 and convert
it into iso 8859-3, if that is the set used on the server.
No problem here with 8-bit bytes.
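
The round trip described above can be sketched in a few lines of Python.
This is only an illustration, not a proposal for how a browser or server
must be written. The function names are hypothetical. One assumption to
note: the sketch uses iso-8859-9 as the transcoding target, because in
Python's codec tables that is the charset that places G with breve at
0xD0, the octet used in the quoted example.

```python
from urllib.parse import quote, unquote_to_bytes

SERVER_CHARSET = "iso-8859-3"  # charset of resource names on the server
PAGE_CHARSET = "iso-8859-9"    # assumption: target charset with G-breve at 0xD0


def transcode(octets, source, target):
    """What a transcoding server or proxy does to the document octets."""
    return octets.decode(source).encode(target)


def to_transport(url_octets, page_charset):
    """Browser side: local octets -> characters -> UTF-8 -> %XX."""
    return quote(url_octets.decode(page_charset).encode("utf-8"))


def from_transport(transport_url, server_charset):
    """Server side: %XX -> UTF-8 octets -> characters -> local octets."""
    return unquote_to_bytes(transport_url).decode("utf-8").encode(server_charset)


original = b"\xab"                                      # G with breve in iso-8859-3
transcoded = transcode(original, SERVER_CHARSET, PAGE_CHARSET)
naive = quote(transcoded)                               # raw octet sent as %XX
wire = to_transport(transcoded, PAGE_CHARSET)           # UTF-8 transport form
stored = from_transport(wire, SERVER_CHARSET)           # what the server looks up
```

Here `naive` no longer matches the `%AB` the server expects, while
`stored` comes back as the original octet 0xAB, whichever charset the
document was transcoded into on the way.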

The difficulty today is that there is no defined handling of non-ASCII
characters in URLs. Either they must be in %XX form, or
transcoding must not occur and octets in URLs must not be changed.
It is this mess we want to remove by defining UTF-8 as the transport
format for URLs. By doing that there is a defined way to use most
characters in the world in a URL. The UTF-8 encoded URL can, by
%XX encoding, be both transported over 7-bit media and printed on paper so
that many people in the world can enter it on a keyboard. And it can
be presented in the local character set to make it user friendly.
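
As a small sketch of the two forms, assuming Python's urllib.parse and a
made-up URL (example.org and the resource name are hypothetical), the
same URL can be turned into its 7-bit %XX form and back into a form for
display:

```python
from urllib.parse import quote, unquote


def to_seven_bit(url_text):
    """Encode as UTF-8, then %XX-escape every octet outside plain ASCII."""
    return quote(url_text, safe="/:", encoding="utf-8")


def to_display(seven_bit_url):
    """Undo the %XX escapes and decode the octets as UTF-8 for presentation."""
    return unquote(seven_bit_url, encoding="utf-8")


url_text = "http://example.org/\u011eazi"   # hypothetical URL with a G with breve
printed = to_seven_bit(url_text)            # safe for 7-bit media and for paper
shown = to_display(printed)                 # what the user sees
local = shown.encode("iso-8859-3")          # octets for an iso 8859-3 display
```

The `printed` form contains only ASCII, and `shown` is the original text
again, so nothing is lost by going through the %XX form.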

The quicker we can change to a standard way to represent non-ASCII
characters in the transport format of a URL, the quicker the current
problems will go away.

I know Masataka prefers ISO 2022, but from what I have seen, most are
planning to support ISO 10646/Unicode and not ISO 2022. As time goes
on, added information in documents will probably fix the problems
Masataka sees in UCS.

Regards,

        Dan

Received on Tuesday, 25 February 1997 08:10:04 UTC