- From: Dan Oscarsson <Dan.Oscarsson@trab.se>
- Date: Tue, 25 Feb 1997 14:09:38 +0100 (MET)
- To: alb@sct.gouv.qc.ca, mduerst@ifi.unizh.ch
- Cc: yergeau@alis.com, fielding@kiwi.ICS.UCI.EDU, uri@bunyip.com
There are a few more things to think about 8 bits versus %XX.

> > [given 8 bit per byte encoding]
> >
> > >Right. In fact, not only the system MUST NOT crash, but it SHOULD behave
> > >the same as if it had received the corresponding %XX.
> As an example,
> let's take a resource name with a G with breve (U+011E). Let's
> assume that on the server, resource names are encoded in iso-8859-3.
> Then the G with breve appears as %AB in a well-formed
> URL. Now suppose somebody put that URL into an HTML document
> that is encoded in iso-8859-3, in 8-bit form (i.e. the URL contains
> the octet 0xAB for the G with breve character), and that that
> document is correctly tagged as iso-8859-3.
>
> Now assume a browser sends a request with
> Accept-Charset: iso-8859-5
> The server (or a proxy) translates the whole document from
> iso-8859-3 to iso-8859-5 to honor the request of the browser.
> The G with breve gets changed to 0xD0. The client receives
> the 0xD0. If it "behaves the same as if it had received the
> corresponding %XX", i.e. %D0, the URL will not work at all.

As Martin points out, there are a few problems. They exist mostly
because the URL as used today does not define how to encode
characters outside ASCII.

If we define that a URL is always sent in a transport format with all
characters encoded using UTF-8, there is no problem. In an HTML
document the URL can be represented using 8-bit octets encoded in
iso 8859-3, as above. When the document is transcoded (are there any
servers that do that?) the URL is changed into the same URL, but
encoded in iso 8859-5 (the G with breve is still a G with breve).
When the browser that requested the iso 8859-5 format of the HTML
document follows a link and sends the URL to a web server, it will
encode the URL using the transport format (that is, using UTF-8).
The server will decode the UTF-8 and convert it into iso 8859-3, if
that is the set used on the server. No problem here with 8-bit bytes.
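
The round trip described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not anything proposed in the thread: the helper names `transcode`, `to_transport` and `from_transport` are hypothetical, and the target charset is taken to be ISO-8859-9 (Latin-5), because that is the set in which G with breve is the octet 0xD0 (Python's `iso8859_5` codec is Cyrillic and has no such letter).

```python
from urllib.parse import quote, unquote_to_bytes

G_BREVE = "\u011e"  # LATIN CAPITAL LETTER G WITH BREVE


def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Re-encode a document (embedded URLs included) from src to dst."""
    return data.decode(src).encode(dst)


def to_transport(url_chars: str) -> str:
    """Hypothetical client step: UTF-8 encode, then %XX-escape the octets."""
    return quote(url_chars, safe="/", encoding="utf-8")


def from_transport(url: str, server_charset: str) -> bytes:
    """Hypothetical server step: undo %XX, decode UTF-8, map to local octets."""
    return unquote_to_bytes(url).decode("utf-8").encode(server_charset)


# The URL as stored on the server and as written in the Latin-3 document.
url_on_server = ("/" + G_BREVE).encode("iso8859_3")      # octets 2F AB

# A server or proxy transcodes the document for the browser.
url_in_browser = transcode(url_on_server, "iso8859_3", "iso8859_9")

# Sending the raw octet back (or its %XX form) no longer matches the server.
naive_request = url_in_browser                           # octets 2F D0
octets_match = naive_request == url_on_server            # False

# With UTF-8 as transport format the browser decodes the URL using the
# document's charset, then sends the characters as UTF-8 in %XX form.
chars = url_in_browser.decode("iso8859_9")
request = to_transport(chars)                            # "/%C4%9E"
resolved = from_transport(request, "iso8859_3")          # octets 2F AB
```

The point of the sketch is that the octet changes under transcoding (0xAB becomes 0xD0) while the character does not, so a transport form defined on characters survives the round trip and a form defined on raw octets does not.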

The difficulty is the situation we have today, where no defined
handling of non-ASCII characters in URLs exists. Either they must be
in %XX form, or transcoding must not occur and octets in URLs must
not be changed.

It is this mess we want to remove by defining UTF-8 as the transport
format for URLs. By doing that there is a defined way to use most
characters in the world in a URL. The UTF-8 encoded URL can, by %XX
encoding, be both transported over 7-bit media and printed on paper
so that many people in the world can enter it on a keyboard. And it
can be presented in the local character set to make it user friendly.

The quicker we can change to a standard way to represent non-ASCII
characters in the transport format of a URL, the quicker the current
problems will go away.

I know Masataka prefers iso 2022, but from what I have seen, most
are planning to support ISO 10646/Unicode and not ISO 2022. As time
goes on, added information in documents will probably fix the
problems Masataka sees in UCS.

Regards,
Dan
Received on Tuesday, 25 February 1997 08:10:04 UTC