Date: Tue, 25 Feb 1997 15:02:57 +0100 (MET)
From: "Martin J. Duerst" <firstname.lastname@example.org>
To: Dan Oscarsson <Dan.Oscarsson@trab.se>
Cc: email@example.com, firstname.lastname@example.org, fielding@kiwi.ICS.UCI.EDU,
Subject: Re: URL internationalization!
In-Reply-To: <199702251309.OAA06854@valinor.malmo.trab.se>
Message-Id: <Pine.SUN.3.95q.970225143447.245G-100000@enoshima>

On Tue, 25 Feb 1997, Dan Oscarsson wrote:

> There are a few more things to think about 8 bits versus %XX.
>
> > > [given 8 bit per byte encoding]
> > >
> > > >Right. In fact, not only the system MUST NOT crash, but it SHOULD behave
> > > >the same as if it had received the corresponding %XX.
> > As an example,
> > let's take a resource name with a G with breve (U+011E). Let's
> > assume that on the server, resource names are encoded in iso-8859-3.
> > Then the G with breve appears as %AB in a well-formed
> > URL. Now suppose somebody put that URL into an HTML document
> > that is encoded in iso-8859-3, in 8-bit form (i.e. the URL contains
> > the octet 0xAB for the G with breve character), and that that
> > document is correctly tagged as iso-8859-3.
> >
> > Now assume a browser sends a request with
> > Accept-Charset: iso-8859-5
> > The server (or a proxy) translates the whole document from
> > iso-8859-3 to iso-8859-5 to honor the request of the browser.
> > The G with breve gets changed to 0xD0. The client receives
> > the 0xD0. If it "behaves the same as if it had received the
> > corresponding %XX", i.e. %D0, the URL will not work at all.
>
> As Martin points out there are a few problems, they exist mostly
> because the URL used today does not define how to encode characters
> outside ascii.
>
> If we define that an URL that is sent using the transport format of
> an URL with all characters encoded using UTF-8, it is no problem.
> In a html document the URL can be represented using 8-bit octets
> encoded in iso 8859-3 as of above.
> When the document is transcoded
> (are there any servers that do that?)

Yes, there are! Gavin or Francois can surely give examples.

> the URL is changed into the
> same URL, but encoded in iso 8859-5 (the G with breve is still a
> G with breve). When the browser that requested the iso 8859-5 format
> of the html document follows a link and sends the URL to a
> web server, it will encode the URL using the transport format
> (that is using UTF-8). The server will decode the UTF-8 and convert
> it into iso 8859-3, if that is the set used on the server.
> No problem here with 8-bit bytes.

Exactly. That's the core of "stage two" of my proposal. Everything
will work as expected, exactly as it already does for ASCII and
EBCDIC at the moment.

I very much understand that Dan would like to go to that stage
immediately. However, I decided to separate my proposal into two
stages, and am currently asking for "stage one" only, for the
following reasons:

- It is important that the convention to use UTF-8 (with %HH) gets
  sufficiently deployed before we seriously start to put URLs into
  HTML and such in native encoding.

- By mandating %HH, we are exactly parallel (except for the
  backwards compatibility issues) with URNs.

- The URN discussion has shown that many people are still sceptical
  about the correct treatment of non-ASCII characters upon
  transcoding, cut-and-paste, and so on. I didn't want to repeat
  this discussion; I think time will tell. Of course, if the people
  who opposed native non-ASCII encoding in the URN discussion are
  already convinced to the contrary, I wouldn't have any problems
  moving ahead (Keith, any comments :-?).

- Because sending around natively encoded URLs is already
  established practice (the syntax draft quite rightly contains a
  warning to this effect), even formally specifying that only the
  "canonical form" is allowed (i.e. %HH-escaping is mandated) will
  not be enforceable.
- "stage one" is completely independent and separate from
  "stage two". There is no technical need to move to "stage two"
  if we don't agree to do so.

- On the other hand, "stage two" is the natural consequence of
  "stage one", in the sense that (at least that's my prediction)
  once UTF-8 is seriously established for URLs, their native
  encoding, without %HH, will get deployed quickly. As a browser
  maker, I would definitely want to provide that feature to
  my users!

Regards,	Martin.
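
A minimal sketch of the round trip discussed above, in Python. It is an
illustration only, not part of any draft: the helper names
(transport_form, server_name, naive_escape) are hypothetical. Python's
iso-8859-5 codec has no G with breve, so the sketch assumes iso-8859-9
as the client charset, where that character is the octet 0xD0 used in
the example.

```python
from urllib.parse import quote, unquote

CHAR = "\u011e"  # LATIN CAPITAL LETTER G WITH BREVE


def transport_form(text: str) -> str:
    """Percent-escape the UTF-8 octets of text (the %HH transport form)."""
    return quote(text, safe="/", encoding="utf-8")


def server_name(escaped: str, server_charset: str) -> bytes:
    """Decode the transport form as UTF-8, then re-encode for the server."""
    text = unquote(escaped, encoding="utf-8", errors="strict")
    return text.encode(server_charset)


def naive_escape(document_octets: bytes) -> str:
    """Escape whatever octets the document carried (the failure case)."""
    return quote(document_octets, safe="/")


# Document as stored on the server: iso-8859-3, one octet for the character.
in_8859_3 = CHAR.encode("iso-8859-3")

# Same document after transcoding for the client (assumed: iso-8859-9).
transcoded = in_8859_3.decode("iso-8859-3").encode("iso-8859-9")

# Escaping raw octets depends on the document charset, so the link breaks.
print(naive_escape(in_8859_3))
print(naive_escape(transcoded))

# Escaping the character via UTF-8 gives one form whatever the charset.
print(transport_form(in_8859_3.decode("iso-8859-3")))
print(transport_form(transcoded.decode("iso-8859-9")))

# The server maps the transport form back to its own charset.
print(server_name(transport_form(CHAR), "iso-8859-3"))
```

The two naive_escape results differ, which is the broken-link case from
the quoted example. The two transport_form results are identical, and
server_name recovers the octet the server stores.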