- From: Dan Oscarsson <Dan.Oscarsson@trab.se>
- Date: Tue, 15 Apr 1997 15:50:11 +0200 (MET DST)
- To: uri@bunyip.com, fielding@kiwi.ICS.UCI.EDU
- Cc: Harald.T.Alvestrand@uninett.no
> PROBLEM 1: Users in network environments where non-ASCII characters
> are the norm would prefer to use language-specific characters
> in their URLs, rather than ASCII translations.
>
> Proposal 1a: Do not allow such characters, since the URL is an address
> and not a user-friendly string. Obviously, this solution
> causes non-Latin character users to suffer more than people
> who normally use Latin characters, but is known to interoperate
> on all Internet systems.

Well, Swedish letters like åäö are normally called Latin, but I assume
you mean ASCII.

This proposal is NOT acceptable. It is very important that URLs are
user-friendly. It would also make Java impossible to use over the web!
Java allows non-ASCII characters in variable names and type names (at
least one language that is international!), and this means that it must
be possible to fetch non-ASCII Java classes over the web if Java is
going to work.

> Proposal 1b: Allow such characters, provided that they are encoded using
> a charset which is a superset of ASCII. Clients may display
> such URLs in the same charset of their retrieval context,
> in the data-entry charset of a user's dialog, as %xx encoded
> bytes, or in the specific charset defined for a particular
> URL scheme (if that is the case). Authors must be aware that
> their URL will not be widely accessible, and may not be safely
> transportable via 7-bit protocols, but that is a reasonable
> trade-off that only the author can decide.

If the URL is %xx encoded, it works over 7-bit transports. The URL is
also accessible from anywhere that ASCII can be used, because every URL
can be %xx encoded, which gives an ASCII-only form of the URL.

I dislike this proposal because the non-ASCII characters are in an
undefined character set, so a client cannot know how to interpret the
characters and display them correctly in the local character set.

> Proposal 1c: Allow such characters, but only when encoded as UTF-8.
> Clients may only display such characters if they have a
> UTF-8 font or a translation table. Servers are required to
> filter all generated URLs through a translation table, even
> when none of their URLs use non-Latin characters. Browsers
> are required to translate all FORM-based GET request data
> to UTF-8, even when the browser is incapable of using UTF-8
> for data entry.

> raw bits. The server would be required to interpret all URL characters
> as characters, rather than the current situation in which the server's
> namespace is distributed amongst its interpreting components, each of which
> may have its own charset (or no charset). Even if we were to make such
> a change, it would be a disaster since we would have to find a way to
> distinguish between clients that send UTF-8 encoded URLs and all of those
> currently in existence that send the same charset as is used by the HTML
> (or other media type) page in which the FORM was obtained and entered
> by the user.

I think you are missing one important thing. The UTF-8 encoded URL is a
transport format: if a URL is embedded within an iso 8859-1 encoded
HTML document, the URL is encoded using iso 8859-1. A URL should only
be encoded using a well-defined character set, like the UTF-8 encoding,
when transmitted in a protocol that says that a URL is part of the
protocol. When a URL is embedded in something else, like an HTML
document, printed on paper, or displayed on a screen, the URL should be
encoded using the same character set as the object it is embedded in.

As a browser knows the character set used in an HTML document, it can
easily translate the URL from, for example, iso 8859-1 to UTF-8 for
transmission in the protocol. When a URL is sent embedded in an HTML
document (or in a form), it should use the same encoding as the
document or the form.
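The transcoding step described above can be sketched in a few lines of
Python (a present-day illustration of the idea, not code from this
discussion; the function names and the conservative "safe" byte set are
my own choices):

```python
# Sketch: the same Swedish URL characters give different %xx escapes
# depending on which charset produced the bytes (the ambiguity of
# Proposal 1b), while a browser that knows the document's charset can
# transcode the URL to a single well-defined UTF-8 transport form.

def pct_encode(data: bytes) -> str:
    """%xx-escape every byte outside a conservative ASCII subset."""
    safe = set(b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
               b"abcdefghijklmnopqrstuvwxyz"
               b"0123456789-._~/")
    return "".join(chr(b) if b in safe else "%%%02X" % b for b in data)

def url_to_wire(raw: bytes, doc_charset: str) -> str:
    """Translate a URL stored in the document's charset into the
    UTF-8 %xx transport form argued for above."""
    chars = raw.decode(doc_charset)          # bytes as stored in the page
    return pct_encode(chars.encode("utf-8")) # same characters, UTF-8 bytes

# An href for /sök stored as iso 8859-1 bytes inside an HTML page:
href = "/s\u00f6k".encode("iso-8859-1")      # b"/s\xf6k"

print(pct_encode(href))                      # /s%F6k    (charset-dependent)
print(url_to_wire(href, "iso-8859-1"))       # /s%C3%B6k (UTF-8 transport)
```

The same translation applies in reverse on the server side: a CGI
library would undo the %xx escapes and then decode the resulting bytes
as UTF-8 to recover the characters.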
So I do not think the problem is as great as you say: most CGI scripts
can work as before, since internal URLs use the same character set as
the generated document. But some of the libraries used by CGI scripts
for decoding incoming URLs will have to be changed, and many products
must learn to separate the local (native) character set from the
transport character set. Many already partially do; for example,
Netscape for the Mac uses the Macintosh character set when displaying
HTML documents, even though the transport format of the document is
iso 8859-1.

If we cannot find a way to send URLs containing any character such that
the characters can be understood and displayed in a user-friendly
manner, the web and URLs are not the future.

   Dan

--
Dan Oscarsson                    Telia Engineering AB
Email: Dan.Oscarsson@trab.se     Box 85
                                 201 20 Malmo, Sweden
Received on Tuesday, 15 April 1997 09:51:25 UTC