Message-Id: <9508101935.AA08382@mocha.bunyip.com>
Subject: Re: Globalizing URIs
To: firstname.lastname@example.org (Paul Hoffman)
Date: Thu, 10 Aug 1995 21:35:44 +0200 (MET DST)
Cc: email@example.com, firstname.lastname@example.org
In-Reply-To: <email@example.com> from "Paul Hoffman" at Aug 10, 95 09:19:36 am
From: Martin J Duerst <firstname.lastname@example.org>

An aside: I have been made aware that the uri-wg more or less no longer
exists. Still, this seems to be the only place where this problem can be
discussed. What is the correct procedure?

>I suggest that those most interested in making the stuff to the right of
>the domain name more readable simply propose a good way to do it and
>publicize it. It doesn't have to be an IETF standard: if everyone on the
>server side is doing it, the clients will soon follow. One of the original
>suggestions (adding a [XYZ] to the end of the URL) sounds OK to me, as
>does putting the [XYZ] after the first / after the domain name. There will
>be heavy technical problems with either of these for many existing servers
>of many protocols, but they can be gotten around if the market demands it.

So here are some proposals (this is a collection of what has appeared
before on the html-wg list). After some discussion and refinement, I
could very well imagine some of the proposals in this mail becoming part
of an internet-draft.

For easier presentation, the examples below pretend that uppercase
letters are not part of ASCII and have to be encoded specially.

A) Add information about the character set used somewhere in the URL

The concrete forms that have surfaced were (in their "nice" form):

A1) <[ISO-8859-1]http://xxx.yyy.zz/AA/BB/CC.html>
A2) <http:[ISO-8859-1]//xxx.yyy.zz/AA/BB/CC.html>
A3) <http://xxx.yyy.zz/[ISO-8859-1]AA/BB/CC.html>
A4) <http://xxx.yyy.zz/AA/BB/CC.html[ISO-8859-1]>
A5) <http://xxx.yyy.zz/AA/BB/CC.html;ISO-8859-1>

These cases differ in that A1), A2), and probably A4) and A5) show that
they apply (potentially) to the whole URL. A2) and especially A1) are
easily separable from the URL without confusion, and can be understood
as an addition outside the URL proper that says how to input the URL
(e.g. by selecting from a menu in the "enter URL" dialog).

A3) has a particularly easy implementation in most cases: a link from
directory "[ISO-8859-1]AA" to "AA" is created. That is, even if the
client-side tool has no idea about this scheme, and the user doesn't
realize that the [ISO-8859-1] part is not really part of the URL proper,
this will work.

Of course, there has to be some mechanism that translates between the
nice form, e.g.
   <http://xxx.yyy.zz/[ISO-8859-1]AA/BB/CC.html>
and the plain form
   <http://xxx.yyy.zz/[ISO-8859-1]%41%41/%42%42/%43%43.html>
but this is only possible with the information between [ and ].

B) Specifying a single character set

This has already surfaced in this group. Unicode/ISO 10646 is clearly
the only candidate. The problem is that raw UCS-2 Unicode is not really
suited. It would make the plain form of our example
   <http://xxx.yyy.zz/%00%41%00%41/%00%42%00%42/%00%43%00%43.html>
whereas in the fragment <aa.html> (plain), aa would denote the Han
ideogram U+6161, with heavy backwards incompatibilities.

But there are other encodings of Unicode. In particular, UTF-7 and UTF-8
have to be examined.

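To make the nice <-> plain translation under A3) concrete, here is a
minimal Python sketch. It is only an illustration under stated
assumptions: the function names are hypothetical, the charset tag is
taken to be the first [...] in the URL, and the sample path uses real
ISO-8859-1 letters instead of the "uppercase is not ASCII" pretence
used above. The last helper shows the UCS-2 problem from B): two plain
ASCII octets "aa" read as one 16-bit character.

```python
import re
from urllib.parse import quote, unquote

# Charset tag in square brackets, e.g. "[ISO-8859-1]" (assumed syntax).
TAG = re.compile(r"\[([A-Za-z0-9._-]+)\]")


def _split(url):
    """Return (charset, text before the tag, text after the tag)."""
    match = TAG.search(url)
    if match is None:
        raise ValueError("no [charset] tag in URL")
    return match.group(1), url[:match.start()], url[match.end():]


def nice_to_plain(url):
    """Percent-encode non-ASCII characters using the tagged charset."""
    charset, head, tail = _split(url)
    head = quote(head, safe="/:", encoding=charset)
    tail = quote(tail, safe="/:", encoding=charset)
    return f"{head}[{charset}]{tail}"


def plain_to_nice(url):
    """Decode %HH escapes back to characters using the tagged charset."""
    charset, head, tail = _split(url)
    head = unquote(head, encoding=charset)
    tail = unquote(tail, encoding=charset)
    return f"{head}[{charset}]{tail}"


def ucs2_reading(ascii_name):
    """Interpret ASCII octets as raw big-endian UCS-2 (even length only)."""
    return ascii_name.encode("ascii").decode("utf-16-be")


if __name__ == "__main__":
    nice = "http://xxx.yyy.zz/[ISO-8859-1]Zürich/café.html"
    plain = nice_to_plain(nice)
    print(plain)
    print(plain_to_nice(plain) == nice)
    print(hex(ord(ucs2_reading("aa"))))
```

Without the tag, plain_to_nice would have to guess the charset, which
is exactly the information the [ and ] carry.
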
UTF-8 leaves ASCII untouched, and encodes the rest of Unicode with two
or three octets that have their MSB set. Thus it can be directly
recommended. The main disadvantage is that e.g. a single Japanese
character expands to nine ASCII characters in the plain representation
(three octets with MSB set -> three triples of the form "%HH").

UTF-7 is similar to BASE64, but designed for short stretches. It uses
'+' to start a sequence of characters outside ASCII. This sequence is
terminated with '-', and a single '+' is represented as "+-". It is now
too late to use UTF-7 directly. A possibility is to combine it with the
'%' URL escape, i.e. to start with "%+" instead of '+', with no need for
"%+-". This has the advantage that the use of UTF-7 is very clearly
identified, and that there is no danger of confusion. The problem is
that clients not aware of the new "%+" syntax will produce an error. The
plain form will be shorter than for UTF-8, especially if several
successive characters are encoded together, because no additional
%HH-encoding is necessary. In the limit, it is roughly N*8/3+3 (each
Base64 character carries 6 of the 16 bits of a character, plus "%+" and
'-') instead of N*9.

These proposals, whether UTF-7 or UTF-8 or something else, are very
useful as a suggestion for where to head, but unfortunately they are
not sufficient for present-day reality, because most file systems do
not use these encodings (yet), and servers are not (yet) constructed to
do the necessary translations. So a proposal from B) has to go together
with a proposal from A) or C).

C) Server serving information on plain <-> nice mapping

An analysis of the URL schemes most in need of a solution has shown
that they coincide with the schemes that describe document-serving
protocols (ftp, http, gopher). Other schemes are less needy ("mailto"
has RFC1522, "news" explicitly only allows ASCII in group names). So
some kind of protocol extension may be a solution.

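Before going on with C), the size comparison made under B) can be
checked with a short Python sketch. This is an illustration only: the
"%+" form is merely the proposal from this mail, not an existing
standard, and the function names are hypothetical. The body is plain
Base64 over big-endian 16-bit code units with the padding removed, as
in UTF-7; each Base64 character carries 6 bits, so a 16-bit character
costs about 8/3 characters.

```python
import base64
from urllib.parse import quote


def utf8_plain(text):
    """Plain form under UTF-8: every non-ASCII octet becomes %HH."""
    return quote(text, safe="", encoding="utf-8")


def utf7_escape_plain(text):
    """Hypothetical "%+" form: unpadded Base64 of UTF-16BE, closed by "-".

    Note: Base64 output may contain "/" and "+"; this sketch ignores
    the question of how those would be handled inside a URL.
    """
    raw = text.encode("utf-16-be")
    body = base64.b64encode(raw).decode("ascii").rstrip("=")
    return "%+" + body + "-"


if __name__ == "__main__":
    for sample in ("日", "日本語"):
        print(sample, len(utf8_plain(sample)), len(utf7_escape_plain(sample)))
```

For a single character the fixed cost of "%+" and "-" weighs heavily;
the gain shows only on longer runs of non-ASCII characters.
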
If you have the plain form of a URL and want to find its nice form, or
vice versa, just ask the server (which is defined by the scheme and the
host info, which is itself ASCII only) about the encoding
(MIME-registered "charset" parameter) that defines the mapping between
these two URL forms.

It looks like we need difficult-to-introduce protocol extensions for
this to work, but there is a simpler solution. We define a single
filename, for example "filename.enc" (note the 8.3 scheme). The only
thing the server administrator has to do is create such a file in the
central directory, consisting of a single line such as
   ISO-8859-1
or whatever is appropriate. Only the client has to implement additional
functionality, which it needs anyway.

Besides the small risk of a filename conflict, which can be further
reduced by choosing a different name, the main problem is that there
are cases where different resource names on the same server use
different encodings (e.g. because the same server contains UNIX files
as well as backups for a Mac or so). For this, the file can be changed
to a multiline format, or each directory can contain a file of that
name (with an entry in the parent directory that indicates that the
subdirectory should be examined).

Also, proposals A) and C) do not cover the possibility that different
parts of a URL use different character encodings, but this would cause
big confusion even locally.

To ease the transition between the use of e.g. C) and the use of a
solution from B), the convention could be introduced that servers that
use B) change the conventional first component of the domain name from
"www" or "ftp" to e.g. "utf7-www".

This has been a long list of proposals, and I would greatly appreciate
any comments regarding their feasibility, not so much in general as on
specific points such as errors and oversights.

Many thanks in advance,
Martin.