Message-Id: <9508141502.AA06947@mocha.bunyip.com>
Subject: Re: Globalizing URIs
To: firstname.lastname@example.org
Date: Mon, 14 Aug 1995 17:02:05 +0200 (MET DST)
From: Martin J Duerst <email@example.com>

Having been away over the weekend and unable to answer immediately,
I would first like to thank everybody for all the comments on the
problem of globalizing URIs.

This mail consists of three parts: a collection of supportive
comments, an analysis of the critical comments, and some proposals
for further action.

First, I want to express my thanks for the support that quite a few
of the contributors have given. Although some of the quotations and
summaries below might be somewhat out of context, they clearly show
that the view "URLs are not supposed to be meaningful, so there is no
problem" is not justified:

Paul Hoffman:
   No one is paying attention to the fact that URLs should not be
   meaningful (with supportive data).

Roy T. Fielding:
   Meaningful human-readable URIs are easier to correct (by humans)
   than long number sequences with error detection. Error correction
   for long number sequences is too expensive.

Stu Weibel:
   Numbers are not suited to human processing.

Jim Conklin:
   URLs are accepted and used because they can be remembered, and
   they can be remembered because they intentionally contain
   semantics.

Roy T. Fielding:
   Transcribability (brought by meaningful URLs) is responsible for
   the success of the WWW.

Jon Knight, Roy T. Fielding:
   URLs are here to stay. URNs only for resources with a long
   "shelf-life".

Karen R. Sollins:
   I am very much in favor of human friendly naming schemes, in fact
   lots of them.

John Curran:
   Internet should provide equal functionality to users of all
   languages.

The problems that remain before we can find a good solution seem to
be the following:

1) Using meaning in a name increases the chance that the name will
   become invalid.
   (Karen Sollins, Keith Moore, Paul Hoffman)
2) There should be only one form of URL.
   (Keith Moore, Paul Hoffman)
3) There should be something human-readable, but it should/might be
   on another layer than URIs.
   (Karen Sollins, John Curran)
4) Not the URL scheme itself, but the IETF URLs (FTP, HTTP,...) are
   responsible for introducing semantics.
   (Jon Knight, Keith Moore)

I would like to address 1) first, because here I still have my
doubts: What is the difference between a disappearing telephone
number (no semantics) and a disappearing postal address (semantics)?
What kind of redirecting mechanisms are there that can be applied to
the former, but not to the latter? What is the use of trying to
achieve "eternal life" for references when the referred resources
themselves may have a short life?

Isn't the problem rather that semantic and location information are
mixed? For example, a hypothetical URL

   http://www.ifi.unizh.ch/staff/Martin_Duerst.html

mixes two kinds of information. One is the information that this is
a document about me, being a staff member at the "Ifi" (Institut
fuer Informatik, German for Dept. of Computer Science) at the
University of Zurich (ZH being the abbreviation for Zurich used on
number plates of cars) in Switzerland (CH again being the
abbreviation of Switzerland used on cars, standing for "Confoederatio
Helvetica", the Latin name of Switzerland). The other is the
information that this is a file with some name in some directory on
some host.

As far as what appears on namecards, in newspapers, and otherwise in
human-readable form is concerned, it is preferably something
semantically meaningful, with the "location" part of it eliminated.
It is the location part, and not the semantic part, that is affected
when the resource is renamed or relocated. If the semantics indeed
change, e.g. if I move to another university or company, then there
is no need for any such URL anymore, other than historical. Those
who argue that I might want to have the same URL pointing to my new
location are right insofar as for that purpose, there was too much
semantics in my example.
But this doesn't show that no semantics is best, it just shows that
semantically meaningful names, like everything else, have to be
chosen with great care.

The second point above (there should only be one form of URL) is
quite understandable. But first, it should be noted that even at
present, there is more than one representation. The same URL is
represented differently on EBCDIC systems than on ASCII systems,
although it looks the same on both systems. Nobody has objected to
this, but it is very similar in structure to what I described
happening to a "nice" form of a URL.

Now, the problem is that the second point and the third point don't
fit together very smoothly. We can't have only one representation of
something if it should be both non-semantic and machine-oriented,
and at the same time nicely human-readable. Of course, the two
things don't necessarily have to be on the same implementation
level, but I guess they had better be on the same "definition
level". What I mean by this is that, as we have seen with URLs, just
providing one thing and not the other will not work as intended.

For the sake of naming, let's change the previously used terms "raw
URI" and "nice URI" to URI and HRI (the latter standing for Human
Readable Identifier). I have no problem with changing the
terminology in this way, but I would very much oppose statements
such as:

   But, we are talking about something below that here, at that
   lower level, please lets distinguish who OUR clients are. They
   should be programs, not people. (Karen Sollins)

unless there actually is a group that is caring about HRIs (of which
I am not aware). Otherwise, the same thing will happen as with URLs,
or a new proposal, with numbers only, will just not be used very
much (not on namecards and such, anyway).

As has been pointed out, some people think that the individual URL
schemes rather than the general design spoiled the nice intentions
of "no semantics".
I don't exactly agree (as these schemes just used the possibility
for "spoiling" built into the general design), but I think it is a
point that could help us out. The proposals I have brought forward
some time ago under the letters B) (Unicode) and C) (server
additions) point in the same direction.

So we could say that we define a new HRI scheme with the identifier
"HFTP", the property that characters from many languages may be
used, and the implementation either along B), i.e. specifying some
form of Unicode/ISO 10646 as the encoding to be used on the
protocol, or along C), i.e. some (maybe rather trivial) server
extension that allows the client to determine the encoding that
should be used in protocol requests. This proposal is along the
lines of proposing something such as

   ftp://utf8-ftp.xxx.yyy.zz

just moving the distinction to another place.

Note that in this case, a canonical internal or semi-external
representation (such as the %HH escapes) would not be defined. After
all, it is an HRI, and therefore human readability counts. Although
some people may doubt it, there is no real problem of ambiguity
here. Greeks can enter Greek letters with as few errors as Americans
can enter Latin (ASCII) letters, and this applies to other places
around the world, too.

There could of course be services implementing these new HRI schemes
on top of existing schemes, with proxies and such.

This would be a start towards a distinction between HRIs and URIs, a
step in the right direction without doubt, and a solution to some of
the current problems. Critics might argue that this is too small a
step, and that it is just perpetuating problems that should be
solved more generally, but in any case, there does not seem to be
much belief in a better solution that would fully replace the
existing one.

Regards to all,
Martin.
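
A minimal sketch of two representation points made in this mail: the
same URL having different byte representations on ASCII and EBCDIC
systems, and the escaped form of a non-ASCII name depending on which
encoding is assumed (the question that B) and C) try to settle). The
choice of Python, the use of code page "cp037" as a stand-in for
"EBCDIC", the Latin-1 alternative, and the path spelled with "ü" are
all assumptions made purely for illustration; none of them is part of
any proposal.

```python
from urllib.parse import quote, unquote

# The hypothetical URL used as an example in this mail.
url = "http://www.ifi.unizh.ch/staff/Martin_Duerst.html"

# Same characters, two local representations. "cp037" is just one EBCDIC
# code page, picked here as an assumption; other EBCDIC variants exist.
ascii_form = url.encode("ascii")
ebcdic_form = url.encode("cp037")

print(ascii_form != ebcdic_form)            # the byte sequences differ
print(ebcdic_form.decode("cp037") == url)   # yet both denote the same URL

# Hypothetical human-readable path containing a non-ASCII letter.
hri_path = "/staff/Martin_Dürst.html"

# Without an agreed encoding, the escaped form depends on the charset chosen.
utf8_escaped = quote(hri_path, encoding="utf-8")
latin1_escaped = quote(hri_path, encoding="latin-1")

print(utf8_escaped)
print(latin1_escaped)

# Decoding only round-trips if both sides assume the same encoding.
print(unquote(utf8_escaped, encoding="utf-8") == hri_path)
print(unquote(latin1_escaped, encoding="latin-1") == hri_path)
```

The two escaped strings differ even though a human would type the
same name, which is why some agreement on the encoding, whether fixed
in the scheme or negotiated with the server, is needed before such
names can be exchanged reliably.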