Message-Id: <9508161054.AA02294@mocha.bunyip.com>
Subject: Re: Globalizing URIs
To: firstname.lastname@example.org (Keith Moore)
Date: Wed, 16 Aug 1995 12:54:35 +0200 (MET DST)
Cc: email@example.com, firstname.lastname@example.org, email@example.com
In-Reply-To: <199508142110.RAA19670@wilma.cs.utk.edu> from "Keith Moore" at Aug 14, 95 05:09:58 pm
From: Martin J Duerst <firstname.lastname@example.org>

Keith Moore responded to my posting:

>> First, I want to thank you for the support that quite a few of the
>> contributors have given. Although some of the quotations
>> and summaries below might be somewhat out of context, it
>> clearly shows that the view "URLs are not supposed to be
>> meaningful, so there is no problem" is not justified:
>
>I believe you've misstated several of these arguments, but I'll let
>their authors respond.

I am happy to tell you that I have not received any such messages up
to now.

>> The problems that remain before we can find a good solution
>> seem to be the following:
>>
>> 4) Not the URL scheme itself, but the IETF URLs (FTP, HTTP,...)
>>    are responsible for introducing semantics.
>>    (Jon Knight, Keith Moore)
>
>This isn't what I meant. I meant that the protocols and file servers
>used by these URLs enforce the notion that a URL is tightly coupled to
>the filename, and filenames tend to be meaningful.

Sorry for being imprecise. In any case, the bottom line of this
argument is "they spoiled it, so let them fix it", and I tried to
address that in several of my proposals. Still, since they all spoiled
it in the same way, and for the same reasons, I think it would be
useful if we could develop a common suggestion for a way out.

>But you completely left out the hardest problems to solve:

I left out these problems because I thought I had addressed them in
enough detail in previous posts to show that they are not that hard
(if they are problems at all).
>5) Everybody knows how to type the ASCII letters and digits (some
>better than others), but otherwise, most people do not know how to
>type characters that aren't used by some language that they're
>familiar with.

I do not assume that anybody writing an English document is giving it
a Japanese document name (and thus, for some protocols, an HRI
containing some Japanese). In some posts I assumed this implicitly,
and in others I said it explicitly. Let me go into a little more
detail here. I assume that anybody in Japan writing an English
document for a worldwide public will, with just a little common sense,
and at least on his/her second try if not on the first, understand
that the document should have an HRI that can be typed by somebody who
understands only English. Of course, if something like a standard
should emerge from this discussion, I will take care that the above
assumption is mentioned explicitly. But even without that, there is a
much bigger chance that users comply with this than that they comply
with such wishful thinking as "URLs shouldn't contain semantics",
however explicitly that is stated in the RFCs.

Having the language of the document and the language of its
name/description coincide indeed increases the self-descriptive
semantics, but this is something human beings have little trouble
with.

The above discussion excluded some special cases, such as the case
that I give a Japanese name to an English document because I want to
withhold it from people who are not able to type Japanese, or the case
that a user not speaking Japanese wants to view a Japanese document
just for the fun of it. (Haven't seen enough $ today? Go and have a
look at a Japanese document. In many cases, when it is displayed with
a tool that knows nothing about Japanese encodings, you will see many
'$' on your screen.)
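The '$' characters are no accident: ISO-2022-JP, the encoding
customarily used for Japanese mail and news, switches into JIS X 0208
with the escape sequence ESC $ B, and a charset-ignorant tool drops
the unprintable ESC and shows the rest as ASCII. A minimal sketch with
modern tooling (Python's `iso2022_jp` codec; the language itself is of
course an anachronism here):

```python
# Encode a short Japanese string ("nihongo") in ISO-2022-JP.
data = "日本語".encode("iso2022_jp")
print(data)  # starts with b'\x1b$B' -- ESC $ B switches to JIS X 0208

# A tool that knows nothing about Japanese encodings drops the
# unprintable ESC bytes and displays the remainder as ASCII,
# '$' included.
visible = data.replace(b"\x1b", b"").decode("ascii")
print(visible)
```

The round trip back through an ISO-2022-JP-aware tool is of course
lossless; only the naive display mangles it.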
Somewhat more serious examples include a beginner's course in Japanese
for foreigners (where at least the names of the first few lessons
should be in Japanese), and the case where you, not being able to type
Japanese, are asked to fetch and copy a document with a Japanese name
for somebody who will be able to read it. But I guess we can assume
that everybody is connected to the web anyway, can't we?

>6) URLs expressed in character sets besides ASCII are more vulnerable
>to translation to other character sets (say ISO 8859/1 to ISO 646-XX)
>which make the URL invalid. This translation WILL occur as the result
>of URLs being mailed around, copied from one application to another,
>or being printed on paper in one environment and typed in again in a
>different environment that uses a different charset.

You have mentioned this problem before, and I have explained in
previous mails why I think it is not a real problem, but I will repeat
my arguments here: There are vulnerabilities in all character sets
(e.g. the ASCII/EBCDIC problem for present URLs), though the
vulnerability may indeed be said to be lower for ASCII than for things
that go beyond it. Nevertheless, these vulnerabilities are by now very
well understood, and satisfactory solutions are available and
installed. With respect to ISO 8859/1, an HTML web document will
display Western European accents correctly on hardware such as the
Mac, the PC, and the NeXT, even though these use different native
encodings of accented characters. Also, in Japan, news and email work
across a wide variety of platforms, although three different encodings
of the same character repertoire are in use on those platforms. Apart
from issues such as those we have with '1', 'I', and 'l', of which the
Japanese are just as aware as we are, there are no problems left today
in having a Japanese text entered into one system, transmitted to
another, copied from the screen to paper by hand or printed out, and
entered again.
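To illustrate the "different native encodings" point, here is a sketch
with modern tooling (the codec names are those of today's Python
libraries, an assumption on my part, not of 1995 software): the same
character has different byte values on each platform, yet the
conversions between them are well defined and lossless.

```python
# The same accented character, three native byte values: ISO 8859/1,
# the classic Mac character set, and the DOS PC code page.
for codec in ("latin_1", "mac_roman", "cp437"):
    print(codec, "é".encode(codec))

# Likewise, the same Japanese character in two of the encodings of
# the same repertoire that coexist on Japanese platforms.
for codec in ("shift_jis", "euc_jp"):
    print(codec, "日".encode(codec))

# Interoperability rests on well-understood, lossless conversions:
assert "é".encode("mac_roman").decode("mac_roman") == "é"
assert "日".encode("shift_jis").decode("shift_jis") == "日"
```

The bytes differ in every case, but as long as the conversion step is
applied, the text (and any name embedded in it) survives intact.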
Otherwise, Japanese information processing would indeed be in a very
bad state. The same applies to other areas. So interoperability
remains assured among those people and environments where it is really
necessary and useful. And we don't have to care about the rest; an
abstract "interoperability" doesn't help anyone who will never be able
to read the documents concerned.

>> As far as what appears on namecards, in newspapers, and otherwise
>> in human-readable form, it is preferably something semantically
>> meaningful, with the "location" part of it eliminated.
>> It is the location part, and not the semantic part, that is
>> affected when the resource is renamed or relocated.
>
>You want resource names that aren't tied to either location or
>meaning. The location can change over time, and so can the meaning
>that we assign to something. You need two levels of indirection:
>human-meaningful names -> resource name -> locations.

The meaning, in the sense of "what does that document mean to me", of
course might change, e.g. from a nice curiosity to something
extensively studied to something absolutely boring. But the meaning in
a more abstract sense, i.e. that the document is about a certain
person, concept, product, or company, shouldn't change; if it does,
the resource name really should be changed too, or the reader will be
very surprised indeed.

I don't object to having two levels of indirection, not at all. But as
I think experience with URLs has shown, we can't just care for the
lower level and assume that somebody else will care for the upper
level. What we as humans want to see and use is human-meaningful
names, and we will create them on the lower level if the upper level
is missing.

>> If the semantics indeed change, e.g. if I move to another university
>> or company, then there is no need for any such URL anymore,
>> other than historical.
>
>No, that's not true.
>If people want to find information about *you*,
>as opposed to where you work, they need a document identifier that
>continues to point to that information about you even after you change
>jobs. So the identifier really needs to be independent of where you
>work.

I agree. But there is always a higher level of semantics that you can
construct. For Western culture this might be a little far-fetched, but
assume that if you died, you would like somebody to find information
about you in your next life. So of course I have to decide whether I
want a description of *me* or of me at my present workplace, and I
might have difficulty assuring a long life for the first with present
technology. But this doesn't change the fact that a URL such as
http://www.ifi.unizh.ch/staff/Martin_Duerst.html, when viewed
semantically, in fact expresses that this is about me as a staff
member somewhere. If somebody expects it to continue to exist while
knowing that I am no longer working there, (s)he is making assumptions
that the form of the URL was designed to discourage.

>Fortunately, (a) EBCDIC systems are a vanishing breed, and (b) there is
>(by design) only one possible translation between EBCDIC and ASCII for
>the characters used in URLs. The reason that mailing a URL from an
>ASCII world to an EBCDIC world doesn't cause problems is that URLs
>WERE DEFINED SO THAT IT WOULD NOT BE A PROBLEM. You're proposing to
>undo that careful bit of engineering, not only for ASCII-EBCDIC but
>for every charset in use in the world.

The many local encodings besides Unicode/ISO 10646 will most probably
be a vanishing breed in the future. What I was trying to do with my
proposals was precisely to define HRIs (or whatever you call them) so
that there would not be a problem, or no more problems than with
ASCII/EBCDIC, in the sense that wherever text is translated or
transcribed correctly, the URLs go with it.
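The ASCII/EBCDIC point can be made concrete. A sketch with modern
tooling (cp037 is one common EBCDIC code page, chosen here purely for
illustration): the bytes change completely in translation, but the
characters URLs are restricted to translate unambiguously, so the URL
itself survives.

```python
# Round-trip a URL through an EBCDIC code page (cp037, an
# illustrative choice). The restricted URL character repertoire was
# chosen so that such translations are unambiguous.
url = "http://foo.bar/zot"
ebcdic = url.encode("cp037")
print(ebcdic)                          # very different bytes...
assert ebcdic != url.encode("ascii")
assert ebcdic.decode("cp037") == url   # ...but the text is intact
```

The goal of the HRI proposals is exactly this property, extended to
any pair of environments whose text conversion is done correctly.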
If you see any specific reasons why any of the schemes I have proposed
will fail in this respect (and not just the general concerns "they are
more volatile" and "it's a difficult business"), I would be very glad
to hear them.

>As an alternative to needing a directory linking human-readable
>identifiers to URLs (since we're nowhere close to having one
>deployed), I propose that there be a standard form of reference to a
>URL that also includes a document title (or more than one) in a
>non-ASCII charset. That is, instead of having multiple encodings for
>the location of a resource, embellish them with human-readable names
>for the resource. These could get passed around in html documents,
>preserved in hotlists, and printed on business cards. For instance:
>
><A HREF="http://foo.bar/zot" TITLE="[unicode]some-random-string">
>
>Of course, if the user types in the URL by hand, he still ends up
>typing it in ASCII. But which is worse, having to type in the URL
>using unfamiliar characters, or having a situation where most people
>in the world cannot access your document because the URL is in a
>charset that they can't use at all?

Again, as said above, I don't care about "most people in the world"; I
care about "most people who will be able to read and make use of the
document". Otherwise, your proposal is certainly worth considering,
but it has two problems. First, it is lengthy, which is especially
inconvenient for business cards and the like; most of what I have
proposed is considerably shorter. Second, the "[unicode]" prefix is
not really needed. If the "some-random-string" is in the same encoding
as the document as a whole, there is no need to identify the encoding.
Also, the encoding is not relevant because the TITLE is not used when
accessing the document over a protocol (remember that in the detailed
explanation for your example of what path a URL might travel, the
information in  was only used in the last step, when making a request
over the protocol).
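For comparison with the TITLE proposal, here is a sketch, with modern
tooling and purely as an illustration (the choice of UTF-8 and of
Python's `urllib.parse` is mine, not part of any of the schemes as
posted), of how a non-ASCII name can be carried inside the present
ASCII URL syntax by %-escaping its bytes, rather than alongside it:

```python
from urllib.parse import quote, unquote

# Escape the UTF-8 bytes of a Japanese name with the existing %HH
# mechanism of URLs (UTF-8 is an illustrative choice of encoding).
name = "日本語"
escaped = quote(name)   # pure ASCII, typable and mailable anywhere
print(escaped)          # %E6%97%A5%E6%9C%AC%E8%AA%9E

# The original name is recoverable wherever the convention is known.
assert unquote(escaped) == name
```

Note that the escaped form is longer than the name itself, which
cuts both ways in the business-card argument above.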
There are of course cases where the characters in "some-random-string"
will not be encodable in the current encoding of the containing
document, but for these cases, SGML/HTML has its own mechanisms.

Regards,	Martin.