- From: Keith Moore <moore@cs.utk.edu>
- Date: Wed, 16 Aug 1995 18:32:42 -0400
- To: Martin J Duerst <mduerst@ifi.unizh.ch>
- Cc: moore@cs.utk.edu (Keith Moore), uri@bunyip.com
> >> The problems that remain before we can find a good solution
> >> seem to be the following:
> >>
> >> 4) Not the URL scheme itself, but the IETF URLs (FTP, HTTP,...)
> >>    are responsible for introducing semantics.
> >>    (Jon Knight, Keith Moore)
> >
> >This isn't what I meant.  I meant that the protocols and file servers
> >used by these URLs enforce the notion that a URL is tightly coupled to
> >the filename, and filenames tend to be meaningful.
>
> Sorry for being imprecise. In any case, the bottom line of this argument
> is "they spoiled it, so let them fix it". And I tried to address that in
> several of my proposals. Still, as they all spoiled it in the same way,
> and because of the same reasons, I guess it would be useful if
> we could develop a common suggestion for a way out.

That's fine.  HTTP servers, at least, are fairly free to work out
whatever mapping they wish between URLs and filenames.  (It's less
clear whether FTP servers can do this.)

> >But you completely left out the hardest problems to solve:
>
> I left out these problems because I thought I had addressed them
> in enough detail in previous posts so as to show that they are not
> that hard (if they are problems at all).

Sorry, I didn't respond to your previous post where you attempted to
refute my arguments.  I'll respond now.

> >5) Everybody knows how to type the ASCII letters and digits (some
> >better than others), but otherwise, most people do not know how to
> >type characters that aren't used by some language that they're
> >familiar with.
>
> I do not assume that anybody writing an English document is giving
> it a Japanese document name (and thus, for some protocols, an
> HRI containing some Japanese). In some posts, I assumed this
> implicitly, and in others I have said it explicitly. And I am going
> a little further into details here.
> I assume that anybody in Japan writing an English document for
> a worldwide public will, with just a little bit of common sense,
> and at least on his/her second try if not on the first, understand
> that that document should have an HRI that can be typed by somebody
> that understands only English.

If you want an HRI that is only available to others in the same
country, this is fine.  I'm thinking in terms of worldwide
interoperability.

My understanding is that there is a bit more uniformity in Japan about
which character sets are in use than there is in other parts of the
world.  But just taking Japanese as an example, what happens if a
Japanese-fluent person in the US wants to read a Japanese document
that has a Japanese HRL?  He can read Japanese but perhaps doesn't
have a Japanese keyboard.  How is he going to type in the HRL?

It's my understanding that there are many countries in which there is
little or no uniformity from one site to another (or even sometimes
within a particular site) as to what character sets are used.  There
are also groups of countries that share a language, but have different
conventions for what character set to use within each country.

> >6) URLs expressed in character sets besides ASCII are more vulnerable
> >to translation to other character sets (say ISO 8859/1 to ISO 646-XX)
> >which make the URL invalid.  This translation WILL occur as the result
> >of URLs being mailed around, copied from one application to another,
> >or being printed on paper in one environment and typed in again in a
> >different environment that uses a different charset.
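To make concrete the kind of damage I mean, here is a small Python
sketch (purely illustrative; the file name is invented, and ISO 8859/1,
the Mac character set, and PC code page 437 just stand in for the
environments that might be involved).  The same byte sequence, read
under different charsets, yields three different URLs, only one of
which is the intended one:

    # One byte sequence, three readings: what a non-ASCII URL can turn
    # into when it crosses charset boundaries.  The file name is made up.
    raw = b"http://foo.bar/r\xe9sum\xe9.html"   # 0xE9 is e-acute in ISO 8859/1

    for charset in ("latin-1", "mac_roman", "cp437"):
        print("%-9s -> %s" % (charset, raw.decode(charset)))

    # latin-1   -> http://foo.bar/résumé.html
    # mac_roman -> http://foo.bar/rÈsumÈ.html
    # cp437     -> http://foo.bar/rΘsumΘ.html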
>
> You have mentioned this problem before, and I have explained why
> I think it is not a real problem in previous mails, but I will repeat
> my arguments here:
>
> There are vulnerabilities in all character sets (e.g. the ASCII/EBCDIC
> problem for present URLs), but the vulnerability may indeed be said
> to be lower for ASCII than for things that go beyond.
> Nevertheless, these vulnerabilities are now very well understood,
> and satisfactory solutions are available and installed. With respect
> to ISO 8859/1, an HTML web document will display Western European
> accents correctly on hardware such as Mac, PC, and NeXT, even though
> they use different native encodings of accented characters.

Yes, they'll display the same, but does that mean that the users know
how to type them in?  What happens if you copy a section of text from
the web browser (which understands 8859/1) to another application that
understands a Mac charset or a PC code page?

> Also, in Japan, News and Email are working among a wide variety
> of platforms although on various platforms, 3 different encodings
> of the same character repertoire are in use. Apart from issues
> such as those we have with '1', 'I', and 'l', which Japanese
> are aware of as well as we are aware of them, there are no more
> problems today having a Japanese text entered into a system,
> transmitted to another, copied to paper from the screen by hand
> or printed out, and entered again. Otherwise, Japanese information
> processing would indeed be in a very bad state. The same applies
> for other areas.

I'm glad to hear that things are going so well in Japan, but I'm told
that things are not so nice in other areas.

> I don't object to having two levels of indirection, not at all. But
> as I think experience with URLs has shown, we can't just care for
> the lower level and assume that somebody else will care for the
> upper level. What we as humans want to see and use is
> human-meaningful names, and we will create them on the lower level
> if the upper level is missing.

I agree with that statement.  It's not sufficient to simply say "this
must be solved by the upper layer".  We must BUILD the upper layers.

On the other hand, it will be up to Japanese speakers to build upper
layers (search services that map titles to URLs or URNs) that
understand Japanese.  I don't see any reason why these can't be built
now to link Japanese titles to URLs, and modified later to link
Japanese titles to URNs.

> >> If the semantics indeed change, e.g. if I move to another university
> >> or company, then there is no need for any such URL anymore,
> >> other than historical.
> >
> >No, that's not true.  If people want to find information about *you*,
> >as opposed to where you work, they need a document identifier that
> >continues to point to that information about you even after you change
> >jobs.  So the identifier really needs to be independent of where you
> >work.
>
> I agree. But there is always a higher level of semantics that you can
> construct. For western culture this might be a little far-fetched,
> but assume if you died, you would like somebody to find information
> about you in your next life.

Sure.  Or your descendants, or what other people had done with your
creative work.
> So of course I have to decide whether
> I want to have a description of *me* or of me at the present
> workplace, and I might have difficulties of assuring a long life
> for the first case with the present technology, but this doesn't
> change the fact that a URL such as
> http://www.ifi.unizh.ch/staff/Martin_Duerst.html,
> when viewed semantically, in fact expresses that this is about
> me as a staff member somewhere.

Right.  Depending on your point of view, this is either a bug or a
feature.

> If somebody expects this
> to continue to exist while knowing that I am no longer working
> there, (s)he makes assumptions that the form of the URL
> was designed to discourage.

Maybe not "designed to discourage" so much as simply "wasn't designed
to encourage".

> >Fortunately, (a) EBCDIC systems are a vanishing breed, and (b) there is
> >(by design) only one possible translation between EBCDIC and ASCII for
> >the characters used in URLs.  The reason that mailing a URL from an
> >ASCII world to an EBCDIC world doesn't cause problems is that URLs
> >WERE DEFINED SO THAT IT WOULD NOT BE A PROBLEM.  You're proposing to
> >undo that careful bit of engineering, not only for ASCII-EBCDIC but
> >for every charset in use in the world.
>
> The many local encodings besides Unicode/ISO 10646 will
> most probably be a vanishing breed in the future.

The jury is still out on Unicode.  We've got at least one of almost
every major computer system here, but I don't know of a single one of
them that supports Unicode in its current release.  (Pre-released
products don't count.)  There were also people who said for many years
that OSI protocols would take over the world.  I'll believe it when I
see it.

> What I was trying with my proposals was just to define HRIs (or
> whatever you call it) so that there would not be a problem,
> or not more problems than with ASCII/EBCDIC, in the sense
> that wherever text is translated or transcribed correctly, the
> URLs go with it.
> If you see any specific problems of why any of the schemes I have
> proposed will fail in this respect (and not just the general concerns
> "they are more volatile" and "it's a difficult business"), I would
> be very glad to hear that.

I've mentioned the specific problems I see, and I think they're pretty
serious.  If you assume that HRLs are transcribable, or you limit their
domain of applicability to environments where they are transcribable,
you might be able to address the rest of the concerns.

> >As an alternative to needing a directory linking human-readable
> >identifiers to URLs (since we're nowhere close to having one
> >deployed), I propose that there be a standard form of reference to a
> >URL that also includes a document title (or more than one) in a
> >non-ASCII charset.  That is, instead of having multiple encodings for
> >the location of a resource, embellish them with human-readable names
> >for the resource.  These could get passed around in html documents,
> >preserved in hotlists, and printed on business cards.  For instance:
> >
> ><A HREF="http://foo.bar/zot" TITLE="[unicode]some-random-string">
> >
> >Of course, if the user types in the URL by hand, he still ends up
> >typing it in ASCII.  But which is worse, having to type in the URL
> >using unfamiliar characters, or having a situation where most people
> >in the world cannot access your document because the URL is in a
> >charset that they can't use at all?
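As a side note on what a hotlist or user agent might do with such an
embellished reference, here is a rough sketch (my own, not part of the
proposal itself; the regular expression, function name, and assumed
HREF-before-TITLE attribute order are all just illustrative).  It keeps
the ASCII URL for retrieval and treats the TITLE purely as a display
label:

    import re

    # Split an embellished reference into its machine-usable URL and its
    # human-readable title.  Assumes HREF comes before TITLE; a real
    # parser would be more forgiving.
    ANCHOR = re.compile(r'<A\s+HREF="([^"]*)"\s+TITLE="([^"]*)"', re.IGNORECASE)

    def parse_reference(fragment):
        m = ANCHOR.search(fragment)
        if m is None:
            return None
        url, title = m.group(1), m.group(2)
        # The "[charset]" prefix, if present, says how to read the title;
        # the URL itself stays plain ASCII and is what actually gets fetched.
        return url, title

    print(parse_reference('<A HREF="http://foo.bar/zot" TITLE="[unicode]some-random-string">'))
    # -> ('http://foo.bar/zot', '[unicode]some-random-string')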
>
> Again, as said above, I don't care about "most people in the world",
> I care about "most people that will be able to read and make use
> of the document".

Okay, but be careful about assuming that things are as nice everywhere
as they are in Japan, and make sure you think about speakers of a
particular language living outside of the nice environment for that
language -- sometimes there are large numbers of these.

> Otherwise, your proposal is certainly worth considering, but it has
> two problems:
> First, it is lengthy, which is especially inconvenient for business
> cards and such. Most of what I have proposed is considerably shorter.

This can be dealt with.  If the HRL isn't based on an actual file name,
and the HTTP server performs a mapping, the machine-readable portion of
the HRL can be any name which is sufficiently long to make it
unambiguous.  So it could simply be a short string of letters or
digits.  If there were a million different files on a server, you could
name them all using only six decimal digits.  Six alphanumeric
characters designed to be transcribable (digits 0-9 plus upper case
letters A-Z excepting I, O, and Q) would give 33**6 different names --
practically infinite space.

So when comparing an HRL that used *only* human-readable characters
with one that used human-readable characters for display and an ASCII
subset for actually accessing the object, the additional overhead would
appear to be only a few characters: the six characters for
machine-readable use, and the characters used to separate the two
parts.  Format them like so:

	http://dom.ain/[document-id]/human-readable-string

and just teach the http servers to ignore anything following the
document-id (enclosed in brackets or some such as a hint to the http
server).  Now all you need is tools to check in new documents, assign
unique document-ids to them, and link or copy the documents into the
right directory on the server.  (There is a rough sketch of this
mapping in the postscript below.)

One nice thing about this notation is that even if it's mis-transcribed
(in the human-readable-string), it still works.  Though there's still
some effort required to teach people that they can ignore the
human-readable-string if they want to.

Being realistic, this won't help people to remember a particular URL.
But it will help people to be able to look at a URL and recognize what
it is.  It will also allow people to mis-label URLs: they can say "this
is a picture of baby ducks" when it is really kiddie porn.  But I
suppose they can do this already in English, so why not extend this
capability to the rest of the world?

> Second, the "[unicode]" prefix is not exactly needed. If the
> "some-random-string" is in the same encoding as the document as a
> whole, then there is no need to explain the encoding.

Maybe, maybe not.  If you want the things to survive being emailed
around, I'd still recommend that you encode things in ASCII and include
a charset name (similar to RFC 1522).

Keith
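P.S.  Here is the rough sketch of the server-side mapping mentioned
above (purely illustrative; the id, the id-to-file table, and the
file names are all invented).  The bracketed document-id is what gets
looked up, and whatever follows it is ignored, so a garbled
human-readable-string still resolves to the right document:

    import random
    import re

    # Alphabet from above: digits 0-9 plus A-Z without I, O, and Q --
    # 33 characters, so 33**6 = 1,291,467,969 possible six-character names.
    ALPHABET = "0123456789ABCDEFGHJKLMNPRSTUVWXYZ"

    def new_document_id(length=6):
        # What a check-in tool might assign to a newly registered document.
        return "".join(random.choice(ALPHABET) for _ in range(length))

    # Hypothetical id-to-file table maintained by the check-in tool.
    DOCUMENTS = {"7G4K2M": "/docs/annual-report-1995.html"}

    def resolve(path):
        # Map a request like /[7G4K2M]/whatever-label to a file;
        # everything after the bracketed document-id is ignored.
        m = re.match(r"/\[([0-9A-Z]+)\]", path)
        if m is None:
            return None                   # not an HRL-style path
        return DOCUMENTS.get(m.group(1))

    print(resolve("/[7G4K2M]/jahresbericht-1995"))   # -> /docs/annual-report-1995.html
    print(resolve("/[7G4K2M]/any-garbled-label"))    # -> same file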