- From: Keith Moore <moore@cs.utk.edu>
- Date: Wed, 09 Aug 1995 15:05:09 -0400
- To: Martin J Duerst <mduerst@ifi.unizh.ch>
- Cc: moore@cs.utk.edu (Keith Moore), FisherM@is3.indy.tce.com, uri@bunyip.com
> >I understand why people think it's a good idea, but I think it's
> >not possible in general to solve this problem. There is a fundamental
> >conflict between the desire to be able to input URIs from a keyboard,
> >and the desire to be able to make URIs be "meaningful" to humans.
> >If you try to accommodate more character sets, you compromise the former.
>
> I do not question the need for (a form) of an URI that consists only of
> the most limited character set, for input with simple keyboards, etc.
> But I think we need more.

My point is that by trying to "solve" this problem, you actually end up
making things far worse than they were. The web already has enough
problems with links not working due to files that have moved, been
renamed, hosts that have been renamed, etc., without having to deal
with URLs that were misspelled because of character set problems.

The only thing that you gain from multilingual URLs is that they look
nice on a screen or on paper or on a business card. And this comes at
a tremendous cost, because when someone tries to type in what they
think they see on a screen or business card, it will frequently
translate into some other sequence of octets that gets presented to the
ftp or http server. This can happen for a wide variety of reasons:
there are dozens of different charsets in use, charset translation
tables aren't invertible, there are often several different sequences
of octets to "spell" a particular character, and people don't know how
to type those weird (to them) characters anyway.

> I understand what you are saying, but what the world at large currently
> sees in terms of URLs is different. Every company is trying to get
> a nice domain name; there are even companies who do their
> business by organising such names for others. And every Webmaster
> is trying to make the URLs, esp. for entry points, easily recognizable
> and memorizable. Anything else is very bad marketing indeed.

Yep, it's indeed a problem.
Until there are better tools, people are going to try to make URLs that
are meaningful. But domains aren't going to become non-ASCII, and
neither will URLs -- for the same reason. People who try to do this
with their own URLs will only succeed in making it harder for other
folks to access their sites. People who build multilingual URL support
into their net browsers will only end up making them harder to use.

>
> >This same argument surfaces from time to time in the email world.
> >People want to use their real names as email addresses, and I don't
> >blame them. But the fact is that most people can't properly type in
> >a Japanese, Chinese, Korean, Hebrew, Russian, etc., name if they
> >don't themselves read Japanese, Chinese, Korean, Hebrew, Russian, etc.
>
> Email addresses are not the problem. Actually, for a mailto: URL,
> RFC 1522 provides a nice way to include your name, it looks
> like this:
> <mailto: mduerst@ifi.unizh.ch (=?ISO-8859-1?Q?Martin_J=2E_D=FCrst)>

Yes, I'm familiar with 1522. The problem I was referring to is when
people want the left-hand side of the @ sign to be their login name, or
whatever name is used on their LAN mail system, which happens to be in
some non-ASCII character set. Those people either have to get an ASCII
email address or do without email contact to the rest of the world.

It's really no different than people insisting on meaningful telex
addresses or meaningful phone numbers. Any worldwide address needs to
be in a universal, widely available, character set.

> >In either case, what we're going to end up with is a non-obvious
> >mapping between the (human-meaningful) "local" version of a name, and
> >the (transcribable) one that is used when talking to the outside
> >world. The best we can do is to build tools that help us manage this
> >mapping.
>
> We already have this. A Chinese file name, encoded in URL with lots
> of %HH, already has these two forms.
> One is the one with the %HH,
> the other is where these are resolved, and when displayed in the
> corresponding Chinese environment. The problem is that a) we don't
> call the readable one an URL, and b) for both sides, we don't have a
> clue (or not much of a clue, anyway) what the mapping is.

Right. My point is that things are just going to go more in this
direction. Even though it's ugly, it's the best solution (and also the
path of least resistance).

Another reason that I think things will go in this direction is that
we need to solve the "bad link" problem. One of the big reasons for
links becoming stale is that we want to use file naming hierarchies to
help us organize our files. But this conflicts with the need to
produce stable identifiers for use by the outside world, because you
have to reorganize hierarchies once in a while. So we're going to need
some layer that maps between external names (whether they be "URNs" or
"stable URLs") and local names (filenames) to provide stability. That
same layer can also provide charset mapping with little additional
cost, and without breaking people's ability to type URLs.

> >And there is a strong argument that (human-meaningful) names and
> >(machine-meaningful) addresses should be kept separate anyway. Make
> >the document titles human meaningful, let's build search services that
> >understand various character sets, and let the search services resolve
> >into pure-ASCII URIs.
>
> I have no problem with that if you restrict URIs in such a way (e.g.
> just allowing numbers or such) that even in the English-speaking
> part of the world, there is no danger that builders of search
> services think that the URL contains meaningful information.
> Currently they do, and that's one of the reasons we are thinking
> about the problem at hand.

I don't think we can restrict URLs in that way, but there is a
significant group of people who think URNs should be opaque to ordinary
users (like ISBNs are now).
So maybe we can solve the problem for the next generation anyway.

Keith
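
The octet-sequence point made earlier in this message (one name on a business card, several possible byte strings at the server) can be sketched roughly as follows. This is only an illustration under assumptions that are not in the message itself: the name "Dürst" is borrowed from the RFC 1522 example quoted above, `percent_encoded` is a hypothetical helper, and ISO-8859-1, UTF-8, and Unicode NFD decomposition are simply three plausible ways a client might end up "spelling" the same visible text.

```python
import unicodedata
from urllib.parse import quote


def percent_encoded(text: str, charset: str) -> str:
    """Hypothetical helper: encode text in a charset, then escape octets as %HH."""
    # safe="" means no reserved characters are left unescaped.
    return quote(text.encode(charset), safe="")


# The same visible name, as a reader might type it from paper.
name = "D\u00fcrst"                              # precomposed u-umlaut
decomposed = unicodedata.normalize("NFD", name)  # "u" + combining diaeresis

latin1 = percent_encoded(name, "iso-8859-1")
utf8 = percent_encoded(name, "utf-8")
utf8_nfd = percent_encoded(decomposed, "utf-8")

print(latin1)    # D%FCrst
print(utf8)      # D%C3%BCrst
print(utf8_nfd)  # Du%CC%88rst
```

All three strings look identical on screen before escaping, yet a server comparing octets would treat them as three different paths, which is the failure mode described above.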
Received on Wednesday, 9 August 1995 15:06:55 UTC