Re: Globalizing URIs

Keith Moore
Wed, 09 Aug 1995 15:05:09 -0400

Message-Id: <>
From: Keith Moore <>
To: Martin J Duerst <>
Cc: (Keith Moore),,
Subject: Re: Globalizing URIs 
In-Reply-To: Your message of "Wed, 09 Aug 1995 18:37:52 +0200."
Date: Wed, 09 Aug 1995 15:05:09 -0400

> >I understand why people think it's a good idea, but I think it's 
> >not possible in general to solve this problem.  There is a fundamental
> >conflict between the desire to be able to input URIs from a keyboard,
> >and the desire to be able to make URIs be "meaningful" to humans.
> >If you try to accommodate more character sets, you compromise the former.
> I do not question the need for (a form of) a URI that consists only of
> the most limited character set, for input with simple keyboards, etc.
> But I think we need more.

My point is that by trying to "solve" this problem, you actually end
up making things far worse than they were.  The web already has enough
problems with links not working due to files that have moved, been
renamed, hosts that have been renamed, etc., without having to deal
with URLs that were misspelled because of character set problems.

The only thing that you gain from multilingual URLs is that they look
nice on a screen or on paper or on a business card.  And this comes at
a tremendous cost, because when someone tries to type in what they
think they see on a screen or business card, it will frequently
translate into some other sequence of octets that gets presented to
the ftp or http server.  This can happen for a wide variety of
reasons: there are dozens of different charsets in use, charset
translation tables aren't invertible, there are often several
different sequences of octets to "spell" a particular character, and
people don't know how to type those weird (to them) characters anyway.
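
To make the "several different sequences of octets" point concrete, here is
a minimal Python sketch. The sample name and the three spellings chosen are
assumptions for illustration only; many other charsets would each add their
own variant. It shows one visible string percent-encoding to three different
octet sequences.

```python
import unicodedata
from urllib.parse import quote

name = "D\u00fcrst"  # "Dürst" with a precomposed u-umlaut (illustrative sample)

def spellings(text):
    """Return percent-encoded forms of the same visible text."""
    # NFD splits the u-umlaut into "u" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFD", text)
    return {
        "iso-8859-1": quote(text, encoding="iso-8859-1"),
        "utf-8": quote(text, encoding="utf-8"),
        "utf-8, decomposed": quote(decomposed, encoding="utf-8"),
    }

forms = spellings(name)
for label, encoded in forms.items():
    print(f"{label:20} {encoded}")
```

A server that compares octets literally would treat these as three unrelated
names, even though all of them look the same on a business card.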

> I understand what you are saying, but what the world at large currently
> sees in terms of URLs is different. Every company is trying to get
> a nice domain name; there are even companies who do their
> business by organising such names for others. And every Webmaster
> is trying to make the URLs, esp. for entry points, easily recognizable
> and memorizable. Anything else is very bad marketing indeed.

Yep, it's indeed a problem.  Until there are better tools, people are
going to try to make URLs that are meaningful.  

But domains aren't going to become non-ASCII, and neither will URLs --
for the same reason.  People who try to do this with their own URLs
will only succeed in making it harder for other folks to access their
sites.  People who build multilingual URL support into their net
browsers will only end up making them harder to use.

> >This same argument surfaces from time to time in the email world.
> >People want to use their real names as email addresses, and I don't
> >blame them. But the fact is that most people can't properly type in
> >a Japanese, Chinese, Korean, Hebrew, Russian, etc., name if they
> >don't themselves read Japanese, Chinese, Korean, Hebrew, Russian, etc.
> Email addresses are not the problem. Actually, for a mailto: URL,
> RFC 1522 provides a nice way to include your name, it looks
> like this:
> <mailto: (=?ISO-8859-1?Q?Martin_J=2E_D=FCrst?=)>

Yes, I'm familiar with 1522.
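
For readers who have not seen an RFC 1522 encoded-word, a minimal sketch of
decoding one with Python's standard email package follows. The string is the
display name from the quoted example, written here with the closing "?="
that the encoded-word syntax requires; nothing else about the quoted mailto
form is assumed.

```python
from email.header import decode_header, make_header

# Encoded-word syntax: =?charset?encoding?encoded-text?=
# "Q" is the quoted-printable-like encoding; "_" stands for a space.
encoded = "=?ISO-8859-1?Q?Martin_J=2E_D=FCrst?="

# decode_header yields (bytes, charset) pairs; make_header joins them
# back into a Header object whose str() is the readable text.
decoded = str(make_header(decode_header(encoded)))
print(decoded)
```

This only recovers the display name. It says nothing about the address
itself, which is the part discussed next.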

The problem I was referring to is when people want the left-hand side
of the @ sign to be their login name, or whatever name is used on
their LAN mail system, which happens to be in some non-ASCII character
set.  Those people either have to get an ASCII email address or do
without email contact to the rest of the world.

It's really no different than people insisting on meaningful telex
addresses or meaningful phone numbers.  Any worldwide address needs to
be in a universal, widely available, character set.

> >In either case, what we're going to end up with is a non-obvious
> >mapping between the (human-meaningful) "local" version of a name, and
> >the (transcribable) one that is used when talking to the outside
> >world.  The best we can do is to build tools that help us manage this
> >mapping.
> We already have this. A Chinese file name, encoded in URL with lots
> of %HH, already has these two forms. One is the one with the %HH,
> the other is the form in which these are resolved and displayed in the
> corresponding Chinese environment. The problem is that a) we don't
> call the readable one a URL, and b) for both sides, we don't have a
> clue (or not much of a clue, anyway) what the mapping is.

Right.  My point is that things are just going to go more in this
direction.  Even though it's ugly, it's the best solution (and also
the path of least resistance).
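
The two forms, and the missing-mapping problem, can be sketched in a few
lines of Python. The file name and both charsets below are hypothetical
choices for illustration; the point is only that the %HH form carries octets
and no indication of which charset turns them back into readable text.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

filename = "\u4e2d\u6587"  # hypothetical two-character Chinese file name

# Server side: the name exists as octets in some local charset
# (GB2312 is an assumption made for this example).
octets = filename.encode("gb2312")
url_form = quote_from_bytes(octets)  # the transcribable %HH form

def readable_form(percent_encoded, charset):
    """Resolve the %HH escapes and interpret the octets in one charset."""
    raw = unquote_to_bytes(percent_encoded)
    return raw.decode(charset, errors="replace")

right = readable_form(url_form, "gb2312")  # reader guesses the charset correctly
wrong = readable_form(url_form, "big5")    # reader guesses a different charset
print(url_form, right, wrong)
```

Only the reader who already knows the charset gets the original name back.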

Another reason that I think things will go in this direction is that
we need to solve the "bad link" problem.  One of the big reasons for
links becoming stale is that we want to use file naming hierarchies to
help us organize our files.  But this conflicts with the need to
produce stable identifiers for use by the outside world, because you
have to reorganize hierarchies once in a while.  So we're going to need
some layer that maps between external names (whether they be "URNs" or
"stable URLs") and local names (filenames) to provide stability.  That
same layer can also provide charset mapping with little additional
cost, and without breaking people's ability to type URLs.
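
A minimal sketch of such a mapping layer, assuming nothing more than an
in-memory table, might look like the following. All names, paths, and the
functions resolve and relocate are hypothetical; a real service would need
persistent storage and some form of access control.

```python
from typing import Optional

# Hypothetical table: stable, pure-ASCII external names on the left,
# current local file names (in any charset) on the right.
NAME_MAP = {
    "/docs/annual-report-1995": "/archive/1995/\u5e74\u6b21\u5831\u544a.html",
    "/docs/staff-list": "/admin/people/staff.html",
}

def resolve(external_name: str) -> Optional[str]:
    """Map a published name to the current local file name, if any."""
    return NAME_MAP.get(external_name)

def relocate(external_name: str, new_local_name: str) -> None:
    """Reorganize local files without changing the published name."""
    NAME_MAP[external_name] = new_local_name

before = resolve("/docs/staff-list")
relocate("/docs/staff-list", "/personnel/1996/staff.html")
after = resolve("/docs/staff-list")
print(before, "->", after)
```

The published name stays typeable and unchanged, while the local side is
free to be renamed, moved, or written in whatever charset the site prefers.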

> >And there is a strong argument that (human-meaningful) names and
> >(machine-meaningful) addresses should be kept separate anyway.  Make
> >the document titles human meaningful, let's build search services that
> >understand various character sets, and let the search services resolve
> >into pure-ASCII URIs.
> I have no problem with that if you restrict URIs in such a way (e.g.
> just allowing numbers or such) that even in the English-speaking
> part of the world, there is no danger that builders of search
> services think that the URL contains meaningful information.
> Currently they do, and that's one of the reasons we are thinking
> about the problem at hand.

I don't think we can restrict URLs in that way, but there is a
significant group of people who think URNs should be opaque to
ordinary users (like ISBNs are now).  So maybe we can solve the
problem for the next generation anyway.