Re: Globalizing URIs

Keith Moore wrote, on a posting of mine:

>> Although an URL is technically spoken just a sequence of octets,
>> encoded with %HH if necessary, these octets in more cases than
>> not represent characters, and there are many occasions on which
>> it would be desirable to show the actual characters to a user,
>> which, in an international setting, is only possible if the
>> character set and encoding of these characters are known.
>
>I understand why people think it's a good idea, but I think it's 
>not possible in general to solve this problem.  There is a fundamental
>conflict between the desire to be able to input URIs from a keyboard,
>and the desire to be able to make URIs be "meaningful" to humans.
>If you try to accomodate more character sets, you compromise the former.

I do not question the need for a form of URI that consists only of
the most limited character set, for input with simple keyboards, etc.
But I think we need more.


>Even if you try to have two spellings of a URI (one in ASCII, the other
>human-meaningful to non-English speakers), the latter approach loses, because 
>it's more important that people be able to transcribe URIs than that they
>be able to understand them.  URIs are going to become *less* meaningful as
>time goes on anyway, because of other concerns (scalability, long-term 
>stability, etc.)  So trying to make them human-readable is a wash.

I understand what you are saying, but what the world at large currently
sees in terms of URLs is different. Every company is trying to get
a nice domain name; there are even companies whose business
is organising such names for others. And every Webmaster
is trying to make the URLs, esp. for entry points, easily recognizable
and memorable. Anything else is very bad marketing indeed.


>This same argument surfaces from time to time in the email world.
>People want to use their real names as email addresses, and I don't
>blame them. But the fact is that most people can't properly type in
>a Japanese, Chinese, Korean, Hebrew, Russian, etc., name if they
>don't themselves read Japanese, Chinese, Korean, Hebrew, Russian, etc.

Email addresses are not the problem. Actually, for a mailto: URL,
RFC 1522 provides a nice way to include your name; it looks
like this:
<mailto: mduerst@ifi.unizh.ch (=?ISO-8859-1?Q?Martin_J=2E_D=FCrst?=)>
But for documents from http or ftp, if somebody is able to read a document,
(s)he should on average be able to type in its title in the same language.
And for neither http nor ftp is there a solution available.
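
To illustrate the RFC 1522 form used in the mailto example, here is a
minimal sketch that decodes such an encoded-word with Python's standard
email.header module. The choice of Python is an assumption made for
illustration only; note that RFC 1522 requires the encoded-word to end
with "?=", so the string below is written that way.

```python
from email.header import decode_header, make_header

# Encoded-word as in the mailto example, written with the closing "?="
# that RFC 1522 requires.
encoded = "=?ISO-8859-1?Q?Martin_J=2E_D=FCrst?="

# decode_header() splits the text into (bytes, charset) pairs;
# make_header() reassembles them into a displayable string.
name = str(make_header(decode_header(encoded)))
print(name)
```

The point is that the charset label travels with the octets, so the
receiver knows how to display them. A %HH-encoded http or ftp URL carries
no such label.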


>In either case, what we're going to end up with is a non-obvious
>mapping between the (human-meaningful) "local" version of a name, and
>the (transcribable) one that is used when talking to the outside
>world.  The best we can do is to build tools that help us manage this
>mapping.

We already have this. A Chinese file name, encoded in an URL
with lots of %HH, already has these two forms. One is the one
with the %HH; the other is the one where these are resolved and
displayed in the corresponding Chinese environment. The problem
is that a) we don't call the readable one an URL, and b) on both sides,
we don't have a clue (or not much of a clue, anyway) what
the mapping is.
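
The two forms, and the missing mapping between them, can be sketched
roughly as follows. This is only an illustration using Python's
urllib.parse; the file name and the two encodings (GB2312 and Big5) are
assumptions chosen for the example, not something taken from a real
server.

```python
from urllib.parse import quote, unquote

file_name = "中文"  # hypothetical Chinese file name

# The same name yields different %HH forms depending on the encoding used.
as_gb2312 = quote(file_name, encoding="gb2312")
as_big5 = quote(file_name, encoding="big5")

# Resolving the %HH octets only gives back the name if the reader
# assumes the same encoding the writer used.
right = unquote(as_gb2312, encoding="gb2312")
wrong = unquote(as_gb2312, encoding="big5", errors="replace")

print(as_gb2312, as_big5)
print(right == file_name, wrong == file_name)
```

Nothing in the %HH form itself says which encoding was used, so a reader
in a different environment has to guess.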


>And there is a strong argument that (human-meaningful) names and
>(machine-meaningful) addresses should be kept separate anyway.  Make
>the document titles human meaningful, let's build search services that
>understand various character sets, and let the search services resolve
>into pure-ASCII URIs.

I have no problem with that if you restrict URIs in such a way (e.g.
just allowing numbers or such) that even in the English-speaking
part of the world, there is no danger that builders of search
services think that the URL contains meaningful information.
Currently they do, and that's one of the reasons we are thinking
about the problem at hand.

Regards,	Martin.

Received on Wednesday, 9 August 1995 12:38:46 UTC