Re: Globalizing URIs

Martin J Duerst (mduerst@ifi.unizh.ch)
Thu, 10 Aug 1995 20:33:05 +0200 (MET DST)


Message-Id: <9508101833.AA07865@mocha.bunyip.com>
Subject: Re: Globalizing URIs
To: moore@cs.utk.edu (Keith Moore)
Date: Thu, 10 Aug 1995 20:33:05 +0200 (MET DST)
Cc: mduerst@ifi.unizh.ch, moore@cs.utk.edu, FisherM@is3.indy.tce.com,
In-Reply-To: <199508091905.PAA08849@wilma.cs.utk.edu> from "Keith Moore" at Aug 9, 95 03:05:09 pm
From: Martin J Duerst <mduerst@ifi.unizh.ch>


>My point is that by trying to "solve" this problem, you actually end
>up making things far worse than they were.  The web already has enough
>problems with links not working due to files that have moved, been
>renamed, hosts that have been renamed, etc., without having to deal
>with URLs that were misspelled because of character set problems.

Well, of course we have to work on a proposal that avoids these
problems as far as possible. And as with hosts and files being
renamed and moved, it looks as if people have to get it wrong
once anyway to get a feeling for it and learn what to do.
Calling your WWW host www.university.edu instead of
ibm360.university.edu very quickly became standard practice,
and something similar can be expected for our problem.

>The only thing that you gain from multilingual URLs is that they look
>nice on a screen or on paper or on a business card.  And this comes at
>a tremendous cost, because when someone tries to type in what they
>think they see on a screen or business card, it will frequently
>translate into some other sequence of octets that gets presented to
>the ftp or http server.

Therefore we have to ensure that the mapping between the "nice" form
and the "plain" form is clear, and provide the necessary mechanisms.

>This can happen for a wide variety of
>reasons: there are dozens of different charsets in use, charset
>translation tables aren't invertible, there are often several
>different sequences of octets to "spell" a particular character, and
>people don't know how to type those weird (to them) characters anyway.

Well, to those who read and type them, these characters are very natural,
and the ASCII characters, natural for us, may feel strange. As for different
representations: in Japan, there are more encoded representations of an 'a'
than of the average Japanese Kanji!
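
To make the point about 'a' concrete, here is a rough sketch in Python
(used purely for illustration). The choice of the three charsets and the
helper name octet_spellings are my own assumptions; it simply counts how
many distinct octet sequences the ASCII 'a' and the fullwidth 'a' produce
under encodings commonly used for Japanese text.

```python
from urllib.parse import quote

# Two visually similar forms of the same letter (illustrative choice):
# ASCII 'a' and FULLWIDTH LATIN SMALL LETTER A (U+FF41).
FORMS = ["a", "\uff41"]

# Charsets commonly used for Japanese text (assumed set, not exhaustive).
CHARSETS = ["shift_jis", "euc_jp", "iso2022_jp"]


def octet_spellings(chars, encodings):
    """Return the set of distinct octet sequences for the given characters."""
    seen = set()
    for ch in chars:
        for enc in encodings:
            seen.add(ch.encode(enc))
    return seen


spellings = octet_spellings(FORMS, CHARSETS)

# Show each spelling the way it would appear in a URL with %HH escapes.
for octets in sorted(spellings):
    print(quote(octets))
```

Even with only two forms and three charsets, a server can be handed
several different octet sequences for what a reader sees as one letter.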


>> I understand what you are saying, but what the world at large currently
>> sees in terms of URLs is different. Every company is trying to get
>> a nice domain name; there are even companies who do their
>> business by organising such names for others. And every Webmaster
>> is trying to make the URLs, esp. for entry points, easily recognizable
>> and memorizable. Anything else is very bad marketing indeed.
>
>Yep, it's indeed a problem.  Until there are better tools, people are
>going to try to make URLs that are meaningful.  

It's not only a tool problem. Newspapers and other print media will be
around for quite some time yet, and URLs printed there have to be
read and typed in by people.

>But domains aren't going to become non-ASCII, and neither will URLs --
>for the same reason.  People who try to do this with their own URLs
>will only succeed in making it harder for other folks to access their
>sites.  People who build multilingual URL support into their net
>browsers will only end up making them harder to use.

With the present state of affairs, yes. But not if we find good
solutions.


>It's really no different than people insisting on meaningful telex
>addresses or meaningful phone numbers.  Any worldwide address needs to
>be in a universal, widely available, character set.

It IS different. The Japanese are at least as good as Americans at
creating puns and mnemonics for numbers. But there is a
clear imbalance if English-speaking people and companies
can use their names as they are, whereas others have to use them
in a mutilated form. For domain names and email addresses,
there has to be a numbers-only (or ASCII-only) form, but
for document names and the like, there is no such need.


>> >In either case, what we're going to end up with is a non-obvious
>> >mapping between the (human-meaningful) "local" version of a name, and
>> >the (transcribable) one that is used when talking to the outside
>> >world.  The best we can do is to build tools that help us manage this
>> >mapping.
>>
>> We already have this. A Chinese file name, encoded in URL with lots
>> of %HH, already has these two forms. One is the one with the %HH,
>> the other is where these are resolved, and when displayed in the
>> corresponding Chinese environment. The problem is that a) we don't
>> call the readable one an URL, and b) for both sides, we don't have a
>> clue (or not much of a clue, anyway) what the mapping is.
>
>Right.  My point is that things are just going to go more in this
>direction.  Even though it's ugly, it's the best solution (and also
>the path of least resistance).

The tools you mention need something to start with: a defined mapping
between the two forms.
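
As a minimal sketch of why the mapping has to be defined, the following
Python fragment (illustration only) converts one readable file name into
its %HH form under three charsets and back. The file name, the charset
list, and the helper names plain_form and nice_form are hypothetical
choices of mine, not part of any proposal.

```python
from urllib.parse import quote, unquote

# Hypothetical Chinese document name used only for this example.
NAME = "\u4e2d\u6587.html"


def plain_form(nice, charset):
    """Encode the readable name in the given charset, then %HH-escape it."""
    return quote(nice.encode(charset))


def nice_form(plain, charset):
    """Resolve the %HH escapes and decode the octets in the given charset."""
    return unquote(plain, encoding=charset, errors="replace")


# The same readable name yields a different "plain" URL form per charset.
for charset in ("big5", "gb2312", "utf-8"):
    print(charset, plain_form(NAME, charset))

# Decoding with the charset that was used for encoding recovers the name;
# guessing a different charset does not.
gb_plain = plain_form(NAME, "gb2312")
print(nice_form(gb_plain, "gb2312") == NAME)
print(nice_form(gb_plain, "utf-8") == NAME)
```

Without knowing which charset was used, neither side can get from the
%HH form back to the readable one, which is exactly point b) above.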

Regards, 	Martin.