Re: Globalizing URIs

Martin J Duerst (mduerst@ifi.unizh.ch)
Fri, 11 Aug 1995 22:10:40 +0200 (MET DST)


Message-Id: <9508112011.AA16938@mocha.bunyip.com>
Subject: Re: Globalizing URIs
To: moore@cs.utk.edu (Keith Moore)
Date: Fri, 11 Aug 1995 22:10:40 +0200 (MET DST)
Cc: mduerst@ifi.unizh.ch, moore@cs.utk.edu, FisherM@is3.indy.tce.com,
In-Reply-To: <199508102157.RAA10879@wilma.cs.utk.edu> from "Keith Moore" at Aug 10, 95 05:57:15 pm
From: Martin J Duerst <mduerst@ifi.unizh.ch>


>I wish you luck.  The problem is that there isn't one "nice" form,
>there are lots of them, and you don't have control over how these
>things get passed around.
>
>Example:
[Next sentence moved up from below.]
>If all of this works, it will be a miracle.

So let's see how this miracle works, for the proposals I have
identified with A) in my proposal list:
(again, upper case is assumed to stand for characters outside ASCII)

>The author takes a filename on a file server (in the server's local charset)

Assume this looks like "AA.html", with a corresponding plain form of
"%aa%aa.html". 


>and translate it to a multilingual URL.

Assume the encoding (MIME "charset") is "myenc". With the proposal,
this gives a nice form that looks like "[myenc]AA.html", with plain
"[myenc]%aa%aa.html".

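To make the relation between the two forms concrete, here is a minimal
sketch in Python. It is only an illustration of proposal A) as described
above, not a specification: the function names are made up, the choice to
escape every octet outside printable ASCII (plus "%" itself) is an
assumption, and ISO-8859-1 merely stands in for "myenc".

```python
import re

# "[charset]rest" -- the tag syntax follows the "[myenc]AA.html" example.
TAG = re.compile(r"^\[([^\]]+)\](.*)$", re.DOTALL)

def _split(url: str):
    m = TAG.match(url)
    if m is None:
        raise ValueError("missing [charset] tag")
    return m.group(1), m.group(2)

def nice_to_plain(nice: str) -> str:
    """Hypothetical helper: nice form -> plain (pure ASCII) form."""
    charset, rest = _split(nice)
    octets = rest.encode(charset)
    # Assumption: keep printable ASCII except '%', escape everything else.
    escaped = "".join(
        chr(b) if 0x21 <= b <= 0x7E and b != 0x25 else "%%%02x" % b
        for b in octets
    )
    return "[%s]%s" % (charset, escaped)

def plain_to_nice(plain: str) -> str:
    """Hypothetical helper: plain form -> nice form."""
    charset, rest = _split(plain)
    # Replace each %xx escape by its octet, then decode in the tagged charset.
    octets = re.sub(
        rb"%([0-9a-fA-F]{2})",
        lambda h: bytes([int(h.group(1), 16)]),
        rest.encode("ascii"),
    )
    return "[%s]%s" % (charset, octets.decode(charset))

print(nice_to_plain("[iso-8859-1]\u00c9\u00c9.html"))
```

The tag is what makes the mapping well defined in both directions: without
it, the octets in the plain form could not be turned back into characters.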

>The reader's web browser displays that URL so that it looks nice.

In the path from the server to the browser, the actual octets behind
"AA" may have changed, but the http protocol provides information
so that this conversion is done correctly. It is the same conversion
that is taking place for the whole rest of the text. The important
thing is that "AA" still LOOKS the same.


>The user copies that URL with a mouse into another window
>maybe into a word processor that uses a different charset than the browser.

Applications on the same machine are usually quite uniform in their
use of character encoding. Otherwise, it is the responsibility of the
applications to en- and decode the characters properly according
to the conventions of the window system. X11, for example, has such
conventions, and it is not too difficult for an application to do the right
thing. If it's not done correctly, URLs will not be the first that are
affected. All native text will be equally affected, and the user will
not get the idea to copy anything.


>The document gets printed out

Given reasonable printer settings, the nice URL will still look nicely
the same, namely "[myenc]AA.html".


>or maybe emailed through a gateway that translates from
>the local charset into one that's more likely to be usable by a
>typical MIME mail reader.

MIME, specifically RFC 1521, was indeed designed to care for
such problems. Again, if the URL won't make it, the rest of the
text won't, either, and so the user will never think about sending
an URL that way. After transmission, the octets that represent
"AA" may have changed, but the URL still looks nicely the same.


>Someone else gets that document and types in the URL, whereupon

Whatever encoding the editor or widget uses, the user will know
how to enter the URL so that it looks as nice as before:
"[myenc]AA.html".


>it gets transmitted to the file server, and the
>file server tries to translate the URL back to a filename.

The program that deals with URL resolution in the browser finally
has to use the information [myenc] that has been carried along
all the time. The program knows its internal encoding and the
"myenc" encoding, and will convert the octet representation
of "AA.html" from one to the other, and then submit the request.
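
As a rough sketch of this last step (in Python, with a hypothetical
function name): the browser holds the nice URL as octets in its own
internal encoding, reads the tag, and re-encodes the rest in the tagged
encoding. Whether the tag itself is sent along with the request is not
settled here, so the sketch simply returns both parts. ISO-8859-1 again
stands in for "myenc"; UTF-8 and Mac Roman are arbitrary examples of
internal encodings.

```python
def resolve(displayed: bytes, internal: str):
    """Hypothetical helper: octets of the nice URL in the browser's
    internal encoding -> (charset tag, escaped octets for the request)."""
    text = displayed.decode(internal)          # the characters the user sees
    if not text.startswith("[") or "]" not in text:
        raise ValueError("missing [charset] tag")
    charset, rest = text[1:].split("]", 1)
    octets = rest.encode(charset)              # convert internal -> myenc
    # Assumption: escape every octet outside printable ASCII, and '%'.
    escaped = "".join(
        chr(b) if 0x21 <= b <= 0x7E and b != 0x25 else "%%%02x" % b
        for b in octets
    )
    return charset, escaped

nice = "[iso-8859-1]\u00c9\u00c9.html"
# Two browsers with different internal encodings hold different octets...
in_utf8 = nice.encode("utf-8")
in_mac = nice.encode("mac_roman")
# ...but both arrive at the same request.
print(resolve(in_utf8, "utf-8"))
print(resolve(in_mac, "mac_roman"))
```

The octets differ along the way, but since the URL looks the same and the
tag has been carried along, both browsers end up asking for the same thing.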


The same story could be told for my other proposals, with slight
modifications. There are no miracles. If all these transmissions
work for plain text, they will work for URLs. And if they don't,
no user will get the idea that an URL should be sent that way.


>The only way I can see that this would work would be to *always* keep
>the "backward compatibility" pure-ASCII form attached to the "pretty"
>one.  This would mean, for instance, that when you "copy" a URL from
>your web browser to another application, it would include the
>pure-ASCII form -- even if the user only saw the pretty one in his URL
>window.

Obviously, this is not needed. In the "Unicode" proposal (part B) of my
proposal list), there is in fact a fixed mapping between the two representations.
What matters for the "nice" form is how it looks
(and not how it is encoded).
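
To show what a fixed mapping could look like, here is a small Python
sketch. Which octet encoding of Unicode would be used is not fixed in this
message; UTF-8 is assumed here purely for illustration, and the function
names are hypothetical. Only the path is converted, and "/" is left alone.

```python
from urllib.parse import quote, unquote

def nice_path_to_plain(path: str) -> str:
    """Hypothetical helper: characters -> %-escaped octets (UTF-8 assumed)."""
    return quote(path, safe="/", encoding="utf-8")

def plain_path_to_nice(path: str) -> str:
    """Hypothetical helper: %-escaped octets -> characters (UTF-8 assumed)."""
    return unquote(path, encoding="utf-8")

plain = nice_path_to_plain("/STAFF/\u00c9.html")
print(plain)
print(plain_path_to_nice(plain) == "/STAFF/\u00c9.html")
```

Because the mapping is the same for every URL, no tag has to be written
down, and nothing but the visible characters needs to be copied.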


>Presumably, users would learn to include both the pretty URL
>and the ASCII one on paper documents and business cards -- (much as
>Japanese business cards I've seen that include formally written,
>phonetic, and romanized versions of names and titles on the same
>card.)

Not exactly. A Japanese business card would contain an URL
<[ISO-2022-JP]http://www.mycompany.co.jp/STAFF/MYNAME.html>
(uppercase being Japanese) that would point to a Japanese document,
and on the backside an URL
<http://www.mycompany.co.jp/staff/myname.html>
pointing to an English document. For proposals B) and C), the Japanese
form will look only like
<http://www.mycompany.co.jp/STAFF/MYNAME.html>

As an aside, Japanese business cards don't contain phonetic transcriptions
(other than the romanized versions intended for foreigners).


>And you will probably need to make sure that the "charset tag" is
>always part of the URL is *visible* -- even when displaying it in that
>charset.  (otherwise, when the URL is copied with pencil and paper and
>then back to a keyboard, the app will not be able to tell which
>charset it was in and may interpret it differently.)

For proposal A), you are right. For the others, no.


>> Well, to those that read and type them, these characters are very natural,
>> and the ASCII characters, natural for us, may feel strange. As for different
>> representations, in Japan, there are more representations for an 'a' than
>> for the average Japanese Kanji!
>
>I don't doubt that.  But my guess is that any Japanese user of
>Internet email already knows how to generate the octet value for 'm'
>in 'mduerst@ifi.unizh.ch' in such a way that your mail reader will
>accept it.

I don't deny that. But your original post gave the impression that it would
be more difficult for a Japanese user to enter Kanji than to enter an 'm',
which is definitely not the case.


>My point was that we really need better standardized ways to find
>documents than by typing in URLs, and better ways to learn about the
>characteristics of a document than by examining its URL.  Until we
>have them, we're going to keep trying to put features into URLs that
>don't belong there, like the title of the document, and content-type
>information, and whether it's suitable for children.

See my answer to Karen for my opinion on this point.


>> With the present state of affairs, yes. But not if we find good
>> solutions.
>
>Again, I wish you luck.  I pray that you find a good solution.
>But please remember that a poor solution to this problem could
>well be worse than not solving it at all.

I agree. That's why I gave this long list of proposals. And I am looking
forward to specific comments.


>> >It's really no different than people insisting on meaningful telex
>> >addresses or meaningful phone numbers.  Any worldwide address needs to
>> >be in a universal, widely available, character set.

I forgot to say this in my last posting: Please stop calling ASCII
a universal character set. Unicode/ISO 10646 is universal, ASCII
is just a poor lowest common denominator.


>> It IS different. Japanese are at least as good as Americans to
>> create puns and remembering aids for numbers. But there is a
>> clear imbalance if English-language people and companies
>> can use their names straight, whereas others have to use them
>> in a mutilated form. For domain names and email addresses,
>> there has to be a number only (or ASCII only) form, but
>> for document names and such, there is no such need.
>
>I see your point, but I think that the solution (as in the case of
>telephone numbers) is not to change the address, but to build a
>directory.  Perhaps those that don't ordinarily use latin characters
>have more incentive to build it, but the rest of us will find it
>useful nonetheless.

Guess what: If URLs were limited to numbers or consonants only,
we would already have these nice directory services. It's really
easy to say "these are our intentions, but we implement them so
that WE don't have to do the work".


>> >Right.  My point is that things are just going to go more in this
>> >direction.  Even though it's ugly, it's the best solution (and also
>> >the path of least resistance).
>> 
>> The tools you mention need something to start with.
>
>The trick is to get things going in the right direction, so that we
>don't paint ourselves into another corner.

Well, for the last few years and with the present implementation,
we obviously didn't get the "trick", and we already are in that corner,
so the problem is now to make sure that everybody, and not just
the English-speaking part of the world, is feeling reasonably
comfortable in that corner.
If we ever get a second chance (and for URLs as used in WWW and
such I doubt that this will happen, but I don't know too much about
what may be in the oven), we can make a second attempt at getting
the trick, and hopefully we will get it.

Regards,	Martin.