Re: Globalizing URIs

Martin J Duerst (mduerst@ifi.unizh.ch)
Thu, 17 Aug 1995 15:22:18 +0200 (MET DST)


Message-Id: <9508171323.AA21577@mocha.bunyip.com>
Subject: Re: Globalizing URIs
To: moore@cs.utk.edu (Keith Moore)
Date: Thu, 17 Aug 1995 15:22:18 +0200 (MET DST)
Cc: mduerst@ifi.unizh.ch, moore@cs.utk.edu, uri@bunyip.com
In-Reply-To: <199508162232.SAA25273@wilma.cs.utk.edu> from "Keith Moore" at Aug 16, 95 06:32:42 pm
From: Martin J Duerst <mduerst@ifi.unizh.ch>


>That's fine.  HTTP servers, at least, are fairly free to work out
>whatever mapping they wish between URLs and filenames.  (It's less
>clear about whether FTP servers can do this).

I guess that as far as most of my proposals are concerned, this is
not a problem with FTP, either. FTP servers are not forbidden to serve
an additional file telling the client how to reasonably interpret the
filenames they serve. Also, I think the FTP protocol doesn't specify
that exactly the same binary representation has to be used for
filenames locally as over the protocol connection.
And if we create a new scheme such as HFTP, then we have even
fewer restrictions.
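
As a minimal sketch of the point about local versus on-the-wire
filename representations: the Python below assumes, purely for
illustration, a server that stores names in Shift_JIS and has agreed
with its clients to exchange them in EUC-JP. Neither choice comes from
the FTP specification, and the function names are hypothetical.

```python
# Hypothetical encodings: neither is prescribed by FTP itself.
LOCAL_ENCODING = "shift_jis"  # how the server's file system stores names (assumed)
WIRE_ENCODING = "euc_jp"      # what server and client agreed to exchange (assumed)


def local_to_wire(name: bytes) -> bytes:
    """Re-encode a locally stored filename for the protocol connection."""
    return name.decode(LOCAL_ENCODING).encode(WIRE_ENCODING)


def wire_to_local(name: bytes) -> bytes:
    """Map a filename received from a client back to the local form."""
    return name.decode(WIRE_ENCODING).encode(LOCAL_ENCODING)


# A three-character Japanese name plus an ASCII extension, written with escapes.
local_name = "\u65e5\u672c\u8a9e.txt".encode(LOCAL_ENCODING)
wire_name = local_to_wire(local_name)

assert wire_name != local_name                 # different bytes on the wire
assert wire_to_local(wire_name) == local_name  # the mapping round-trips
```

The client never needs to know the local representation; it only needs
to know what is used on the connection, which is what the additional
descriptive file mentioned above could tell it.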

>> I do not assume that anybody writing an English document is giving
>> it a Japanese document name (and thus, for some protocols, a
>> HRI containing some Japanese). In some posts, I assumed this
>> implicitly, and in other I have said it explicitly. And I am going
>> a little further into details here.
>> I assume that anybody in Japan writing an English document for
>> a worldwide public will, with just a little bit of common sense,
>> and at least on his/her second try if not on the first, understand
>> that that document should have a HRI that can be typed by somebody
>> that understands only English.
>
>If you want an HRI that is only available to others in the same
>country, this is fine.  I'm thinking in terms of worldwide
>interoperability.

When I said "Japan" above, this was just an example; Japanese HRIs
are in no way restricted to Japan, although they are restricted
to people who speak and read Japanese, which makes a lot of sense
if they refer to documents that only these people are able to understand.


>My understanding is that there is a bit more
>uniformity in Japan about which character sets are in use, than there
>are in other parts of the world.

In this respect, I would say that Japan is just about an average case.
What matters is not so much how many variants are around, but that
there are different variants at all (which clearly applies to Japan,
as PCs and workstations use different encodings, both of which differ
from the encoding used in email and such), and that this will prevent
us from making uniformity assumptions that are too optimistic.
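
To make the non-uniformity concrete, here is a small Python sketch. The
pairing of platforms with encodings (Shift_JIS on PCs, EUC-JP on
workstations, ISO-2022-JP in email) is an assumption added for
illustration; the message itself does not name the encodings.

```python
TEXT = "\u65e5\u672c\u8a9e"  # a three-character Japanese word, written with escapes

# Assumed platform-to-encoding pairing, for illustration only.
ENCODINGS = {
    "PC": "shift_jis",
    "workstation": "euc_jp",
    "email": "iso2022_jp",
}

# The same characters become three different byte sequences.
encoded = {where: TEXT.encode(enc) for where, enc in ENCODINGS.items()}


def misread(data: bytes, assumed: str):
    """Decode bytes under an assumed encoding; return None if that fails."""
    try:
        return data.decode(assumed)
    except UnicodeDecodeError:
        return None


for where, data in encoded.items():
    print(where, data.hex())

# Reading PC bytes as if they were workstation bytes does not give the text back.
assert misread(encoded["PC"], ENCODINGS["workstation"]) != TEXT
```

Each sequence decodes correctly only when the receiver knows which
encoding was used, which is why a uniformity assumption is too
optimistic.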


>But just taking Japanese as an
>example, what happens if a Japanese-fluent person in the US wants to
>read a Japanese document that has a Japanese HRL?  He can read
>Japanese but perhaps doesn't have a Japanese keyboard.  How is he
>going to type in the HRL?

A "Japanese keyboard" is nowadays a software issue. The ease or difficulty
with which one can install additional software for reading and
entering foreign languages is not the same on every system, but
installation is already quite easy and steadily getting easier. It is
still the case that additional software has to be bought or installed,
and this is easier on, e.g., a Mac than on a Unix system, but that just
reflects the general difference in ease of use between these systems.
On a Mac, you can buy the JLK (Japanese Language Kit; there is also a
Chinese one and probably a Korean one, and for other languages you
don't even need that much) worldwide, and installation takes no more
than an hour (mainly because of the many floppies with the large
fonts). For a Unix system with X11, the MIT distribution already
contains several items necessary for Japanese, and their installation
is no problem if you have a friendly system administrator (who does
not need to speak Japanese).


>It's my understanding that there are many countries in which there is
>little or no uniformity from one site to another (or even sometimes
>within a particular site) as to what character sets are used.  There
>are also groups of countries that share a language, but have different
>conventions for what character set to use within that country.

The proposals I have made can very well take care of non-uniformity.
Some do assume uniformity for a single site, but others don't.
And as I have said, Japan is a very suitable example in this respect.
Also, I can tell you that Japan is not the only country I am familiar
with (although I have to say it is the one, besides western Europe,
I am most familiar with). I have implemented software for Korean
input, for Arabic display, and for conversions for many other
places.


>> There are vulnerabilities in all character sets (e.g. the ASCII/EBCDIC
>> problem for present URLs), but the vulnerability may indeed be said
>> to be lower for ASCII than for things that go beyond.
>> Nevertheless, these vulnerabilities are now very well understood,
>> and satisfactory solutions are available and installed. With respect
>> to ISO 8859/1, a HTML web document will display Western European
>> accents correctly on hardware such as Mac, PC, and NeXT, even though
>> they use different native encodings of accented characters.
>
>Yes, they'll display the same, but does that mean that the users know
>how to type them in?

With the increasing spread of the web, there might indeed appear
users who use the computer more or less like interactive TV, just
clicking with the mouse on a button, and not knowing or using a keyboard
anymore. But this is a problem that may appear in the US just as well as
elsewhere.

>What happens if you copy a section of text from
>the web browser (which understands 8859/1) to another application that
>understands a Mac charset or a PC code page?

I just gave it a quick try with Netscape on my Mac. On cut/copy/paste, it
does exactly what the average user would expect, and what you seem
to find hard to believe is possible. And this applies not only
to the translation from ISO 8859-1 to Mac-Roman, but
also to the translation from the various Japanese encodings to
the one used on the Mac.
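
The kind of translation involved can be sketched in a few lines of
Python. This is not a description of how Netscape does it; the function
name is hypothetical, and the sketch only shows that converting between
the two encodings is a matter of going through the abstract characters.

```python
def latin1_to_macroman(data: bytes, errors: str = "replace") -> bytes:
    """Translate ISO 8859-1 bytes to Mac-Roman bytes via the abstract characters.

    A few ISO 8859-1 characters have no Mac-Roman counterpart; with
    errors="replace" they become "?" instead of raising an exception.
    """
    text = data.decode("iso-8859-1")
    return text.encode("mac_roman", errors=errors)


copied = b"caf\xe9"                  # "cafe" with e-acute, as ISO 8859-1 bytes
pasted = latin1_to_macroman(copied)

assert pasted == b"caf\x8e"          # e-acute sits at 0x8E in Mac-Roman
assert pasted.decode("mac_roman") == copied.decode("iso-8859-1")
```

The byte values differ, but the user sees the same accented character
before and after the paste.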


>> Also, in Japan, News and Email is working among a wide variety
>> of platforms although on various platforms, 3 different encodings
>> of the same character repertoire are in use. Apart from issues
>> such as those we have with '1', 'I', and 'l', which Japanese
>> are aware of as well as we are aware of them, there are no more
>> problems today having a Japanese text entered into a system,
>> transmitted to another, copied to paper from the screen by hand
>> or printed out, and entered again. Otherwise, Japanese information
>> processing would indeed by in a very bad state. The same applies
>> for other areas.
>
>I'm glad to hear that things are going so well in Japan, but I'm
>told that things are not so nice in other areas.

Different areas are at different stages of developing their information
infrastructure. And I don't demand that we design a scheme so that
we can have HRIs for a script for which email conventions are not
yet reasonably established. It is very clear that having native email
capabilities, and more generally the capability of exchanging document
contents reliably, is more important than document names and
identifications.
But what we should be working on is a proposal so that once
conventions for document contents are established (e.g. by
MIME types being defined, or by defining equivalences to the
global ISO 10646, depending on the proposal), this can be extended
without additional work to resource identifiers.


>I agree with that statement.  It's not sufficient to simply say "this
>must be solved by the upper layer".  We must BUILD the upper layers.
>On the other hand, it will be up to Japanese speakers to build upper
>layers (search services that map titles to URLs or URNs) that
>understand Japanese.  I don't see any reason why these can't be built
>now to link Japanese titles to URLs, and modified later to link
>Japanese titles to URNs.

I think it is safe to say that such upper layers shouldn't be designed
and implemented separately for each language, script, or country,
but that as far as possible a general solution should be sought.
I don't request that the US develop everything for everybody;
the Japanese and many others around the world can very well contribute
their share. But as I have said before, I think it is unfair to pretend
that one has a global and abstract solution when in practice it favors
some groups over others, and then just tell them: well, if you want
the same functionality, why not do it yourselves.


>> The many local encodings besides Unicode/ISO 10646 will
>> most probably be a vanishing breed in the future.
>
>The jury is still out on Unicode.  We've got at least one of almost
>every major computer system here, but I don't know of a single one of
>them that supports Unicode in its current release.  (pre-released
>products don't count)

Please have a look at http://www.stonehand.com/unicode/products.html,
and that site in general. Windows NT, Penpoint, and the Newton definitely
work with Unicode. And there is a Unicode "locale" for IBM AIX; you only
have to take the trouble to get and install it.
And the jury on Unicode depends as much on companies implementing
it (and I can assure you that most major players definitely are working
on that, although their plans for deployment and distribution may
differ) as it depends on other communities to assess it and propose
its use, in adequate forms and places, when they see that it indeed
can simplify things and solve problems.


>I've mentioned the specific problems I see, and I think they're pretty
>serious.

I have given more details on the problems you see. If you still have
some questions, please follow up.


>If you assume that they're transcribable, or you limit the
>domain of applicability of HRLs to environments where they are
>transcribable, you might be able to address the rest of the concerns.

Many thanks for this (at least partially) positive statement.
The fact that HRIs come after mail and such pretty much automatically
assures transcribability (in those environments where it makes sense
to have a look at the documents, anyway).


>Okay, but be careful about assuming that things are as nice everywhere
>as they are in Japan, and make sure you think about speakers of a
>particular language living outside of the nice environment for that
>language -- sometimes there are large numbers of these.

I have addressed these concerns above. I am myself in that situation
(using Japanese outside Japan) and know how it feels. The one point
where Japan (and the Japanese outside Japan) may be special is that
it is a very attractive market financially, but that is just one more
argument for why we should work on general solutions (and the proposals
I have made in no way privilege Japan), and not wait until somebody
in some country comes up with a solution that may work there in some way
or another, but will lead to even clumsier solutions if it is transferred
to other places. That has happened many times before, unfortunately.


>> First, is it lengthy, which is especially unconvenient for business cards
>> and such. Most of what I have proposed is considerably shorter.
>
>This can be dealt with.
>Format them like so:
>
>http://dom.ain/[document-id]/human-readable-string

>Being realistic, this won't help people to remember a particular URL.

Yes. It just gives half of the benefits of what a current URL
gives an English reader or writer. If possible, I would like
to do better than that.


>> Second, the "[unicode]" prefix is not exactly needed. If the
>> "some-random-string" is in the same encoding as the document as a
>> whole, then there is no need to explain the encoding. 
>
>Maybe, maybe not.  If you want the things to survive being emailed
>around, I'd still recommend that you encode things in ASCII and
>include a charset name (similar to RFC 1522).

If things are mailed around as part of a document, the encoding
(MIME charset parameter) is of course included (but it is RFC 1521
that applies in this case), and sometimes it is implied between
people who know which encodings the others use.
Whether the results can be called ASCII is a different question;
they may be 7-bit or 8-bit, or whatever, and may further
be subjected to a transfer encoding such as BASE64.
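
As a rough sketch of what "charset parameter plus transfer encoding"
means for a document body, the Python below labels a line of text with
its charset and then applies BASE64. The helper names, the dictionary
layout, and the example identifier are assumptions for illustration; a
real mailer would use a full MIME implementation.

```python
import base64


def label_and_encode(text: str, charset: str):
    """Return (headers, body) in the style of an RFC 1521 text body part."""
    raw = text.encode(charset)
    headers = {
        "Content-Type": f"text/plain; charset={charset}",
        "Content-Transfer-Encoding": "base64",
    }
    body = base64.encodebytes(raw).decode("ascii")
    return headers, body


def decode_labelled(headers, body: str) -> str:
    """Undo the transfer encoding, then interpret the bytes via the charset label."""
    charset = headers["Content-Type"].split("charset=")[1]
    raw = base64.decodebytes(body.encode("ascii"))
    return raw.decode(charset)


# Hypothetical identifier containing three Japanese characters.
line = "http://dom.ain/\u65e5\u672c\u8a9e"
headers, body = label_and_encode(line, "iso-2022-jp")

assert body.isascii()                          # safe for 7-bit transport
assert decode_labelled(headers, body) == line  # nothing is lost on the way
```

The identifier inside the body needs no charset label of its own; it
inherits the one declared for the document as a whole.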


Regards,	Martin.