Re: Globalizing URIs

Keith Moore (moore@cs.utk.edu)
Wed, 16 Aug 1995 18:32:42 -0400


Message-Id: <199508162232.SAA25273@wilma.cs.utk.edu>
From: Keith Moore <moore@cs.utk.edu>
To: Martin J Duerst <mduerst@ifi.unizh.ch>
Cc: moore@cs.utk.edu (Keith Moore), uri@bunyip.com
Subject: Re: Globalizing URIs 
In-Reply-To: Your message of "Wed, 16 Aug 1995 12:54:35 +0200."
             <199508161054.GAA21374@CS.UTK.EDU> 
Date: Wed, 16 Aug 1995 18:32:42 -0400

> >> The problems that remain before we can find a good solution
> >> seem to be the following:
> >> 
> >> 4) Not the URL scheme itself, but the IETF URLs (FTP, HTTP,...)
> >> 	are responsible for introducing semantics.
> >> 	(Jon Knight, Keith Moore)
> >
> >This isn't what I meant.  I meant that the protocols and file servers
> >used by these URLs enforce the notion that a URL is tightly coupled to
> >the filename, and filenames tend to be meaningful.
> 
> Sorry for being imprecise. In any case, the bottom line of this argument
> is "they spoiled it, so let them fix it". And I tried to address that in
> several of my proposals. Still, as they all spoiled it in the same way,
> and for the same reasons, I guess it would be useful if
> we could develop a common suggestion for a way out.

That's fine.  HTTP servers, at least, are fairly free to work out
whatever mapping they wish between URLs and filenames.  (It's less
clear whether FTP servers can do the same.)

> >But you completely left out the hardest problems to solve:
> 
> I left out these problems because I thought I had addressed them
> in enough detail in previous posts so as to show that they are not
> that hard (if they are problems at all).

Sorry, I didn't respond to your previous post where you attempted to
refute my arguments.  I'll respond now.

> >5) Everybody knows how to type the ASCII letters and digits (some
> >better than others), but otherwise, most people do not know how to
> >type characters that aren't used by some language that they're
> >familiar with.
> 
> I do not assume that anybody writing an English document is giving
> it a Japanese document name (and thus, for some protocols, an
> HRI containing some Japanese). In some posts, I assumed this
> implicitly, and in others I have said it explicitly. And I am going
> a little further into details here.
> I assume that anybody in Japan writing an English document for
> a worldwide public will, with just a little bit of common sense,
> and at least on his/her second try if not on the first, understand
> that that document should have an HRI that can be typed by somebody
> that understands only English.

If you want an HRI that is only available to others in the same
country, this is fine.  I'm thinking in terms of worldwide
interoperability.  My understanding is that there is a bit more
uniformity in Japan about which character sets are in use than there
is in other parts of the world.  But just taking Japanese as an
example, what happens if a Japanese-fluent person in the US wants to
read a Japanese document that has a Japanese HRL?  He can read
Japanese but perhaps doesn't have a Japanese keyboard.  How is he
going to type in the HRL?

It's my understanding that there are many countries in which there is
little or no uniformity from one site to another (or even sometimes
within a particular site) as to which character sets are used.  There
are also groups of countries that share a language but follow different
conventions, from one country to the next, for which character set to use.


> >6) URLs expressed in character sets besides ASCII are more vulnerable
> >to translation to other character sets (say ISO 8859/1 to ISO 636-XX)
> >which make the URL invalid.  This translation WILL occur as the result
> >of URLs being mailed around, copied from one application to another,
> >or being printed on paper in one environment and typed in again in a
> >different environment that uses a different charset.
> 
> You have mentioned this problem before, and I have explained why
> I think it is not a real problem in previous mails, but I will repeat
> my arguments here:
> 
> There are vulnerabilities in all character sets (e.g. the ASCII/EBCDIC
> problem for present URLs), but the vulnerability may indeed be said
> to be lower for ASCII than for things that go beyond.
> Nevertheless, these vulnerabilities are now very well understood,
> and satisfactory solutions are available and installed. With respect
> to ISO 8859/1, an HTML web document will display Western European
> accents correctly on hardware such as Mac, PC, and NeXT, even though
> they use different native encodings of accented characters.

Yes, they'll display the same, but does that mean that the users know
how to type them in?  What happens if you copy a section of text from
the web browser (which understands 8859/1) to another application that
understands a Mac charset or a PC code page?

> Also, in Japan, News and Email are working among a wide variety
> of platforms although on various platforms, 3 different encodings
> of the same character repertoire are in use. Apart from issues
> such as those we have with '1', 'I', and 'l', which Japanese
> are aware of as well as we are aware of them, there are no more
> problems today having a Japanese text entered into a system,
> transmitted to another, copied to paper from the screen by hand
> or printed out, and entered again. Otherwise, Japanese information
> processing would indeed be in a very bad state. The same applies
> for other areas.

I'm glad to hear that things are going so well in Japan, but I'm
told that things are not so nice in other areas.



> I don't object to having two levels of indirection, not at all.  But
> as I think experience with URLs has shown, we can't just care for
> the lower level and assume that somebody else will care for the
> upper level. What we as humans want to see and use is
> human-meaningful names, and we will create them on the lower level
> if the upper level is missing.

I agree with that statement.  It's not sufficient to simply say "this
must be solved by the upper layer".  We must BUILD the upper layers.
On the other hand, it will be up to Japanese speakers to build upper
layers (search services that map titles to URLs or URNs) that
understand Japanese.  I don't see any reason why these can't be built
now to link Japanese titles to URLs, and modified later to link
Japanese titles to URNs.

> >> If the semantics indeed change, e.g. if I move to another university
> >> or company, then there is no need for any such URL anymore,
> >> other than historical.
> >
> >No, that's not true.  If people want to find information about *you*,
> >as opposed to where you work, they need a document identifier that
> >continues to point to that information about you even after you change
> >jobs.  So the identifier really needs to be independent of where you
> >work.
> 
> I agree. But there is always a higher level of semantics that you can
> construct. For western culture this might be a little far-fetched,
> but assume if you died, you would like somebody to find information
> about you in your next life. 

Sure.  Or your descendants, or what other people had done with your
creative work.

> So of course I have to decide whether
> I want to have a description of *me* or of me at the present
> workplace, and I might have difficulties of assuring a long life
> for the first case with the present technology, but this doesn't
> change the fact that a URL such as
> 	http://www.ifi.unizh.ch/staff/Martin_Duerst.html,
> when viewed semantically, in fact expresses that this is about
> me as a staff member somewhere. 

Right.  Depending on your point-of-view, this is either a bug or a
feature.

> If somebody expects this
> to continue to exist while knowing that I am no longer working
> there, (s)he makes assumptions that the form of the URL
> was designed to discourage.

Maybe not "designed to discourage" so much as simply "wasn't designed
to encourage".

> >Fortunately, (a) EBCDIC systems are a vanishing breed, and (b) there is
> >(by design) only one possible translation between EBCDIC and ASCII for
> >the characters used in URLs.  The reason that mailing a URL from an
> >ASCII world to an EBCDIC world doesn't cause problems is that URLs
> >WERE DEFINED SO THAT IT WOULD NOT BE A PROBLEM.  You're proposing to
> >undo that careful bit of engineering, not only for ASCII-EBCDIC but
> >for every charset in use in the world.
> 
> The many local encodings besides Unicode/ISO 10646 will
> most probably be a vanishing breed in the future.

The jury is still out on Unicode.  We've got at least one of almost
every major computer system here, but I don't know of a single one of
them that supports Unicode in its current release.  (Pre-release
products don't count.)

There were also people who said for many years that OSI protocols
would take over the world.  I'll believe it when I see it.

> What I was trying with my proposals was just to define HRIs (or
> whatever you call it) so that there would not be a problem,
> or not more problems than with ASCII/EBCDIC, in the sense
> that wherever text is translated or transcribed correctly, the
> URLs go with it.
> If you see any specific problems of why any of the schemes I have
> proposed will fail in this respect (and not just the general concerns
> "they are more volatile" and "it's a difficult business"), I would
> be very glad to hear that.

I've mentioned the specific problems I see, and I think they're pretty
serious.  If you assume that they're transcribable, or you limit the
domain of applicability of HRLs to environments where they are
transcribable, you might be able to address the rest of the concerns.
 
> >As an alternative to needing a directory linking human-readable
> >identifiers to URLs (since we're nowhere close to having one
> >deployed), I propose that there be a standard form of reference to a
> >URL that also includes a document title (or more than one) in a
> >non-ASCII charset.  That is, instead of having multiple encodings for
> >the location of a resource, embellish them with human-readable names
> >for the resource.  These could get passed around in html documents,
> >preserved in hotlists, and printed on business cards.  For instance:
> >
> ><A HREF="http://foo.bar/zot" TITLE="[unicode]some-random-string">
> >
> >Of course, if the user types in the URL by hand, he still ends up
> >typing it in ASCII.  But which is worse, having to type in the URL
> >using unfamiliar characters, or having a situation where most people
> >in the world cannot access your document because the URL is in a
> >charset that they can't use at all?
> 
> Again, as said above, I don't care about "most people in the world",
> I care about "most people that will be able to read and make use
> of the document".

Okay, but be careful about assuming that things are as nice everywhere
as they are in Japan, and make sure you think about speakers of a
particular language living outside of the nice environment for that
language -- sometimes there are large numbers of these.

> Otherwise, your proposal is certainly worth considering, but it has
> two problems:
> First, it is lengthy, which is especially inconvenient for business cards
> and such. Most of what I have proposed is considerably shorter.

This can be dealt with.  If the HRL isn't based on an actual file
name, and the HTTP server performs a mapping, the machine-readable
portion of the HRL can be any name which is sufficiently long to be
unambiguous.  So it could simply be a short string of letters or
digits.  If there were a million different files on a server, you
could name them all using only six decimal digits.  Six alphanumeric
characters chosen to be transcribable (digits 0-9 plus upper case
letters A-Z excepting I, O, and Q) would give 33**6 -- about 1.3
billion -- different names: a practically infinite space.
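As a back-of-the-envelope check on that arithmetic (the alphabet is
the one described above; the `new_document_id` helper is illustrative,
not part of any proposal):

```python
import secrets

# Transcribable alphabet from the text above: digits 0-9 plus
# upper-case A-Z, excluding the confusable I, O, and Q.
ALPHABET = "0123456789" + "".join(
    c for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ" if c not in "IOQ"
)
assert len(ALPHABET) == 33

# Six characters give 33**6 distinct names -- about 1.3 billion.
print(33 ** 6)  # 1291467969

def new_document_id(length=6):
    """Draw a random document-id from the transcribable alphabet."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```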

So when comparing an HRL that used *only* human-readable characters,
with one that used human-readable characters for display and an ASCII
subset for actually accessing the object, the additional overhead
would appear to be only a few characters: the six characters for
machine readable use, and the characters used to separate the two
parts.  Format them like so:

http://dom.ain/[document-id]/human-readable-string

and just teach the http servers to ignore anything following the
document-id (enclosed in brackets or some such as a hint to the http
server).  Now all you need is tools to check in new documents, assign
unique document-id's to them, and link or copy the documents into the
right directory on the server.
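A minimal sketch of that server-side rule; the bracket convention
follows the example above, while the function name and the sample id
are my own invention:

```python
import re

def extract_document_id(path):
    """Return the bracketed document-id from an HRL path, ignoring
    everything after it, or None if the path carries no such hint."""
    match = re.search(r"\[([^\]/]+)\]", path)
    return match.group(1) if match else None

# The human-readable tail is ignored by the server, so mis-transcribing
# it does no harm.
print(extract_document_id("/[X7K2M4]/human-readable-string"))  # X7K2M4
print(extract_document_id("/plain/old/path"))                  # None
```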

One nice thing about this notation is that even if it's
mis-transcribed (in the human-readable-string), it still works.

Though there's still some effort required to teach people that
they can ignore the human-readable-string if they want to.

Being realistic, this won't help people to remember a particular URL.
But it will help people to be able to look at a URL and recognize what
it is.  It will also allow people to mis-label URLs: they can say
"this is a picture of baby ducks" when it is really kiddie porn.  But
I suppose they can do this already in English, so why not extend this
capability to the rest of the world?

> Second, the "[unicode]" prefix is not exactly needed. If the
> "some-random-string" is in the same encoding as the document as a
> whole, then there is no need to explain the encoding. 

Maybe, maybe not.  If you want the things to survive being emailed
around, I'd still recommend that you encode things in ASCII and
include a charset name (similar to RFC 1522).
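For illustration, an encoded-word in the RFC 1522 style wraps the
title's bytes in pure ASCII together with the charset name.  This
sketch uses base64, and the helper name is mine; real mail software
should use a proper MIME library rather than roll its own:

```python
import base64

def encode_title(title, charset="iso-8859-1"):
    """Wrap a non-ASCII title as an RFC 1522-style encoded-word,
    =?charset?B?base64-of-the-bytes?= -- pure ASCII, so it survives
    being mailed around between systems with different charsets."""
    data = base64.b64encode(title.encode(charset)).decode("ascii")
    return f"=?{charset}?B?{data}?="

print(encode_title("Caf\u00e9"))  # =?iso-8859-1?B?Q2Fm6Q==?=
```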

Keith