Re: Globalizing URIs

Keith Moore (moore@cs.utk.edu)
Mon, 14 Aug 1995 17:09:58 -0400


Message-Id: <199508142110.RAA19670@wilma.cs.utk.edu>
From: Keith Moore <moore@cs.utk.edu>
To: Martin J Duerst <mduerst@ifi.unizh.ch>
Cc: uri@bunyip.com, moore@cs.utk.edu
Subject: Re: Globalizing URIs 
In-Reply-To: Your message of "Mon, 14 Aug 1995 17:02:05 +0200."
             <9508141502.AA06947@mocha.bunyip.com> 
Date: Mon, 14 Aug 1995 17:09:58 -0400

> First, I want to thank the many contributors who have given their
> support. Although some of the quotations and summaries below might
> be somewhat out of context, they clearly show that the view "URLs
> are not supposed to be meaningful, so there is no problem" is not
> justified:

I believe you've misstated several of these arguments, but I'll let
their authors respond.

> The problems that remain before we can find a good solution
> seem to be the following:
> 
> 1) Using meaning in a name increases the chance that the name
> 	will become invalid.
> 	(Karen Sollins, Keith Moore, Paul Hoffman)
> 2) There should be only one form of URL.
> 	(Keith Moore, Paul Hoffman)
> 3) There should be something human-readable, but it should/might be
> 	on another layer than URIs.
> 	(Karen Sollins, John Curran)
> 4) Not the URL scheme itself, but the IETF URLs (FTP, HTTP,...)
> 	are responsible for introducing semantics.
> 	(Jon Knight, Keith Moore)

This isn't what I meant.  I meant that the protocols and file servers
used by these URLs enforce the notion that a URL is tightly coupled to
the filename, and filenames tend to be meaningful.

But you completely left out the hardest problems to solve:

5) Everybody knows how to type the ASCII letters and digits (some
better than others), but otherwise, most people do not know how to
type characters that aren't used by some language that they're
familiar with.

6) URLs expressed in character sets besides ASCII are more vulnerable
to translation to other character sets (say ISO 8859-1 to ISO 646-XX),
which makes the URL invalid.  This translation WILL occur as the result
of URLs being mailed around, copied from one application to another,
or being printed on paper in one environment and typed in again in a
different environment that uses a different charset.
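
A minimal sketch of the failure described in 5) and 6), assuming a
hypothetical URL with one non-ASCII letter in its path.  Code page 437
stands in for "a different charset" on the receiving side; it is an
illustrative choice, not one of the charsets named above.

```python
# Hypothetical URL (not from the original message) with one non-ASCII letter.
url = "http://foo.bar/caf\u00e9"

# The sender's environment stores the URL as ISO 8859-1 bytes.
wire_bytes = url.encode("iso8859-1")

# The receiver's environment assumes another 8-bit charset (here code page 437).
received = wire_bytes.decode("cp437")

# The ASCII part survives because both charsets agree on the ASCII range...
ascii_part_intact = received.startswith("http://foo.bar/caf")
# ...but the final character has changed, so the URL now names a different path.
url_intact = received == url

print(repr(url), "->", repr(received))
print("ASCII part intact:", ascii_part_intact)
print("Whole URL intact:", url_intact)
```

No error is raised anywhere in this round trip; the URL is silently
altered, which is what makes this kind of translation hard to notice.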


> I would like first to address 1), because here I still have my doubts:
> What is the difference between a disappearing telephone number
> (no semantics) and a disappearing postal address (semantics)?

Not much.  But the postal address isn't a good analogy to the URL,
because people don't generally choose their own postal addresses.

> What is the use of trying to achieve "eternal life" for references
> when the referred resources themselves may have a short life?

Many of the resources do have a short life.  But those that have a
long life are important.

> As far as what appears on namecards, in newspapers, and otherwise
> in human-readable form, it is preferably something semantically
> meaningful, with the "location" part of it eliminated.
> It is the location part, and not the semantic part, that is
> affected when the resource is renamed or relocated.

You want resource names that aren't tied to either location or
meaning.  The location can change over time, and so can the meaning
that we assign to something.  You need two levels of indirection:
human-meaningful names -> resource name -> locations.
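
The two levels of indirection can be sketched with two lookup tables.
This is only an illustration: the names, the "res:" identifiers, and
the URLs are all hypothetical, and a real system would need a deployed
directory service rather than in-memory dictionaries.

```python
# Level 1: human-meaningful names -> stable resource names.
# Level 2: stable resource names -> current locations.
# Every name, identifier, and URL below is hypothetical.
names_to_resource = {
    "Example Project annual report": "res:0001",
    "Annual report (Example Project)": "res:0001",
}
resource_to_locations = {
    "res:0001": ["http://old-host.example/pub/reports/annual.html"],
}

def resolve(human_name):
    """Follow both levels and return the current list of locations."""
    resource = names_to_resource[human_name]
    return resource_to_locations[resource]

before = resolve("Example Project annual report")

# The file server is reorganized: only the second table changes.
resource_to_locations["res:0001"] = ["http://new-host.example/docs/annual.html"]

after = resolve("Example Project annual report")
print(before, "->", after)
```

The human-meaningful name and the resource name both stay the same
across the move; only the location entry is updated.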

But at least part of the "stale URL" problem is more subtle: we don't
give URLs names that are meaningful to outsiders as much as we give
them names that are meaningful to their maintainers.  They reflect the
structure of the file systems on which they reside.  These structures
(and therefore the URLs) get changed every so often as file servers
get re-organized, even though neither the files themselves, nor the
meanings of the files, changed.

And I agree that the web would not have gotten this far without having
URLs that contain human-meaningful strings.  But we're now seeing the
limits of that approach, and it's pretty clear that the web needs to
evolve past it.

> If the semantics indeed change, e.g. if I move to another university
> or company, then there is no need for any such URL anymore,
> other than historical.

No, that's not true.  If people want to find information about *you*,
as opposed to where you work, they need a document identifier that
continues to point to that information about you even after you change
jobs.  So the identifier really needs to be independent of where you
work.

> The second point above (there should only be one form of
> URL) is quite understandable. But first, it should be noted
> that even at present, there is more than one representation.
> The same URL is represented differently on EBCDIC systems
> than on ASCII systems, although it looks the same on both
> systems. Nobody has objected to this, but it is very similar
> in structure to what I described happening to a "nice" form
> of a URL.

Fortunately, (a) EBCDIC systems are a vanishing breed, and (b) there is
(by design) only one possible translation between EBCDIC and ASCII for
the characters used in URLs.  The reason that mailing a URL from an
ASCII world to an EBCDIC world doesn't cause problems is that URLs
WERE DEFINED SO THAT IT WOULD NOT BE A PROBLEM.  You're proposing to
undo that careful bit of engineering, not only for ASCII-EBCDIC but
for every charset in use in the world.
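
A small check of the ASCII-EBCDIC point, assuming Python's "cp037" and
"cp500" codecs as two representative EBCDIC code pages (an illustrative
choice).  It covers only the characters in one short URL, not the full
set that URLs permit.

```python
# A short URL that uses only letters, ':', '/', and '.'.
url = "http://foo.bar/zot"

ascii_bytes = url.encode("ascii")

# Two EBCDIC code pages for which Python ships codecs.
ebcdic_037 = url.encode("cp037")
ebcdic_500 = url.encode("cp500")

# The byte values differ from ASCII...
differs_from_ascii = ebcdic_037 != ascii_bytes
# ...but both EBCDIC code pages assign these characters the same bytes,
same_in_both_ebcdic = ebcdic_037 == ebcdic_500
# and translating back yields the identical URL.
round_trips = (
    ebcdic_037.decode("cp037") == url and ebcdic_500.decode("cp500") == url
)

print("Differs from ASCII bytes:", differs_from_ascii)
print("Same in both EBCDIC code pages:", same_in_both_ebcdic)
print("Round-trips unchanged:", round_trips)
```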

As an alternative to needing a directory linking human-readable
identifiers to URLs (since we're nowhere close to having one
deployed), I propose that there be a standard form of reference to a
URL that also includes a document title (or more than one) in a
non-ASCII charset.  That is, instead of having multiple encodings for
the location of a resource, embellish them with human-readable names
for the resource.  These could get passed around in html documents,
preserved in hotlists, and printed on business cards.  For instance:

<A HREF="http://foo.bar/zot" TITLE="[unicode]some-random-string">
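
As a rough sketch of how software might consume such a reference, the
following uses Python's standard html.parser module to pull the ASCII
URL and the human-readable title out of the anchor above.  The class
name TitledLinkParser is hypothetical, and the "[unicode]" prefix is
treated as opaque text because its syntax is not defined here.

```python
from html.parser import HTMLParser

class TitledLinkParser(HTMLParser):
    """Collect (href, title) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag == "a":
            attr_map = dict(attrs)
            if "href" in attr_map:
                # The title is optional; it is None when absent.
                self.links.append((attr_map["href"], attr_map.get("title")))

parser = TitledLinkParser()
parser.feed('<A HREF="http://foo.bar/zot" TITLE="[unicode]some-random-string">')
parser.close()

href, title = parser.links[0]
print("Fetch using:", href)
print("Show to the user:", title)
```

The URL used for retrieval stays in ASCII, while the title is free to
carry whatever charset the display layer can handle.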

Of course, if the user types in the URL by hand, he still ends up
typing it in ASCII.  But which is worse, having to type in the URL
using unfamiliar characters, or having a situation where most people
in the world cannot access your document because the URL is in a
charset that they can't use at all?

Keith