Re: Globalizing URIs

Martin J Duerst (mduerst@ifi.unizh.ch)
Wed, 16 Aug 1995 12:54:35 +0200 (MET DST)


Message-Id: <9508161054.AA02294@mocha.bunyip.com>
Subject: Re: Globalizing URIs
To: moore@cs.utk.edu (Keith Moore)
Date: Wed, 16 Aug 1995 12:54:35 +0200 (MET DST)
Cc: mduerst@ifi.unizh.ch, uri@bunyip.com, moore@cs.utk.edu
In-Reply-To: <199508142110.RAA19670@wilma.cs.utk.edu> from "Keith Moore" at Aug 14, 95 05:09:58 pm
From: Martin J Duerst <mduerst@ifi.unizh.ch>

Keith Moore responded to my posting:

>> First, I want to thank the quite many contributors for the
>> support they have given. Although some of the quotations
>> and summaries below might be somewhat out of context, it
>> clearly shows that the view "URLs are not supposed to be
>> meaningful, so there is no problem" is not justified:
>
>I believe you've misstated several of these arguments, but I'll let
>their authors respond.

I'm happy to tell you that I have not received any such messages
up to now.


>> The problems that remain before we can find a good solution
>> seem to be the following:
>> 
>> 4) Not the URL scheme itself, but the IETF URLs (FTP, HTTP,...)
>> 	are responsible for introducing semantics.
>> 	(Jon Knight, Keith Moore)
>
>This isn't what I meant.  I meant that the protocols and file servers
>used by these URLs enforce the notion that a URL is tightly coupled to
>the filename, and filenames tend to be meaningful.

Sorry for being imprecise. In any case, the bottom line of this argument
is "they spoiled it, so let them fix it". And I tried to address that in
several of my proposals. Still, as they all spoiled it in the same way,
and for the same reasons, I guess it would be useful if
we could develop a common suggestion for a way out.


>But you completely left out the hardest problems to solve:

I left out these problems because I thought I had addressed them
in enough detail in previous posts so as to show that they are not
that hard (if they are problems at all).

>5) Everybody knows how to type the ASCII letters and digits (some
>better than others), but otherwise, most people do not know how to
>type characters that aren't used by some language that they're
>familiar with.

I do not assume that anybody writing an English document is giving
it a Japanese document name (and thus, for some protocols, an
HRI containing some Japanese). In some posts I assumed this
implicitly, and in others I have said it explicitly. And I will go
into a little more detail here.
I assume that anybody in Japan writing an English document for
a worldwide public will, with just a little bit of common sense,
and at least on his/her second try if not on the first, understand
that that document should have an HRI that can be typed by somebody
who understands only English.

Of course, if something like a standard should emerge out of this
discussion, I will take care that the above assumption is
explicitly mentioned, but I guess that even without that, there
is a much bigger chance that users comply with this than that they
comply with such wishful thinking as "URLs shouldn't contain semantics",
even if that is stated very explicitly in the RFCs.
Having the language of the document and the language of its name/
description coincide indeed increases the self-descriptive semantics,
but this is something human beings have little problem with.

The above discussion excluded some special cases, such as the case
that I give a Japanese name to an English document because I want to
withhold it from people who are not able to type in Japanese,
or the case that a user not speaking Japanese wants to view a
Japanese document just for the fun of it. (Haven't seen enough '$'
today? Go and have a look at a Japanese document. In many cases,
when it is displayed with a tool that knows nothing about Japanese
encodings, you will see many '$' on your screen.)
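(The '$' characters come from ISO-2022-JP, which switches character
sets with escape sequences such as ESC $ B; a viewer that silently drops
the invisible ESC byte shows the '$' literally. A minimal sketch,
assuming a modern Python environment with the standard iso2022_jp codec:

```python
# Sketch: ISO-2022-JP switches character sets with escape sequences;
# ESC $ B enters JIS X 0208, ESC ( B returns to ASCII.
text = "日本語"  # "Japanese language"
raw = text.encode("iso2022_jp")
print(raw)
# The byte string starts with b'\x1b$B' and ends with b'\x1b(B'.
# A display that ignores the unprintable ESC (0x1b) bytes shows
# the '$' and '(' of the escape sequences as literal characters.
assert raw.startswith(b"\x1b$B") and raw.endswith(b"\x1b(B")
```

)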
Somewhat more serious examples include the case of a beginner's
course in Japanese for foreigners (where at least the names of
the first few lessons should be in Japanese) and cases where you,
not being able to type Japanese, are asked to fetch and copy a document
with a Japanese name for somebody who will be able to read it.
But I guess we can assume that everybody is connected to the web
anyway, can't we?


>6) URLs expressed in character sets besides ASCII are more vulnerable
>to translation to other character sets (say ISO 8859/1 to ISO 646-XX)
>which make the URL invalid.  This translation WILL occur as the result
>of URLs being mailed around, copied from one application to another,
>or being printed on paper in one environment and typed in again in a
>different environment that uses a different charset.

You have mentioned this problem before, and I have explained in
previous mails why I think it is not a real problem, but I will
repeat my arguments here:

There are vulnerabilities in all character sets (e.g. the ASCII/EBCDIC
problem for present URLs), but the vulnerability may indeed be said
to be lower for ASCII than for things that go beyond it.
Nevertheless, these vulnerabilities are by now very well understood,
and satisfactory solutions are available and installed. With respect
to ISO 8859/1, an HTML web document will display Western European
accents correctly on hardware such as Mac, PC, and NeXT, even though
they use different native encodings of accented characters.
Also, in Japan, news and email are working across a wide variety
of platforms although three different encodings of the same
character repertoire are in use on those platforms. Apart from issues
such as those we have with '1', 'I', and 'l', which the Japanese
are as aware of as we are, there are no problems today in having
a Japanese text entered into a system,
transmitted to another, copied from the screen to paper by hand
or printed out, and entered again. Otherwise, Japanese information
processing would indeed be in a very bad state. The same applies
to other areas.
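The claim that the three common Japanese encodings carry the same
character repertoire can be sketched as follows: the same text survives
a round trip through each of them (an illustration using ordinary
Python codec names, not anything specific to this thread):

```python
# Sketch: the same Japanese text round-trips losslessly through the
# three encodings in common use (JIS, EUC, and Shift-JIS), because all
# three encode the same underlying character repertoire (JIS X 0208).
text = "文書"  # "document"
for codec in ("iso2022_jp", "euc_jp", "shift_jis"):
    encoded = text.encode(codec)       # bytes differ per encoding...
    assert encoded.decode(codec) == text  # ...but the characters survive
print("round trips OK")
```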

So interoperability remains assured among those people and
environments where it is really necessary and useful. And we
don't have to care about the rest; an abstract "interoperability"
doesn't help anyone who will never be able to read the
documents concerned.


>> As far as what appears on namecards, in newspapers, and otherwise
>> in human-readable form, it is preferably something semantically
>> meaningful, with the "location" part of it eliminated.
>> It is the location part, and not the semantic part, that is
>> affected when the resource is renamed or relocated.
>
>You want resource names that aren't tied to either location or
>meaning.  The location can change over time, and so can the meaning
>that we assign to something.  You need two levels of indirection:
>human-meaningful names -> resource name -> locations.

The meaning, in the sense of "what does that document mean to
me", of course might change, e.g. from a nice curiosity to something
extensively studied to something absolutely boring. But the meaning,
in a more abstract sense, i.e. that it is about a certain person, a certain
concept, product, or company, shouldn't change; otherwise the
resource name really should be changed too, or the reader will
be very surprised indeed.

I don't object to having two levels of indirection, not at all.
But as I think experience with URLs has shown, we can't just care
for the lower level and assume that somebody else will care for
the upper level. What we as humans want to see and use is
human-meaningful names, and we will create them on the
lower level if the upper level is missing.


>> If the semantics indeed change, e.g. if I move to another university
>> or company, then there is no need for any such URL anymore,
>> other than historical.
>
>No, that's not true.  If people want to find information about *you*,
>as opposed to where you work, they need a document identifier that
>continues to point to that information about you even after you change
>jobs.  So the identifier really needs to be independent of where you
>work.

I agree. But there is always a higher level of semantics that you can
construct. For Western culture this might be a little far-fetched,
but assume that if you died, you would like somebody to find information
about you in your next life. So of course I have to decide whether
I want to have a description of *me* or of me at my present
workplace, and I might have difficulty assuring a long life
in the first case with present technology, but this doesn't
change the fact that a URL such as
	http://www.ifi.unizh.ch/staff/Martin_Duerst.html,
when viewed semantically, in fact expresses that this is about
me as a staff member somewhere. If somebody expects this
to continue to exist while knowing that I no longer work
there, (s)he is making assumptions that the form of the URL
was designed to discourage.


>Fortunately, (a) EBCDIC systems are a vanishing breed, and (b) there is
>(by design) only one possible translation between EBCDIC and ASCII for
>the characters used in URLs.  The reason that mailing a URL from an
>ASCII world to an EBCDIC world doesn't cause problems is that URLs
>WERE DEFINED SO THAT IT WOULD NOT BE A PROBLEM.  You're proposing to
>undo that careful bit of engineering, not only for ASCII-EBCDIC but
>for every charset in use in the world.

The many local encodings besides Unicode/ISO 10646 will
most probably be a vanishing breed in the future.
What I was trying to do with my proposals was precisely to define HRIs
(or whatever you call them) so that there would not be a problem,
or no more problems than with ASCII/EBCDIC, in the sense
that wherever text is translated or transcribed correctly, the
URLs go with it.
If you see any specific reasons why any of the schemes I have
proposed will fail in this respect (and not just the general concerns
"they are more volatile" and "it's a difficult business"), I would
be very glad to hear them.
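One way such a scheme can work (a sketch only, not necessarily
identical to any of the proposals in this thread, though it is the
approach later taken for internationalized identifiers): encode the
characters in a common form such as UTF-8 and percent-escape the bytes,
so the wire form stays within the safe ASCII subset while the
human-readable form travels with the text. The name below is a
hypothetical example:

```python
from urllib.parse import quote, unquote

# Sketch: a human-readable name is carried over the wire in the safe
# ASCII subset by encoding it as UTF-8 and percent-escaping the bytes.
name = "Martin_Dürst"          # hypothetical document name
wire = quote(name)             # percent-escaped, pure ASCII
print(wire)                    # → Martin_D%C3%BCrst
assert wire.isascii()
assert unquote(wire) == name   # the round trip restores the name
```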


>As an alternative to needing a directory linking human-readable
>identifiers to URLs (since we're nowhere close to having one
>deployed), I propose that there be a standard form of reference to a
>URL that also includes a document title (or more than one) in a
>non-ASCII charset.  That is, instead of having multiple encodings for
>the location of a resource, embellish them with human-readable names
>for the resource.  These could get passed around in html documents,
>preserved in hotlists, and printed on business cards.  For instance:
>
><A HREF="http://foo.bar/zot" TITLE="[unicode]some-random-string">
>
>Of course, if the user types in the URL by hand, he still ends up
>typing it in ASCII.  But which is worse, having to type in the URL
>using unfamiliar characters, or having a situation where most people
>in the world cannot access your document because the URL is in a
>charset that they can't use at all?

Again, as said above, I don't care about "most people in the world",
I care about "most people that will be able to read and make use
of the document".

Otherwise, your proposal is certainly worth considering, but it has
two problems:
First, it is lengthy, which is especially inconvenient for business cards
and such. Most of what I have proposed is considerably shorter.
Second, the "[unicode]" prefix is not really needed. If the
"some-random-string" is in the same encoding as the document as a
whole, then there is no need to declare the encoding. Also, the encoding
is not relevant because the TITLE is not used when accessing the document
over a protocol (remember that in the detailed explanation for your
example of what path a URL might travel, the information in
[] was only used in the last step, when making a request over the
protocol). There are of course cases where the characters in
"some-random-string" will not be encodable in the current
encoding of the containing document, but for these cases, SGML/
HTML has its own mechanisms.
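(The SGML/HTML mechanism meant here is the numeric character reference.
A sketch of producing one for a character that the containing document's
own encoding cannot carry, using Python's 'xmlcharrefreplace' error
handler, which emits exactly this form:

```python
# Sketch: SGML/HTML numeric character references let a document carry
# characters its own encoding cannot represent. Here a kanji is turned
# into its decimal character reference for an ASCII-only document.
title = "山"  # U+5C71, not representable in ASCII or ISO 8859/1
ascii_form = title.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(ascii_form)  # → &#23665;
assert ascii_form == "&#23665;"
```

)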

Regards,	Martin.