Re: Globalizing URIs

Martin J Duerst (mduerst@ifi.unizh.ch)
Mon, 14 Aug 1995 17:02:05 +0200 (MET DST)


Message-Id: <9508141502.AA06947@mocha.bunyip.com>
Subject: Re: Globalizing URIs
To: uri@bunyip.com
Date: Mon, 14 Aug 1995 17:02:05 +0200 (MET DST)
From: Martin J Duerst <mduerst@ifi.unizh.ch>

Having been away over the weekend and unable to answer immediately,
I would first like to thank everybody for all the comments on
the problem of globalizing URIs.

This mail consists of three parts: A collection of supportive
comments, an analysis of the critical comments, and some
proposals for further action.


First, I want to thank the many contributors who have expressed
their support. Although some of the quotations and summaries
below might be somewhat out of context, they clearly show
that the view "URLs are not supposed to be meaningful, so
there is no problem" is not justified:

Paul Hoffman:
	No one is paying attention to the fact that URLs should
	not be meaningful (with supportive data).
Roy T. Fielding:
	Meaningful human-readable URIs are easier to correct (by humans)
	than long number sequences with error detection.
	Error correction for long number sequences is too expensive.
Stu Weibel:
	Numbers are not suited to human processing.
Jim Conklin:
	URLs accepted and used because they can be remembered because
	they intentionally contain semantics.
Roy T. Fielding:
	Transcribability (brought by meaningful URLs) responsible for
	the success of the WWW.
Jon Knight, Roy T. Fielding:
	URLs are here to stay. URNs only for resources with long "shelf life".
Karen R. Sollins:
	I am very much in favor of human friendly naming schemes,
	in fact lots of them.
John Curran:
	Internet should provide equal functionality to users of
	all languages.


The problems that remain before we can find a good solution
seem to be the following:

1) Using meaning in a name increases the chance that the name
	will become invalid.
	(Karen Sollins, Keith Moore, Paul Hoffman)
2) There should be only one form of URL.
	(Keith Moore, Paul Hoffman)
3) There should be something human-readable, but it should/might be
	on another layer than URIs.
	(Karen Sollins, John Curran)
4) Not the URL scheme itself, but the IETF URLs (FTP, HTTP,...)
	are responsible for introducing semantics.
	(Jon Knight, Keith Moore)

I would like to address 1) first, because here I still have my doubts:
What is the difference between a disappearing telephone number
(no semantics) and a disappearing postal address (semantics)?
What kind of redirecting mechanisms are there that can be applied
to the former, but not to the latter?
What is the use of trying to achieve "eternal life" for references
when the referred resources themselves may have a short life?
Isn't the problem rather that semantic and location information
are mixed? For example, a hypothetical URL
	http://www.ifi.unizh.ch/staff/Martin_Duerst.html
mixes both the information that this is a document about me, being a
staff member at the "Ifi" (Institut fuer Informatik, German for Dept.
of Computer Science) at the University of Zurich (ZH being the
abbreviation for Zurich used on number plates of cars) in Switzerland
(CH again being the abbreviation of Switzerland used on cars, standing
for "Confoederatio Helvetica", the Latin name of Switzerland),
and the information that this is a file with some name in some
directory on some host.
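
A minimal Python sketch of this mixing, using only the standard
urllib.parse module to take the hypothetical URL apart. Which pieces
count as "semantic" and which as "location" is a reading of the
example above, not a formal classification.

```python
from urllib.parse import urlsplit

# The hypothetical URL from the text above.
url = "http://www.ifi.unizh.ch/staff/Martin_Duerst.html"
parts = urlsplit(url)

# Host labels: "ifi", "unizh" and "ch" carry meaning (department,
# university, country); "www" names one particular host.
host_labels = parts.hostname.split(".")

# Path: a directory and a file name on that host. "staff" and the
# person's name carry meaning; the layout and ".html" are location detail.
directory, _, filename = parts.path.rpartition("/")

print(parts.scheme)
print(host_labels)
print(directory)
print(filename)
```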

As far as what appears on namecards, in newspapers, and otherwise
in human-readable form, it is preferably something semantically
meaningful, with the "location" part of it eliminated.
It is the location part, and not the semantic part, that is
affected when the resource is renamed or relocated.
If the semantics indeed change, e.g. if I move to another university
or company, then there is no need for any such URL anymore,
other than historical. Those who argue that I might want to
have the same URL pointing to my new location are right insofar
as for that purpose, there was too much semantics in my
example. But this doesn't show that no semantics is best, it
just shows that semantically meaningful names, like everything
else, have to be chosen with great care.


The second point above (there should only be one form of
URL) is quite understandable. But first, it should be noted
that even at present, there is more than one representation.
The same URL is represented differently on EBCDIC systems
than on ASCII systems, although it looks the same on both
systems. Nobody has objected to this, but it is very similar
in structure to what I described happening to a "nice" form
of a URL.
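
The EBCDIC/ASCII point can be checked with a small Python sketch.
The code page cp037 is only one of several EBCDIC variants and is
picked here as an assumption for illustration.

```python
url = "http://www.ifi.unizh.ch/"

# The same characters, encoded for an ASCII system ...
ascii_bytes = url.encode("ascii")
# ... and for an EBCDIC system (cp037 chosen only as an example code page).
ebcdic_bytes = url.encode("cp037")

print(ascii_bytes.hex())
print(ebcdic_bytes.hex())

# The byte sequences differ, yet both decode to the same visible URL.
assert ascii_bytes != ebcdic_bytes
assert ascii_bytes.decode("ascii") == ebcdic_bytes.decode("cp037") == url
```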

Now, the problem is that the second point and the third point
don't fit together very smoothly. We can't have only one
representation of something if it should be both non-semantic,
machine oriented, and at the same time nicely human readable.

Of course, the two things don't necessarily have to be on the same
implementation level, but I guess they had better be on the same
"definition level". What I mean by this is that, as we have
seen with URLs, just providing one thing and not the other
will not work as intended.

For the sake of naming, let's change the previously used terms
"raw URI" and "nice URI" to URI and HRI (the latter standing
for Human Readable Identifier). I have no problem with changing
the terminology in this way, but I would very much oppose
statements such as:
	But, we are talking about something below that here, at that
	lower level, please lets distinguish who OUR clients are.
	They should be programs, not people. (Karen Sollins)
unless there actually is a group that takes care of HRIs
(of which I am not aware). Otherwise, the same thing will happen
as with URLs, or a new proposal, with numbers only, will just
not be used very much (not on namecards and such, anyway).


As it has been pointed out, some people think that the
individual URL schemes rather than the general design spoiled
the nice intentions of "no semantics". I don't exactly agree
(as these schemes just used the possibility for "spoiling"
built into the general design), but I think it is a point
that could help us out. The proposals I brought forward
some time ago under the letters B) (Unicode) and C) (server
additions) point in the same direction.

So we could say that we define a new HRI scheme with the
identifier "HFTP", the property that characters from many
languages may be used, and the implementation either along
B), i.e. specifying some form of Unicode/ISO 10646 as the
encoding to be used on the protocol, or along C), i.e. some
(maybe rather trivial) server extension that allows the client
to determine the encoding that should be used in protocol
requests.
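
To make variant B) a little more concrete, here is a minimal Python
sketch. It assumes UTF-8 as the "form of Unicode/ISO 10646" used on
the protocol, which this message does not fix, and the Greek path is
invented for illustration.

```python
# Hypothetical path containing Greek letters (invented for illustration).
path = "/έγγραφα/αρχείο.txt"

# Assumption: UTF-8 is the encoding agreed on for the protocol.
wire_bytes = path.encode("utf-8")

# A server that decodes with the same agreed encoding recovers the name.
received = wire_bytes.decode("utf-8")
assert received == path

# A server guessing another encoding (here ISO 8859-7) sees a different
# name. This is why the encoding must be fixed, as in B), or
# discoverable, as in C).
misread = wire_bytes.decode("iso-8859-7", errors="replace")
assert misread != path
```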

This proposal is along the lines of proposing something
such as
	ftp://utf8-ftp.xxx.yyy.zz
just moving the distinction to another place.

Note that in this case, a canonical internal or semi-external
representation (such as the %HH escapes) would not be defined.
After all, it is an HRI, and therefore human readability counts.
Although some people may doubt it, there is no real problem
of ambiguity here. Greeks can enter Greek letters with as
few errors as Americans can enter Latin (ASCII) letters,
and this applies to other places around the world, too.

There could of course be services implementing these new HRI
schemes on top of existing schemes, with proxies and such.
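
One way such a proxy might translate identifiers, as a sketch under
stated assumptions: only the scheme name "hftp" comes from the
proposal above. The mapping to ftp with UTF-8 %HH escapes, the
function name hftp_to_ftp and the host ftp.example.org are
hypothetical.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit

def hftp_to_ftp(hri: str) -> str:
    """Map a hypothetical hftp:// identifier to a plain ftp:// URL.

    Assumption: the path is carried as UTF-8 and written with %HH escapes.
    """
    parts = urlsplit(hri)
    if parts.scheme != "hftp":
        raise ValueError("not an hftp identifier")
    # quote() encodes non-ASCII characters as UTF-8 by default.
    escaped_path = quote(parts.path, safe="/")
    return urlunsplit(("ftp", parts.netloc, escaped_path, "", ""))

raw = hftp_to_ftp("hftp://ftp.example.org/αβγ.txt")
print(raw)  # ftp://ftp.example.org/%CE%B1%CE%B2%CE%B3.txt

# The human-readable form can be recovered from the escaped one.
assert unquote(urlsplit(raw).path) == "/αβγ.txt"
```

The human-readable form stays what people write down; the escaped
form exists only between the proxy and the existing server.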

This would be a start to a distinction of HRIs and URIs, a step
in the right direction without doubt, and a solution to
some of the current problems.

Critics might argue that this is too small a step, and that
it just perpetuates problems that should be solved more
generally, but in any case, there does not seem to be much
belief in a better solution that would fully replace the
existing one.


Regards to all,		Martin.