Re: Problems I cannot get past with using relative URIs for identity. from Tim Berners-Lee on 2000-05-20 (xml-uri@w3.org from May 2000)

From: Tim Berners-Lee <timbl@w3.org>
Date: Sat, 20 May 2000 12:58:41 -0400
To: "Ray Whitmer" <ray@xmission.com>
Cc: <xml-uri@w3.org>
Message-ID: <000001bfc27c$ceeb17b0$29eb5c8b@ridge.w3.org>
-----Original Message-----
From: Ray Whitmer <ray@xmission.com>
To: Tim Berners-Lee <timbl@w3.org>
Cc: xml-uri@w3.org <xml-uri@w3.org>
Date: Friday, May 19, 2000 2:34 AM
Subject: Re: Problems I cannot get past with using relative URIs for
identity.


>On Thu, 18 May 2000, Tim Berners-Lee wrote:
>> This is your rhetoric.  Retrieval is useful but identity is important too.
>> However, many schemes such as uuid: and mid: identify primarily and any
>> URI-based retrieval is added on afterward.
>
>URIs simply do not have the ability to identify resources.  They locate
>them.


TILT!?!

If this is the sort of maligning of URIs which goes on then no wonder some
folks are running scared.  Look at the uuid: URI.  It is called a Universal
Unique Identifier, and is basically equivalent to the "guid". It was invented
by Apollo and identifies things in a lot of Domain software and Domain RPC,
and also in Microsoft systems. It is constructed using a unique number for a
machine, typically from an ethernet card, and a timestamp and a random
number, I think.  It is typically used to identify program module
interfaces, local or remote.  (Very like namespaces in some ways.)
All it gives you is identity.  You can't look them up in any global system.
You might have a registry on your local machine of ones you have loaded.
You might use a broker. In what way do you say that this is a location?
There seems to be a huge disconnect of vocabulary or something.
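
A minimal sketch of the uuid point, assuming Python's standard uuid module
as a stand-in for the Apollo/Microsoft implementations mentioned above. The
helper name new_identifier is hypothetical. Python prints the name in the
later-registered urn:uuid: form rather than the bare uuid: prefix used in
this message, and it combines the node id with a timestamp and a clock
sequence. Nothing here contacts a server or a global registry: the
identifier exists as soon as it is minted.

```python
import uuid


def new_identifier() -> uuid.UUID:
    """Mint a version-1 UUID from the node id, a timestamp and a clock sequence."""
    return uuid.uuid1()


first = new_identifier()
second = new_identifier()

print(first.urn)         # a name of the form urn:uuid:...
print(first != second)   # two mintings give two distinct identifiers
```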


[...]
>Perhaps you would like to play a game of trying to guess whether several
>URIs really identify the same or different resources.  URIs are interpreted
>by servers which have a lot of freedom to establish many-to-many
>relationships between URLs and the resources they retrieve.


The publisher - owner - of a name can indeed _in principle_ abuse it by
using it for one thing one day and something else the next. And of course
some things, like the current weather at Logan airport or the xml-uri
archive list, are justifiably living documents.  They change in value while
the abstract resource stays identified by the same URI.   This is a good way
to represent the xHTML namespace, which is defined (by the HTML WG) to be
whatever is in their latest spec.  They undertake to be aware of
backward-compatibility issues, they say.

The HTTP specification does not mandate particular policies.
Another spec could mandate them on top of HTTP. For example, P3P determines
that when a URI is used to refer to a privacy policy then that document may
never be changed.  The act of using P3P commits the publisher to persistence.


>> The RFC defines how things actually work which is what users are all used
>> to (and happens to basically match algorithms used in unix for decades).
>
>In some cases yes, in some cases no.  It encourages case insensitivity, for
>example -- it wasn't clear to me that it mandated it.

In HTTP, the path is case sensitive as far as the client goes. Many servers
will return identical information about two resources which differ only in
case.

>Unix does not allow for it.

The sort of OS your web server runs on has nothing necessarily to do with
the URI space you provide.
You do not have to map HTTP names directly to file names!
(sometimes I wish I had restricted the elements of a path to digits! ;-)
For example, Apache can be set up to do case insensitivity on unix.

>This means that the client does not know whether the case may distinguish
>different resources or not.

The client will never be certain that two resources are not the same unless
it has discovered inconsistent facts about them.  You can't tell from a name
that two things are different - only that they are the same.
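
The asymmetry can be sketched as a three-valued comparison. This is only an
illustration in Python: the function name same_resource and the example.org
URIs are hypothetical, and both arguments are assumed to be absolute URIs
already.

```python
from typing import Optional


def same_resource(uri_a: str, uri_b: str) -> Optional[bool]:
    """Compare two absolute URIs as plain strings.

    Returns True when the names are identical, and None ("cannot tell")
    otherwise. It never returns False, because two different names may
    still have been given to one resource.
    """
    if uri_a == uri_b:
        return True
    return None


print(same_resource("http://example.org/Doc", "http://example.org/Doc"))
print(same_resource("http://example.org/Doc", "http://example.org/doc"))
```

The second call differs only in case. A client gets no answer from the names
alone, whatever a particular server happens to do with them.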

> It does not dictate a character set for escaped
>characters, so that upper/lower case equivalence could be established if it
>clearly dictated that.


The function of "identity" is that if you and I refer to a resource by a
particular value of the same (absolute!) URI then we are talking about the
"same" resource, where "same" is a function we could discuss at length, but
I don't propose to. Suffice it to say that URIs give you whatever answer you
need depending on the engineering requirements for your particular situation.

>saying http: versus https: on the same server may or may not identify
>equivalent resources, but generally only changes the procedure for
>accessing the resource.

I agree the https: URI space has that problem.  It should have been a
negotiated option on the http protocol so that the name of a resource didn't
change if you wanted to access it securely.  (https: is not a W3C spec!)

>It does not identify equivalence of symbolic links and other
>Unix things that may make two URIs equivalent.  It does not deal with
>caches which may mean that the location the file is retrieved from may not
>be the true identity, and so on.  It represents a retrieval algorithm, not
>a unique identifier.


It is unambiguous in that a given URI only refers to one resource.  But many
URIs may refer to the same resource.  This is always the case.  Even with
FPIs -- ISO can make an identical standard out of W3C's xHTML, but give it
their own FPI.
That is life. Someone can give another name to your resource.
In practice of course for standard namespaces (and most resources) everyone
agrees on an exact URI to use as it is stated in the standard.

>> I assume this idea of "partial absolutization" is based on the
>> misconception above. There is no such thing defined. There is only
>> absolutization.
>
>On the contrary, it came up many times during the discussion how much
>the client would be required to do.


Well, the idea is not defined in the URI specs. It doesn't exist.

>> Absolutely not!  It is essential that the absolutization can be done as a
>> simple string function without any knowledge of any specific schemes.
>
>Requiring that does not give enough knowledge of the specifics of a
>protocol to make it possible.


Ray, absolutization is a well defined string function. Period.
You can talk about canonicalization or server-side knowledge of URI
equivalence but they are NOT covered by "absolutization".
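
As an illustration only, here is that string function using Python's
urllib.parse.urljoin as a stand-in for the algorithm in the URI RFC. The
wrapper name absolutize and the example.com base are hypothetical. Note one
difference from the claim above: Python's implementation keeps an internal
list of schemes it treats as hierarchical, so it is not entirely
scheme-agnostic.

```python
from urllib.parse import urljoin


def absolutize(base: str, reference: str) -> str:
    """Resolve a URI reference against a base URI by string rules alone."""
    return urljoin(base, reference)


BASE = "http://www.example.com/a/b/doc.xml"  # hypothetical base URI

print(absolutize(BASE, "schema.xsd"))   # sibling of the base document
print(absolutize(BASE, "../other/ns"))  # one level up, then down
print(absolutize(BASE, "#frag"))        # fragment on the base document
print(absolutize(BASE, "Schema.XSD"))   # case is kept exactly as written
```

No network access happens and no server is consulted. Case folding,
symbolic links and caches are outside what the function does.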

>> You are using the term absolutization in a manner different from the way
>> it has been used on this list.  I have seen no one argue for involving the
>> server in the processes. Many URI schemes don't have a concept of a
>> server.
>
>The point was, without the server to do some canonicalization, the
>absolutization described in the RFC is not sufficient to say whether two
>URIs identify the same resources or different resources.

You never can, in general.

>This point was
>brought up in several discussions I had, and always represented as a
>trivial form of absolutization that should occur, not performing most
>mappings of a URL server to identify the actual resource to return.
>
>> On the contrary, very often systems allow separate access to the actual
>> attribute value and to the interpreted object.  xHTML editors typically
>> preserve the choice of relative or absolute URI used in an A HREF, but
>> adjust the value when a copy of the document is saved to a new URI.  I
>> would suggest that this behaviour be the norm.
>
>I should have said, the primary / winning meaning.  As I went on to
>describe, we already have one such duality in namespaces, which causes a
>fair amount of complexity and disagreement -- the prefix versus the URI.


>> > What is preserved when nodes are
>> >moved within the hierarchy -- the absolutized concept of type that is so
>> >important for specialized DOM implementations, or the relative info?  You
>> >could choose:
>>
>> You are talking about an XML node moving within the DOM tree?
>> Or a document being moved through URI space?
>> In all cases the absolute URI should be preserved.  To preserve the choice
>> of relative URI or absolute URI is important in practice too.
>> I am hoping that the same code is going to be used for all URIs of course.
>
>But when you serialize, only one can be preserved, unless they happen to
>exactly coincide.

If a link from a document is moved, then clearly the value you get from
absolutizing the URIref should be the same before and after the move. It
should be represented as a relative URI after if and only if it was
represented as a relative URI before.  That is what I mean.  It is of course
really good not to "move" documents.
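
A rough sketch of that rule in Python, under several assumptions: the helper
name rewrite_link and the example.com URIs are hypothetical, only the
document moves (not the link target), and paths are treated as simple
slash-separated hierarchies. It does not special-case same-document
references such as "#sec" or paths ending in a slash.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def rewrite_link(old_base: str, new_base: str, reference: str) -> str:
    """Return the reference to store after a document moves.

    The absolutized target stays the same; the stored form stays relative
    only if it was relative before the move.
    """
    if urlsplit(reference).scheme:
        return reference  # absolute before, so absolute after
    target = urlsplit(urljoin(old_base, reference))
    new = urlsplit(new_base)
    if (target.scheme, target.netloc) != (new.scheme, new.netloc):
        # Sketch limitation: across hosts we fall back to the absolute form.
        return target.geturl()
    start = posixpath.dirname(new.path) or "/"
    relative = posixpath.relpath(target.path or "/", start)
    if target.query:
        relative += "?" + target.query
    if target.fragment:
        relative += "#" + target.fragment
    return relative


OLD = "http://www.example.com/a/b/doc.html"
NEW = "http://www.example.com/a/c/d/doc.html"

moved = rewrite_link(OLD, NEW, "img/pic.png")
print(moved)
print(urljoin(NEW, moved) == urljoin(OLD, "img/pic.png"))
```

The final comparison checks the invariant stated above: absolutizing the
stored reference gives the same value before and after the move.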

>What is the point of having used relative URIs, if
>programmatically moving it to a new location does not cause the resolution
>to be different?  I thought that was the point, no?


Well, it depends on whether you move the resource pointed to, of course. I
was assuming you only move one resource.

[..]
>> Please, if a server returns something when queried for the namespace
>> resource, then that is not an arbitrary resource. The server is controlled
>> by the owner of the name and the document returned is therefore
>> definitive. It is not arbitrary.
>
>The point of XML to me (and SGML before that) is many processing models for
>the same information.  Schemas, DTDs, and other syntactic controls tell
>nothing about the meaning, and may compete as other standards do.  It is
>unreasonable, IMO, to expect to find one particular one at the end of a
>namespace declaration, or to expect most applications to pay attention to
>it.  Only extremely-general purpose applications might look at it.  Most
>only care about transforming in and out of objects with more specific /
>proprietary / local meaning.  If someone changes the syntax, they can
>decide what to do with the unrecognized part: ignore it, pass it along,
>or raise an error.  These non-general specific-purpose applications are
>the ones that interpret the syntax with meaning.  Hence, the transform is
>far more important to the processing recipient than the schema, and the
>proper transform to lend local meaning will be different for different
>recipients.
>
>> To use java class names I say what I said about FPIs.  If you really think
>> they solve problems other URI schemes don't solve, then you can propose
>> that they be a new URI scheme.
>> Then all new designs will be able to take advantage of them.
>
>But then you might insist on absolutizing them, or otherwise insisting that
>they represent a retrieval pointer rather than an identity.


Absolutization has nothing to do with retrieval.  You have to absolutize a
URI reference before you can do anything with it.

I never insist that they represent a retrieval pointer rather than an
identity.
I always say they represent an identity.  Sometimes I shout it, sometimes
I just cry quietly.

HTTP is a protocol for using the identity provided by URIs to actually
tell you about the resource.  It is like a big catalog.  (If I wanted to
ambush your FPI and turn it into a nasty referral pointer I would send you
a catalog file through the mail.  Then you would have a way of dereferencing
it.  And you'd have to think of a whole new naming system because that one
was broken.  Or, you would have to admit that the fact that some things
*can* be looked up doesn't change the fact that they are identifiers.)

>> >And what do you do when multiple purposes conflict?
>>
>> What multiple purposes, and can you give an example of conflict?
>
>I just did.  The schema is largely irrelevant.  It is the transformation
>that is important, which the schema gives little insight about.  Different
>processing models require different transformations.  There are some who
>would like to dictate a single processing model for all data.  They have
>been doing it for years.  AKA the war of proprietary formats.  But they
>have failed to make their solutions reusable because no two sets of
>requirements are equal.


Ah, I see where you are coming from I think.  You argue that because
different people quite rightly want to do different things with the data,
the syntactic constraints will be different.   But in fact, whatever someone
does with the information carried in a document, there are some constraints
on the language in which it is written which are absolutely a property of
the language.  You can use an xHTML file as wallpaper, but it still can't
nest headings and be an xHTML file.  (The same applies to semantics.)

>> Oh, you mean that one schema document should not contain syntax-related
>> and semantic information?  That doesn't seem to be a problem to me.
>> Specification documents do.  Of course
>> one could separate them and refer to one from the other.
>
>I am not convinced that one document can even represent all desired
>syntactical restrictions.  People keep inventing new microparsers for things
>that XML is considered too verbose for, like xpath, svg, many url protocols,
>even things like currency and dates are microsyntaxes, and could be easily
>reduced to simpler things like integer attributes if people were not too
>inventive of shorthand syntaxes.


No one I have heard is suggesting representing *all* the desired syntactical
restrictions.
A schema gives you one set and it may miss some.  There may be some
restrictions which it is not powerful enough to express. As you say, the
substructure of the attribute values is one.

>Are you anticipating creating a meta-schema that transforms all
>microsyntaxes into XML, and then somehow makes all forms equivalent?
>Equivalent to what?


No

>And even then, you are a long ways from giving things meaning.


Indeed, a long way.

>> Could you please indicate the standards groups which have run screaming
>> from URIs?
>
>Look around you.  I will not point to them publicly right now.


You could mail me privately if you like.
I felt you were trying to sway this group against URIs by indicating that
others have found them inadequate, and so I would like you to justify that.

>> I would point out that the web is humming with relative URIs which lend
>> manageability to huge amounts of the data.  Our own website would be
>> very much more difficult to manage without them.
>
>W3C would have to make do like everyone else if the website supported
>dynamic data relying on multidimensional parameterized URI syntax, which is
>where much of the web is going, rather than file-system-like URIs.  Many go
>to great lengths writing relative navigation schemes that have nothing to do
>with the URI RFC, because it deals only with trivial name hierarchies and
>discards parameters.


Indeed, the smart handling of matrix URIs (those which take multiple
parameters a la http://www.example.com/foo;x=1;y=2;z=3) would have done just
that.  They didn't make it into the spec. I bemoan the fact in
http://www.w3.org/DesignIssues/Axioms.html#matrix
but there you go - it did not get into the spec. It would have made the
relative URI absolutization algorithm more complicated - but it would have
made those multidimensional websites much easier!
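
To make the idea concrete, here is a purely hypothetical sketch in Python of
what a matrix-aware relative reference could do. None of this is in the URI
spec: the function name switch_matrix_params and the convention that a
reference like ";y=5" overrides one parameter while the others are held
constant are assumptions for illustration. Under the generic algorithm, a
reference such as ";y=5" instead replaces the entire last path segment.

```python
from urllib.parse import urlsplit, urlunsplit


def switch_matrix_params(base: str, relative: str) -> str:
    """Override matrix parameters in the last path segment of base.

    `relative` is a hypothetical matrix reference such as ";y=5".
    Parameters it does not mention are kept from the base unchanged.
    """
    parts = urlsplit(base)
    head, _, last = parts.path.rpartition("/")
    name, *pairs = last.split(";")
    # Existing name=value pairs, in their original order.
    params = dict(pair.partition("=")[::2] for pair in pairs)
    for pair in relative.lstrip(";").split(";"):
        key, _, value = pair.partition("=")
        params[key] = value
    segment = ";".join([name] + [f"{k}={v}" for k, v in params.items()])
    return urlunsplit(parts._replace(path=f"{head}/{segment}"))


BASE = "http://www.example.com/foo;x=1;y=2;z=3"

print(switch_matrix_params(BASE, ";y=5"))
print(switch_matrix_params(BASE, ";x=9;z=0"))
```

Each parameter is an equal dimension here: changing y does not discard x or
z, which is the behaviour a single path hierarchy cannot express.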

>For example, you might have different parameters describing:
>
>The working group
>The stage of the work
>The publicness of the work
>The type of resource
>The user's desired medium (html, xml, cell phone markup, etc.)
>The user's language
>The user's locale
>
>It is generally desirable to be able to switch any of these relatively
>while holding the others constant.  But the URI scheme which would force
>relativism to occur within a single hierarchy would force you to give
>each item priority rather than allowing them to be equal dimensions, and
>discard everything of lesser priority.  This makes the RFC-based
>relativism unworkable for common uses.  It was designed for legacy file
>systems, which sometimes can capture part of the identity of a resource.


I wouldn't write off hierarchical systems as "legacy".   A lot of people use
trees for a lot of things, even these days.  Sometimes trees are better than
matrix systems for the job.  But I understand your frustration.  It just
didn't occur to us all until it was too late to introduce because of the
number of deployed browsers which would get it wrong.

Tim BL

>Ray Whitmer
>ray@xmission.com
>
Received on Saturday, 20 May 2000 12:57:55 UTC