Re: Problems I cannot get past with using relative URIs for identity. from Ray Whitmer on 2000-05-22 (xml-uri@w3.org from May 2000)

From: Ray Whitmer <ray@xmission.com>
Date: Mon, 22 May 2000 10:14:17 -0600 (MDT)
To: Tim Berners-Lee <timbl@w3.org>
cc: xml-uri@w3.org
Message-ID: <Pine.GSO.4.10.10005220819280.13149-100000@xmission.xmission.com>
On Sat, 20 May 2000, Tim Berners-Lee wrote:

> >On Thu, 18 May 2000, Tim Berners-Lee wrote:
> >> This is your rhetoric.  Retrieval is useful but identity is important
> >> too.  However, many schemes such as uuid: and mid: identify primarily
> >> and any URI-based retrieval is added on afterward.
> >
> >URIs simply do not have the ability to identify resources.  They locate
> >them.
> 
> 
> TILT!?!
> 
> If this is the sort of maligning of URIs which goes on then no wonder some
> folks are running scared.  Look at uuid: URI.  It is called a Universal
> Unique Identifier, and is basically equivalent to the "guid".  It was
> invented by Apollo and identifies things in a lot of Domain software and
> Domain RPC, and also in Microsoft systems.  It is constructed using a unique
> number for a machine, typically from an ethernet card, and a timestamp and a
> random number, I think.  It is typically used to identify program module
> interfaces, local or remote.  (Very like namespaces in some ways.)
> All it gives you is identity.  You can't look them up in any global system.
> You might have a registry on your local machine of ones you have loaded.
> You might use a broker.  In what way do you say that this is a location?
> There seems to be a huge disconnect of vocabulary or something.

I did not intend to malign lesser-known protocols that, unlike http, are designed
as identity codes.  If people were using these, the discussion of relative paths
would be moot, because to solve the identity problem well, they probably
discard most ideas of relative meaning or resolution.  Identity is not about
relative navigation.  If something is based upon relative navigation, it often
has to appeal to some other system to determine whether identity is still the
same.

Protocols like http, https, ftp, etc. locate resources.  They do not identify
them.  This is the reason, as nearly as I can tell, that SGML had the concept
of a public ID, combined with a system ID.

Namespaces seem to be the only production in XML that tries to make a
system identifier stand on its own, rather than using a public identifier
to identify the resource and make it work well with a catalog.

Perhaps the proper solution in the case of namespaces is to allow a declaration
to specify both a public id, which then CAN be used for purposes of identification,
and a system id.  This doesn't solve the issue of what the system id points to,
but those who are concerned with this at least then have the globally-unique 
string they need.
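
A rough sketch of what such a two-part declaration could look like as a data
structure.  The class name, field names, and the example identifiers are all
hypothetical; nothing here is taken from the Namespaces spec.  The only point
is that equality is decided by the public id alone, while the system id is
carried along purely for retrieval.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class NamespaceDecl:
    """Hypothetical namespace declaration with separate identity and location."""
    public_id: str                           # used for identity (compared exactly)
    system_id: str = field(compare=False)    # used only to locate a resource

# Two declarations that name the same namespace but point at different copies.
a = NamespaceDecl("-//Example//NS Orders//EN", "http://example.org/ns/orders")
b = NamespaceDecl("-//Example//NS Orders//EN", "file:///local/cache/orders")
c = NamespaceDecl("-//Example//NS Invoices//EN", "http://example.org/ns/orders")

print(a == b)   # same public id, different locations
print(a == c)   # different public id, same location
```

Moving the resource changes only `system_id`, so comparisons made on the
public id are unaffected.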

> The sort of web OS you run has nothing necessarily to do with the URI space
> you provide.
> You do not have to map HTTP names directly to file names!
> (sometimes I wish I had restricted the elements of a path to digits! ;-)
> For example, Apache can be set up to do case insensitivity on unix.

In practice, URIs are a mapping onto this local space.  The point here,
however, was that the spec allows, even encourages multiple URIs to map onto
the same resource, because all the users of the spec are interested in is
retrieving resources.  If you were interested in identity, then allowing a
case-insensitive mapping onto a case-sensitive system would be an obvious
problem at both ends -- on the client, it leaves the client wondering whether
two URIs that differ only in case represent the same or distinct resources,
and on the server, it leaves the server wondering which resource to return in
the case of collisions, which again leaves the client wondering which
resource was really returned.
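
A minimal sketch of that collision, assuming a hypothetical server that
ignores case when matching request paths against a case-sensitive store.
The paths and the `lookup` helper are invented for illustration.

```python
# Hypothetical case-sensitive store holding two distinct resources.
store = {
    "/docs/Readme.txt": "resource A",
    "/docs/README.txt": "resource B",
}

def lookup(path):
    """Return every stored resource whose name matches path, ignoring case."""
    wanted = path.lower()
    return [body for name, body in store.items() if name.lower() == wanted]

matches = lookup("/docs/readme.txt")
print(len(matches), "candidates:", matches)
```

With more than one candidate, the server has to pick one arbitrarily, and the
client cannot tell from the URI alone which one it received.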

> The function of "identity" is that if you and I refer to a resource by a
> particular value of the same (absolute!) URI then we are talking about the
> "same" resource where "same" is a function we could discuss at length, but
> I don't propose to.  Suffice to say that URIs give you whatever answer you
> need depending on the engineering requirements for your particular
> situation.

They do not.  They lack the characteristics.  Namespaces tried to fix that
by specifying an exact string comparison, only to have it challenged on the
grounds that, because the spec allows alternative ways of referring to a
resource, those alternatives must be supported.  This will make people more
reluctant to rely on URIs in the future for identity, I think.
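
To make the two notions concrete, here is a small sketch contrasting an exact
string comparison with a location-style hint.  The URIs and the two helper
functions are hypothetical; the only library behaviour relied on is that
`urllib.parse.urlsplit(...).hostname` returns the host in lower case.

```python
from urllib.parse import urlsplit

def same_name(a, b):
    """Namespace-style identity: exact, character-for-character comparison."""
    return a == b

def same_host(a, b):
    """Location-style hint: host names compare case-insensitively."""
    return urlsplit(a).hostname == urlsplit(b).hostname

a = "http://example.org/ns/invoice"
b = "http://EXAMPLE.ORG/ns/invoice"

print(same_name(a, b))   # the strings differ
print(same_host(a, b))   # yet both point at the same host
```

The two strings would very likely retrieve the same thing, but as names they
are distinct, which is the whole reason for fixing the comparison rule.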

Layers or specifications would have to be created on top of the 
specification to give good identity characteristics, which seem to be 
incompatible with the existing web's use of URIs.  Identity is NOT location.  
Paths are about location in traditional file systems, not about identity.

Many acknowledge that browsers which accept sloppy HTML have led far fewer
authors to produce proper HTML.  With URIs the case is worse if you are
trying to use them for identity, because there is no "validator" you can run
on a URI to distinguish the proper ones from the bad ones, nor have most
users of URLs made the distinction.

The web would be better off in some cases if the URI lookup for web sites 
were more catalog and identity based (like SGML public IDs), rather than 
location based, because locations change so often.  I do not think that 
requiring support for relative URI paths helps this identity crisis.  I 
think the way to properly do identity is to keep it separate from location.

In my previous uses of XML for ecommerce on the web, the public ID was all
we ever really cared about, because any required validation was coded into 
the application receiving the information, which couldn't be changed or the
application would break.  As such, the DTD or schema contents never
contained interesting information for anyone but the programmer.

Suddenly saying that the identity is less important than the location breaks
lots of the value of XML for simple applications like this.  I know that
you would prefer to say that URI satisfies identity at the same time that it 
satisfies location, but I disagree.

> It is unambiguous in that a given URI only refers to one resource.  But
> many URIs may refer to the same resource.  This is always the case.  Even
> with FPIs -- ISO can make an identical standard out of W3C's xHTML, but
> give it their own FPI.
> That is life. Someone can give another name to your resource.
> In practice of course for standard namespaces (and most resources) everyone
> agrees on an exact URI to use as it is stated in the standard.

I agree that it can detect some cases where identity is likely to be the same,
only based upon the principle that no two identities can occupy the same
location at the same time.

But it is not a standard designed to deal well with identity, because simple
shifts in location can easily cause the identity to be thought to be different
when it is not.

Relative paths make the problem more difficult, especially when you say that 
the path of the document may form part of the URI.  Often, the path of the 
document is delivered to you by some other part of the system, which has no 
idea that the path is being used for any purpose other than to save / 
retrieve the document.

Mac OS, for example, I believe requires you to use a binary identifier
to globally find a file, because the paths often change.

There are often many possible representations to choose from.  To think
that someone discovering a document through a mechanism where relative paths
were not defined or operated would have everything break is sad.  You have
introduced OS dependencies.  A document does not contain its own name, so
it has no control over the mechanism that will be used to retrieve it, or how
relative paths will be resolved.
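
A short sketch of that dependency, using Python's standard
`urllib.parse.urljoin` to resolve one relative reference against two base
URIs.  The relative reference and both document locations are made up for
the example.

```python
from urllib.parse import urljoin

# A relative namespace name as it might appear inside a document.
relative_ns = "../schemas/order"

# The same document, retrieved through two hypothetical locations.
base_one = "http://example.org/a/b/doc.xml"
base_two = "http://mirror.example.net/x/doc.xml"

abs_one = urljoin(base_one, relative_ns)
abs_two = urljoin(base_two, relative_ns)

print(abs_one)
print(abs_two)
print(abs_one == abs_two)
```

The document's bytes are identical in both cases, yet the resolved name
differs because the retrieval path, which the document does not control,
became part of it.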

> Ray, absolutization is a well defined string function. Period.
> You can talk about canonicalization or server-side knowledge of URI
> equivalence but they are NOT covered by "absolutization".

It would have made my arguments clearer had I referred to a more
complete process as "canonicalization", even though the word
"absolutization" is not defined in the RFC, hence there have been many
discussions of what it meant.

When people suggested it should form a basis for identity, then the
discussion naturally became, what did it need to do, if it were to do
that.  What is described in the RFC does not seem to be an adequate basis 
for identity.

> >> You are using the term absolutization in a manner different from the
> >> way it has been used on this list.  I have seen no one argue for
> >> involving the server in the processes. Many URI schemes don't have a
> >> concept of a server.
> >
> >The point was, without the server to do some canonicalization, the
> >absolutization described in the RFC is not sufficient to say whether two
> >URIs identify the same resources or different resources.
> 
> You never can, in general.

The reason you have seen no one arguing for involving the server in the
process is because it is impractical.  Those who understand this issue are
arguing for an exact string comparison which can be used to establish
identity or not, and place the location of processing-specific resources
into a separate attribute.

Identity was not a requirement of URIs, so naturally it is not sufficient
to establish the identity (excluding any protocols for which this WAS
a requirement).  Where identity IS a requirement, it is quite possible.

By trying to meld these together in a situation where identity was never 
a requirement before, you hurt both identity, and the ability to retrieve
the correct resource.

> I never insist that they represent a retrieval pointer rather than an
> identity.
> I always say they represent an identity.  Sometimes I shout it, sometimes
> I just cry quietly.

If that were the case, then the catalog-based public identifiers of SGML
should have been chosen instead of something that almost always has mapped
onto a raw filesystem, which is not about identity.  When you move a file,
its identity does not change, but its path does.

> HTTP is a protocol for using the identity provided by URIs to actually
> tell you about the resource.  It is like a big catalog.  (If I wanted to
> ambush your FPI and turn it into a nasty referral pointer I would send you
> a catalog file through the mail.  Then you would have a way of
> dereferencing it.
> And you'd have to think of a whole new naming system because that one
> was broken.  Or, you would have to admit that the fact that some things
> *can* be looked up doesn't change the fact that they are identifiers.)

http is what it was designed to be -- hyper text transfer protocol.

If I thought you were going to ambush me using such a catalog, I would
insist on using the public identifiers to access the resources, not the
system identifiers.  They are system identifiers, precisely because they
may be specific to the system.  In practice, it is just an expanded 
filesystem.

> Ah, I see where you are coming from I think.  You argue that because
> different people quite rightly want to do different things with the data,
> then the syntactic constraints will be different.  But in fact, whatever
> someone does with the information carried in a document, there are some
> constraints on the language in which it is written which are absolutely a
> property of the language.
> You can use an xHTML file as wallpaper, but it still can't nest headings
> and be an xHTML file.  (The same applies to semantics.)

No.  Only the way of describing the contract will be different and of varying
significance.  Most simple specifications I have seen are not described 
completely by DTDs or schema languages, and rely on the fact that they are
exchanging data with other applications that always produce data within the
limits of the specification, or data that by the specification can be 
encapsulated and passed along easily.

> I felt you are trying to sway this group against URIs by indicating that
> others have found them inadequate, and so I would like you to justify that.

For identity, yes, if reference to the RFC suddenly overrides explicit 
language that identifies exact string comparison as the method for
determining equality, for stated reasons.

Ray Whitmer
ray@xmission.com
Received on Monday, 22 May 2000 12:14:24 UTC