Re: Web Resource Identity

From: Henrik Frystyk Nielsen (frystyk@w3.org)
Date: Sun, May 30 1999


Message-Id: <3.0.5.32.19990530141822.0302e970@localhost>
Date: Sun, 30 May 1999 14:18:22 -0400
To: Paul Prescod <paul@prescod.net>, "xlxp-dev@fsc.fujitsu.com" <xlxp-dev@fsc.fujitsu.com>, xml-dev <xml-dev@ic.ac.uk>, lavoie@oclc.org
From: Henrik Frystyk Nielsen <frystyk@w3.org>
Cc: www-wca@w3.org
Subject: Re: Web Resource Identity

At 10:01 28/05/1999 -0500, Paul Prescod wrote:

>It is encouraging because it is long needed.

Great to hear!

>It is disturbing because I
>believe it identifies a key problem with the Web (or with my understanding
>of the Web). 
>
>This document refers to the URI specification in its definition of
>"resource": "...anything that has identity." This is troubling because
>there is no definition of identity. In the HyTime and object oriented
>worlds, I believe that the defining characteristic of things with identity
>is that you can take two references and determine if they refer to the
>same object.
>
>I do not see how to do this on the Web. Consider the following URLs:
>
>http://www.mitre.org/index.html
>http://www.mitre.org/
>http://www.mitre.org
>
>Do they refer to the same resource? Let's try the answer both ways:

The only way these resources can be considered related at all is if
there is an explicit statement that describes their exact relationship
(in this case, what exactly "same" means). On most servers, this mapping
lives in a global config file and is known to the publisher but to nobody
else. Metadata can be used to describe these relationships in a way that
is accessible to parties (not necessarily the whole world) outside the
local server serving the resources.

>Summary:
>
>I believe that the Web needs a concept of a canonical URL, if it doesn't
>already have one. Retrieving a document or the HEAD for the document
>should describe the canonical URL. I wouldn't mind if the canonical URL
>was a totally unreadable UUID as long as I can take two URLs and figure
>out whether they refer to two things that happen to have the same content
>or actually refer to the SAME THING.

I think it is important to realize that the canonical (or generic) URI
doesn't have to be linked to the syntax of the URI - it is a question of
how the resource it identifies relates to the rest of the world. In your
example above, any of the three names could be considered the
"generic URI":

	http://www.mitre.org/index.html
	http://www.mitre.org/
	http://www.mitre.org

Note, btw, that the last two examples are equivalent at a syntactic level:
for an http URI, an empty path is equivalent to "/".
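
To make the syntactic point concrete, here is a minimal Python sketch. It
assumes only the http rule that an empty path is equivalent to "/" and that
scheme and host names are case-insensitive. The helper name
normalize_http_uri is hypothetical, and the sketch deliberately ignores
default ports, percent-encoding and dot segments.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_http_uri(uri: str) -> str:
    """Hypothetical helper: apply a small subset of syntactic normalization."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    # Assumption: the authority carries no userinfo, so lowercasing is safe.
    netloc = parts.netloc.lower()
    # An empty path in an http URI is equivalent to "/".
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


uris = [
    "http://www.mitre.org/index.html",
    "http://www.mitre.org/",
    "http://www.mitre.org",
]
normalized = [normalize_http_uri(u) for u in uris]

# Only the last two collapse to the same string.
same_syntactically = normalized[1] == normalized[2]
```

The last two names come out identical, while the first stays distinct: no
syntactic rule can tell a client that /index.html and / name the same
resource.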

This of course also has to do with trust - which would you most likely
trust to provide the authoritative W3C host page URI among these URIs
(where 'none' is a valid answer):

	http://www.w3.org/
	http://w3c-mirror.some.host/
	http://another.host/w3c/

Without a mechanism for identifying who the authoritative publisher is, it
is hard to talk about a generic URI. This is the reason why we in the WCA
terminology draft [1] define a publisher as "The principal responsible for
the publication of a given resource and for the mapping between the
resource and any of its resource manifestations".
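
As a rough illustration of the publisher's role, the following Python
sketch models "generic URI" as an explicit assertion made by a publisher,
which a client accepts only if it trusts that publisher. Everything in it
is hypothetical: the Assertion record, the generic_uri function, the
publisher identifiers, and the idea that assertions arrive as simple
records. It does not describe any existing metadata format.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Assertion:
    publisher: str    # who makes the claim (hypothetical identifier)
    uri: str          # the URI being described
    generic_uri: str  # the URI the publisher declares to be the generic one


def generic_uri(uri: str,
                assertions: Iterable[Assertion],
                trusted_publishers: Set[str]) -> Optional[str]:
    """Return the generic URI for `uri`, using only trusted assertions."""
    for a in assertions:
        if a.uri == uri and a.publisher in trusted_publishers:
            return a.generic_uri
    # No trusted publisher has said anything: 'none' is a valid answer.
    return None


assertions = [
    Assertion("w3.org", "http://www.w3.org/", "http://www.w3.org/"),
    Assertion("some.host", "http://w3c-mirror.some.host/", "http://www.w3.org/"),
    Assertion("another.host", "http://another.host/w3c/", "http://another.host/w3c/"),
]

# Assumption: this client trusts only one publisher.
trusted = {"w3.org"}
results = {a.uri: generic_uri(a.uri, assertions, trusted) for a in assertions}
```

With this trust set, only http://www.w3.org/ resolves to a generic URI. The
mirror's claim may well be true, but the client has no reason to believe
it, and that is exactly the part syntax cannot supply.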

So, in summary, I would argue that the concept of a generic URI is useful,
but that it is a matter of relationships and trust rather than syntax.

Henrik

[1] http://www.w3.org/1999/05/WCA-terms/
--
Henrik Frystyk Nielsen,
World Wide Web Consortium
http://www.w3.org/People/Frystyk