Re: Handles and PURLs from Kevin Smathers on 2003-05-22 (www-rdf-dspace@w3.org from May 2003)

From: Kevin Smathers <kevin.smathers@hp.com>
Date: Thu, 22 May 2003 14:07:58 -0400 (EDT)
To: "Butler, Mark" <Mark_Butler@hplb.hpl.hp.com>
Cc: "(www-rdf-dspace@w3.org)" <www-rdf-dspace@w3.org>
Message-ID: <3ECD12F4.5000702@hp.com>

Butler, Mark wrote:

>5. Due to 3, URIs tend to mix identity and version (i.e. date, time). There
>are some disadvantages to mixing these two different axes, particularly as
>different URIs mix them in different ways so they are not algorithmically
>separable. Perhaps it might be useful to separate these axes, as then it
>would be possible to determine from the URIs alone that two resources are
>versions of the same thing. Now this is controversial, as we've already
>discussed an opposing view e.g. identifiers must be random. But from the
>CC/PP work, I'm conscious things are much easier for processor developers as
>this may be easier than keeping track of a bunch of metadata that says all
>these identifiers refer to versions of the same resource. For more details
>see
>http://www.hpl.hp.com/techreports/2003/HPL-2003-31.html
>  
>
Genesis' position is that it isn't identity and version that are 
conflated, but Identity and Content.  By 'Identity' and 'Content' I mean 
the same distinction as the semweb distinction between 'stating' and 
'statement'. The 'Identity' incorporates the concept of an instantiation 
and possible metadata such as date and time (your version), but also 
other metacharacteristics like owner, access permissions, and trust 
among others.  In contrast to Identity, Content is data divorced from 
the context of its instantiation.  Content-based identifiers such as SHA
hashes indicate the data that is in a document, but give no information
about where that data came from, who owns it, or whether there is a more
recent version.
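
To make the limitation concrete, here is a minimal Python sketch. It
assumes the content identifier is a hex SHA-1 digest of the raw bytes;
the helper name content_id and the sample byte strings are hypothetical
and not taken from Genesis.

```python
import hashlib

def content_id(data: bytes) -> str:
    """Hex SHA-1 digest of the bytes (hypothetical helper, not Genesis code)."""
    return hashlib.sha1(data).hexdigest()

# Identical bytes held by two different owners hash to the same identifier:
# the hash says nothing about owner, date, location, or newer versions.
copy_on_server_a = b"same document body"
copy_on_server_b = b"same document body"
revised_copy = b"same document body, revised"

same = content_id(copy_on_server_a) == content_id(copy_on_server_b)
changed = content_id(copy_on_server_a) != content_id(revised_copy)
```

Here same and changed are both true: the identifier tracks the bytes
and nothing else.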

In Genesis, rather than rely entirely on either content-based
identifiers or resource-based identifiers, we combine the two; a
genesis URI looks something like:

  genesis://host/genesis-server/resourceid;contentid

This sets up a means of syntactic transformation of URIs; if the program 
that is retrieving the URI only needs the data contained in the 
reference then the genesis id can be transformed into a content based 
identifier:

  hdl:sha1/contentid

Using this identifier the content of the specified document can be 
retrieved from any peer that happens to be able to respond to the 
content hash, allowing document contents to be widely mirrored 
throughout the network.

If the application that is retrieving the data has some other interest 
in the document, such as whether that document ever in fact had those 
contents, then the full genesis id can be converted to a URL for 
retrieval from its canonical owner.

   http://host/genesis-server/resourceid;contentid

Analogously, if the application is uninterested in the specific version, 
and only wants to retrieve the contents that were most recently assigned 
to the resource identifier, then the contentid can be dropped.

   http://host/genesis-server/resourceid
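
The three transformations above amount to simple string manipulation.
The following Python sketch assumes only the URI layout shown in this
message, and that the content id follows the last ';' in the URI. The
function names and the sample resource and content ids are invented for
illustration.

```python
def split_genesis(uri: str):
    """Split genesis://host/path/resourceid;contentid into (location, contentid).

    contentid is None when the URI names a resource only.
    """
    prefix = "genesis://"
    if not uri.startswith(prefix):
        raise ValueError("not a genesis URI: " + uri)
    rest = uri[len(prefix):]
    # Assumption: the content id is whatever follows the last ';'.
    location, sep, content = rest.rpartition(";")
    if not sep:
        return rest, None
    return location, content or None

def to_content_handle(uri: str) -> str:
    """Content-only form: retrievable from any peer that knows the hash."""
    _, content = split_genesis(uri)
    if content is None:
        raise ValueError("URI carries no content id: " + uri)
    return "hdl:sha1/" + content

def to_canonical_url(uri: str) -> str:
    """Full form: ask the canonical owner about this resource and content."""
    location, content = split_genesis(uri)
    return "http://" + location + (";" + content if content else "")

def to_latest_url(uri: str) -> str:
    """Resource-only form: whatever content was most recently assigned."""
    location, _ = split_genesis(uri)
    return "http://" + location

uri = "genesis://host/genesis-server/res42;abc123"  # made-up ids
handle = to_content_handle(uri)   # "hdl:sha1/abc123"
canonical = to_canonical_url(uri) # "http://host/genesis-server/res42;abc123"
latest = to_latest_url(uri)       # "http://host/genesis-server/res42"
```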

It is our opinion that RDF resource references should be listed in the
combination form, that is, as full genesis identifiers, since the RDF
creator has no way of predicting which of these uses a specific
application will have for its resource references.  The exception is
links into the future (for which there is no content at present), which
should be expressed by their resource identifier alone.
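
As a rule of thumb in code, this might look like the Python sketch
below. It is only an illustration: rdf_reference and its arguments are
hypothetical names, and a hex SHA-1 digest is assumed for the content
id.

```python
import hashlib
from typing import Optional

def rdf_reference(host: str, server_path: str, resource_id: str,
                  content: Optional[bytes] = None) -> str:
    """Build the identifier an RDF creator would record (hypothetical helper)."""
    base = f"genesis://{host}/{server_path}/{resource_id}"
    if content is None:
        # Link into the future: no content exists yet, so only the
        # resource identifier can be given.
        return base
    # Content exists: record the full combination form so that any
    # consumer can derive whichever variant it needs.
    return base + ";" + hashlib.sha1(content).hexdigest()

future_link = rdf_reference("host", "genesis-server", "res42")
full_link = rdf_reference("host", "genesis-server", "res42", b"document body")
```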

>6. The concept behind PURLs and Handles is good, i.e. when a resource moves
>you don't need to worry about it. DNS already has a level of indirection
>built in, so why not do this for retrievable resources? This is discussed in
>the Stone paper cited above.
>  
>
There are multiple ways to solve 404 errors, including (among others) 
URL forwarding, and DNS updates.  I can't see any obvious reasons why 
handles should be considered more long-term retrievable than URLs are.  
Perhaps someone can explain.

Within the domain of URIs, if the custodian of the URI doesn't want to
maintain its linkage over time (e.g., the domain name gets taken away,
the company goes bust, etc.) then one must rely on higher-level social
abstractions.  A new web site replaces the old one; update your links if
you care about retrievability.

My problem with URNs or Handles is that I don't see any mechanism for
arbitration.  What keeps someone from stepping on your namespace and
allocating invalid or conflicting identifiers?  CORBA style UUIDs
(Windows GUIDs?) fall prey to malice and stupidity.  And content-based
identifiers can only identify content, not instance.


-- 
========================================================
   Kevin Smathers                kevin.smathers@hp.com    
   Hewlett-Packard               kevin@ank.com            
   Palo Alto Research Lab                                 
   1501 Page Mill Rd.            650-857-4477 work        
   M/S 1135                      650-852-8186 fax         
   Palo Alto, CA 94304           510-247-1031 home        
========================================================
use "Standard::Disclaimer";
carp("This message was printed on 100% recycled bits.");
Received on Friday, 23 May 2003 03:16:25 UTC