W3C home > Mailing lists > Public > public-semweb-lifesci@w3.org > July 2007

Re: URL +1, LSID -1

From: Alan Ruttenberg <alanruttenberg@gmail.com>
Date: Sat, 14 Jul 2007 22:17:27 -0400
Message-Id: <A4A72177-A5D1-4537-A13B-BF9FF30DC6D5@gmail.com>
Cc: wangxiao@musc.edu, Michel_Dumontier <Michel_Dumontier@carleton.ca>, public-semweb-lifesci <public-semweb-lifesci@w3.org>, Mark Wilkinson <markw@illuminae.com>, Benjamin Good <goodb@interchange.ubc.ca>, Natalia Villanueva Rosales <naty.vr@gmail.com>
To: Eric Jain <Eric.Jain@isb-sib.ch>

Summary: This is a technical discussion in which I respond to various
points that Eric makes in his message regarding the utility of using
PURLs, which I place in the context of making statements on the
semantic web. Comments are inline with the original conversation
because they refer to specific passages of his original message.

[I'm going to experiment with including short summaries at the
top of long messages, at the suggestion of some colleagues who need a
better way of prioritizing emails.]

Hi Eric,

On Jul 14, 2007, at 10:26 AM, Eric Jain wrote:

> Alan Ruttenberg wrote:
>
> I'm not at all saying that you wouldn't want to attach any  
> statements to a specific representation, but that if you did, you'd  
> better use the actual URL of the representation, not some PURL.

The point of having the PURLs is to ensure that there is a mechanism
for handling three cases that LSIDs were intended to address (but
which can be addressed without the trouble of introducing a separate
resolution mechanism):
1) To be immune from the "actual URL of the representation" changing
(e.g. beta.uniprot.org goes out of beta).
2) To enable switching to a backup if the server is turned off, or
certain pages go 404.
3) To facilitate local caching of content from servers such as
UniProt in such a way as not to change the URLs clients need to use
to access this content.
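As a rough illustration of how such a community-maintained redirect
layer could behave (a hypothetical sketch only; the table and its
update are invented for illustration, not the actual PURL resolver
implementation):

```python
# Hypothetical sketch of a community-maintained PURL redirect table.
# The stable PURL stays fixed; only the target is updated when the
# underlying site changes (e.g. when beta.uniprot.org goes out of beta),
# so statements made against the PURL survive the move.

REDIRECTS = {
    "http://purl.org/commons/html/uniprotkb/P12345":
        "http://beta.uniprot.org/uniprot/P12345",
}

def resolve(purl):
    """Return the current target URL for a stable PURL, or None."""
    return REDIRECTS.get(purl)

# Case 1 above: the hosting site drops its "beta" prefix. Clients keep
# using the same PURL; only the table entry changes.
REDIRECTS["http://purl.org/commons/html/uniprotkb/P12345"] = \
    "http://uniprot.org/uniprot/P12345"
```

The same one-line table update would handle case 2 (pointing at a
backup server) and case 3 (pointing at a local cache).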

Many of us who have worked in the field have seen (and been burned  
by) variants of these cases over the years.

> For example, if you wanted to state that
> http://beta.uniprot.org/uniprot/P12345 validates as XHTML 1.0:
> <http://beta.uniprot.org/uniprot/P12345>
>   validatesAs <http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd>
> ...seems better than e.g.
> <http://purl.org/commons/html/uniprotkb/P12345>
> ...of which you can't be quite as certain that it will always point  
> to the specific page you wanted to describe!

It is better, because when http://beta.uniprot.org/uniprot/P12345
switches to http://uniprot.org/uniprot/P12345, as you suggest will
happen, we (by this I mean the imagined community that administers
the PURLs in the interest of serving HCLS informatics) will update
the redirect to point to the new URL. When the "beta" is dropped,
either it will be the same page as before, in which case all our
statements will still be valid, or the page will change, in which
case using UniProt's direct address doesn't help either. The intention
of our community in setting up a PURL system would be not merely to
manage redirects blindly, but to set (and represent) expectations
about what behavior a client can reasonably expect from a fetch of
any of these URLs. It will always be up to UniProt to decide what
they put on HTML pages. But within our community it would be
considered good manners for a provider to explain what its page
update policy is (or isn't), and then to live up to what it has said.

> You could argue that this URI is meant to represent the more  
> general concept of "an HTML representation of P12345", but at that  
> point I really start to wonder...

This has not been argued. The closest thing along these lines that has
been argued for is the definition of
http://purl.org/commons/record/uniprotkb/P12345, which is intended to
represent the underlying information in the database record without
commitment to a particular format (XML, RDF, HTML). This URI would be
used for making statements about aspects of this information that are
common to any format (e.g. this record includes a representation of
an amino acid sequence). That /record/ URI is intended not to be an
information resource, to 303 to an RDF document describing how to
access the specific formats, etc., as described in other emails.
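To make the 303 pattern concrete, here is a simplified local
simulation of what a client dereferencing the /record/ URI would see
(the describing-document URL and the list of representations are
assumptions for illustration, not the deployed resolver's behavior):

```python
# Simplified simulation of the 303 pattern: the /record/ URI denotes a
# non-information resource, so dereferencing it does not return content
# directly. Instead the server answers 303 See Other, pointing at an
# RDF document that describes how to reach the concrete formats.

RECORD_URI = "http://purl.org/commons/record/uniprotkb/P12345"
# Assumed location of the describing document, for illustration only.
DESCRIBING_DOC = "http://purl.org/commons/record/uniprotkb/P12345.rdf"

def fetch(uri):
    """Pretend to dereference a URI, returning (status, payload)."""
    if uri == RECORD_URI:
        # 303: "what you asked about is not a document; see this one"
        return 303, DESCRIBING_DOC
    if uri == DESCRIBING_DOC:
        # The describing document lists the format-specific URLs.
        return 200, [
            "http://beta.uniprot.org/uniprot/P12345",
            "http://beta.uniprot.org/uniprot/P12345.xml",
            "http://beta.uniprot.org/uniprot/P12345.rdf",
            "http://beta.uniprot.org/uniprot/P12345.fasta",
        ]
    return 404, None

status, payload = fetch(RECORD_URI)
if status == 303:
    status, payload = fetch(payload)
```

The point of the indirection is that statements about the record as
such attach to the /record/ URI, while statements about a particular
rendering attach to the format-specific URLs it leads to.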

>> I suppose it might be possible to represent which header should be  
>> used in the content negotiation as part of the RDF, but a) It's  
>> got to be easier to just put that information in the name and b)  
>> In the case that you want to, e.g. mirror some contents of Uniprot  
>> on a file system, you will have to make up distinct names anyways?  
>> Maybe I'm dense, but I fail to see how content negotiation is of  
>> any use on the semantic web.
>
> Note that the content negotiation is done *at the level of the  
> resolver*, all the different representations have their own URLs:

Yes, but what sorts of statements can be made using
http://purl.uniprot.org/uniprot/P12345 as the subject? Because it can
mean any of the below, even the protein class itself, how can a
*semantic web* statement be made using it?

> http://beta.uniprot.org/uniprot/P12345
> http://beta.uniprot.org/uniprot/P12345.xml
> http://beta.uniprot.org/uniprot/P12345.rdf
> http://beta.uniprot.org/uniprot/P12345.fasta
>
> Content negotiation could be a useful mechanism for bypassing the  
> HTML representation (which is what the PURL resolves to by default,  
> greatest common denominator etc), important if a lot of requests  
> need to be made.
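For reference, resolver-level content negotiation of the kind Eric
describes amounts to mapping the Accept header onto the
format-specific URLs listed above. A minimal sketch (the media-type
mapping, including the FASTA type, is my assumption, not UniProt's
actual configuration):

```python
# Sketch of content negotiation at the level of the resolver: the
# generic URL redirects to a format-specific URL chosen from the
# request's Accept header. The URLs are those quoted in the message;
# the media-type mapping itself is an assumption for illustration.

FORMATS = {
    "application/rdf+xml": "http://beta.uniprot.org/uniprot/P12345.rdf",
    "application/xml":     "http://beta.uniprot.org/uniprot/P12345.xml",
    "text/x-fasta":        "http://beta.uniprot.org/uniprot/P12345.fasta",
}
# HTML is what the PURL resolves to by default.
DEFAULT = "http://beta.uniprot.org/uniprot/P12345"

def negotiate(accept_header):
    """Return the representation URL for the first recognized type."""
    for media_type in (part.split(";")[0].strip()
                       for part in accept_header.split(",")):
        if media_type in FORMATS:
            return FORMATS[media_type]
    return DEFAULT
```

Note that this mechanism selects among representations; it does not by
itself say which URI should be the subject of a semantic web
statement, which is the question at issue here.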

The issue of efficiency of requests is a separate issue, but you
aren't the only one who has mixed up the issue of efficiency with
clarity - I had a conversation with TBL a couple of weeks ago where I
argued that the whole hash URI thing was another such case - a
premature optimization.

IMO, the first goal of our design ought to be to ensure that
automated semantic web agents (idiots as they will be) have a
fighting chance of avoiding the difficult (even impossible) sorts of
disambiguation that people are faced with all the time. That bar
hasn't yet been met. Once we've ensured that we can meet that goal,
then we can talk about optimization. (Incidentally, we do discuss
various optimization techniques, from predictability of the form of
the name, to PURL servers sending back the rewrite rules they use so
that they can be implemented on the client side.)
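The client-side rewrite-rule idea mentioned parenthetically above
could look something like this (a sketch; the rule shown is invented
for illustration and is not a published PURL server rule):

```python
import re

# Sketch of the client-side optimization: if the PURL server published
# the rewrite rules it uses, a client could apply them locally and skip
# a network round trip through the resolver. The rule below is an
# invented example, not an actual published rule.

REWRITE_RULES = [
    (re.compile(r"^http://purl\.org/commons/html/uniprotkb/(\w+)$"),
     r"http://beta.uniprot.org/uniprot/\1"),
]

def rewrite(url):
    """Apply the first matching rule, or return the URL unchanged."""
    for pattern, replacement in REWRITE_RULES:
        new_url, count = pattern.subn(replacement, url)
        if count:
            return new_url
    return url
```

A client holding stale rules would simply fall back to the resolver,
so the optimization does not weaken the stability guarantee of the
PURLs themselves.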

> You'll notice that in the RDF representation, this HSSP resource is  
> represented with the URL http://purl.uniprot.org/hssp/7aat. The  
> main reason for pre-resolving the PURLs in the web pages is that  
> many people (been there, done that) like to see where they are  
> going before they click.

OK, I missed that. But I'd still use the same PURLs in the HTML.
There are other mechanisms for indicating the real destination, and
it may lead to confusion when people need to choose a name for the
subject or object of a statement. If you ever end up using RDFa,
you will need to use the same URIs as in the RDF.

> btw this is an example of a resource that won't work with the HCLS  
> PURL resolver at the moment, as this resolver can also only append  
> to a path!

So that you know, the PURL developers are interested in extending
their redirect service to better accommodate semantic web usage, and
they have offered to do this if we can get together and tell them
what we need.

Regards,
Alan
Received on Sunday, 15 July 2007 02:17:32 UTC
