Re: URIs from Sean Martin on 2006-06-16 (public-semweb-lifesci@w3.org from June 2006)

From: Sean Martin <sjmm@us.ibm.com>
Date: Fri, 16 Jun 2006 10:39:40 -0400
To: public-semweb-lifesci@w3.org
Cc: Alan Ruttenberg <alanruttenberg@gmail.com>
Message-ID: <OFE12D5E5E.6F954138-ON8525718F.004D99E7-8525718F.00508D83@us.ibm.com>
Hi Alan,

AR>    b) The URI is used primarily as a name. Insofar as we want to use 
AR> names, it is important there be some stable URIs. Of course it 
AR> doesn't hurt if the URI becomes dereferenceable at some point, and it 
AR> would even be nice, 

AR>    d) Any URL we use needs to be able to be dereferenced to the thing 
AR> it is (and not dereferenced if you can't do that). Its only meaning 
AR> is what it dereferences to.

In the systems we are building for wide-area collaborative biomedical 
research, we often use URIs to name things that have a digital existence. 
These include multi-modal images such as slide scans, X-rays, MRI and CAT 
scans, simulation visualizations and the processed results of analyzing 
these images; spreadsheets of lab experimental results; "in silico" 
experimental results from simulations and intermediate analysis; the 
modeling program code and its SBML/CellML representations; and so on.  We 
name these objects uniquely and unambiguously, so that they keep the same 
name wherever they are copied or cached (which, of course, happens all the 
time, given that wide-area collaboration is our purpose) and so that a 
string comparison on the URI is enough to determine binary equality of the 
objects named.
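
A minimal sketch of how this property could be obtained, assuming (purely 
for illustration) that a name is derived from a hash of the object's bytes. 
The `urn:sha256:` prefix and the `name_for` helper are hypothetical and are 
not a description of how the names are actually minted:

```python
import hashlib

def name_for(data: bytes) -> str:
    """Mint a hypothetical content-derived name for a blob of bytes."""
    return "urn:sha256:" + hashlib.sha256(data).hexdigest()

original = b"slide scan pixels"
cached_copy = bytes(original)   # a byte-for-byte copy held elsewhere
edited = original + b"!"        # any change to the bytes

# Comparing the names as strings stands in for comparing the bytes.
same = name_for(original) == name_for(cached_copy)
different = name_for(original) != name_for(edited)
```

Under that assumption a copy keeps its name wherever it is cached, and two 
equal names imply equal bytes (up to hash collisions).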

In addition, we store multiple metadata statements about these named 
objects: who created them, when and why; what they are (format and 
context); what they are derived from; who else references them and what 
those references say about them; and so on. For this we use the URI in RDF 
graphs, which generally interconnect large networks of these named objects, 
since the objects are very rarely useful in isolation. 

By querying or searching our systems we can establish the purpose and 
entire provenance graph of any data object as well as its relationships to 
all the other objects and concepts we know about. Sometimes these 
relationships can be established or mined automatically using inference or 
search techniques.
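
As a rough illustration of metadata statements and a provenance query, the 
sketch below holds triples as plain Python tuples rather than in a real RDF 
store. Every URI and the `ex:` property names are made up for the example:

```python
# Triples are (subject, predicate, object); all names here are hypothetical.
triples = {
    ("urn:example:scan-1", "ex:createdBy", "urn:example:alice"),
    ("urn:example:scan-1", "ex:format", "image/tiff"),
    ("urn:example:analysis-7", "ex:derivedFrom", "urn:example:scan-1"),
    ("urn:example:report-3", "ex:derivedFrom", "urn:example:analysis-7"),
}

def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is stated."""
    return {o for s, p, o in triples if s == subject and p == predicate}

def provenance(uri):
    """Every URI that `uri` is directly or transitively derived from."""
    seen = set()
    stack = [uri]
    while stack:
        for parent in objects(stack.pop(), "ex:derivedFrom"):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

Here `provenance("urn:example:report-3")` walks the `ex:derivedFrom` links 
back through the analysis to the original scan.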

We use the URI to access a copy of any named object whenever it is needed, 
either from our local cache, from the place(s) it came from if it is still 
available there, or from a third party's cache if not: perhaps to view or 
annotate it, perhaps to process it further automatically, perhaps in a 
workflow pipeline or out at a GRID processing node. The in silico 
experimental workflows themselves may be expressed (e.g. in OWL-S) in 
URI-named RDF documents, a.k.a. "named graphs", and after execution they 
become the backbone of that experiment's provenance graph, including URI 
pointers to results.  The metadata associated with objects often provides a 
semantic context that aids the automated impedance matching of intermediate 
steps in experimental workflows (i.e. a step above content format 
negotiation), either at experiment design time or during execution.
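
The lookup order described above (local cache, then the origin, then a third 
party's cache) can be sketched as a simple fallback chain. The `fetch` helper 
and the dictionary-backed sources are stand-ins invented for this example, 
not real service endpoints:

```python
from typing import Callable, Iterable, Optional

# A source takes a URI and returns the bytes, or None if it has no copy.
Source = Callable[[str], Optional[bytes]]

def fetch(uri: str, sources: Iterable[Source]) -> Optional[bytes]:
    """Try each source in order and return the first copy found."""
    for source in sources:
        data = source(uri)
        if data is not None:
            return data
    return None

# Stand-ins for a local cache, the origin and a third party's cache.
local_cache = {}.get
origin = {}.get  # pretend the origin no longer holds the object
third_party = {"urn:example:scan-1": b"pixels"}.get

copy = fetch("urn:example:scan-1", [local_cache, origin, third_party])
```

Because the name does not encode a location, the caller does not need to 
know which of the sources ended up supplying the copy.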

Third-party collaborators and users of our system's data can unambiguously 
annotate objects and metadata in our systems by creating their own locally 
held metadata graphs, and can take and hold local copies of the named 
objects, without concern that the underlying references will change. URIs 
can be dereferenced to retrieve the named data and/or the associated 
metadata using multiple transport protocols and service endpoints. 

In my view it would be a mistake to tie all possible legal RDF URIs (it's 
just a name!) to a single endpoint and transport protocol.  Additionally, 
it is extremely helpful to have some obvious way to programmatically 
disambiguate between URIs that offer different social/technical 
"contracts".

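One obvious way to disambiguate programmatically, sketched here only as an 
assumption, is to key on the URI scheme. The `CONTRACTS` table and its 
wording are hypothetical; only the remark that info: URIs are not meant to 
be dereferenced comes from the quoted message below:

```python
from urllib.parse import urlsplit

# Hypothetical mapping from URI scheme to the "contract" it signals.
CONTRACTS = {
    "http": "dereferenceable over HTTP",
    "https": "dereferenceable over HTTP",
    "info": "identifier only, not meant to be dereferenced",
    "urn": "name; resolution depends on the resolver in use",
}

def contract_for(uri: str) -> str:
    """Look up the contract implied by the scheme of `uri`."""
    scheme = urlsplit(uri).scheme.lower()
    return CONTRACTS.get(scheme, "unknown contract")
```

A caller could then decide, before attempting any network access, whether 
a given name is expected to resolve at all and through which kind of 
endpoint.
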
Kindest regards, Sean

--
Sean Martin
IBM Corp.


public-semweb-lifesci-request@w3.org wrote on 06/16/2006 02:51:52 AM:

> 
> There was a discussion a few weeks ago about URIs that touched on various 
> issues. This message is an attempt to untangle them, something I said 
> I would write up as an action item in one of the HCLS conference 
> calls. We'll be discussing URIs at the Monday BioRDF conference call.
> 
> As I read the discussion I partitioned it into three distinct issues:
> 
> 1) The relationship between the use of a URI in a representation and 
> what it dereferences to, if anything. The possibilities seem to be:
> 
>    a) The identifier is not intended to be dereferenceable. In that 
> case the info: scheme was suggested for the form of the URI, as that 
> is explicitly not dereferenceable.
> 
>    b) The URI is used primarily as a name. Insofar as we want to use 
> names, it is important there be some stable URIs. Of course it 
> doesn't hurt if the URI becomes dereferenceable at some point, and it 
> would even be nice, so let's leave open that possibility (but see the 
> caveats in the discussion below).
> 
>    c) Any URL we use needs to be able to be dereferenced to something.
> 
>    d) Any URL we use needs to be able to be dereferenced to the thing 
> it is (and not dereferenced if you can't do that). Its only meaning 
> is what it dereferences to.
> 
> 2) What a URI refers to. Some of this conversation was made in the 
> form of a discussion about what reasonable arguments to owl:sameAs 
> are - for example, should one say that 
> http://www.expasy.org/uniprot/P04637 is the sameAs 
> http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=NP_000537.
> 
> Another part of the conversation talked in terms of whether the URI 
> http://www.expasy.org/uniprot/P04637 should, for our purposes, refer 
> to a database record or to a thing in the world - Human P53 proteins.
> 
> Of course these are two sides of the same coin - you would only say 
> they were the same if the two URIs above referred to things in the 
> world. As database entries, they are obviously different. There are 
> different fields, they are maintained by different people, etc.
> 
> 3) Something I will call the social aspect of URIs, for lack of a 
> better term. By this I mean those aspects of the process we go through to 
> come to a shared use of a URI. Under this category there is the 
> ontology building, the strategies for connecting pieces of 
> information generated by different groups. There was a bit in the 
> conversations where people were arguing about whether using sameAs 
> for mapping was pollution or a necessity, for instance. An important 
> part of this in our context is how to define the use of URLs to 
> things where there was not rigorous ontological engineering applied 
> to create careful definitions, things like terminologies and entries 
> in gene databases.
> 
> ---
> 
> I'll offer some of my own opinions on these issues now.
> 
> On the matter of what a URI dereferences to, I think it is more 
> important to get the names in place quickly. I don't agree with the 
> point of view that we should explicitly make them not 
> dereferenceable, even though I'm not sure what should come back when 
> we ask for what they point to yet. And I don't see support for there 
> being a necessity that anything that looks like a URL have a server 
> that returns something specific back. Here's a quote from RFC 3986,
> 
> > Although many URI schemes are named after protocols, this does not 
> > imply that use of these URIs will result in access to the resource 
> > via the named protocol.  URIs are often used simply for the sake of 
> > identification.
> 
> It will be part of our social process to come to some understanding and 
> agreement about what would be useful for us to have come back, if 
> anything. Is it an RDF graph? A bunch of OWL definitions of things 
> related to the gene? A representation of the asn record? A page of 
> HTML? All of the above?
> 
> On the question of what kind of concept an entrez gene URI refers to, 
> I think that concept needs to be "databaseRecord". There are too many 
> different concepts that it could mean if we want it to refer to 
> something in the world - does it refer to the sequence of the gene? 
> The typical gene? All mutations of it that are found in populations? 
> The possible gene products?
> 
> Rather, we can use the URI to the database entry to start to build 
> concepts by defining properties and using them in OWL class 
> definitions in a variety of ways. In foaf and SKOS, for instance, 
> there is a property isPrimarySubjectOf. The kind of equivalence we 
> can have between http://www.expasy.org/uniprot/P04637 and 
> http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=NP_000537 
> is something like: the same something isPrimarySubjectOf 
> http://www.expasy.org/uniprot/P04637 and 
> http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=NP_000537, 
> where "something" is a blank node in RDF.  Or in OWL
> 
> Class(P53Gene complete
>      restriction(isPrimarySubjectOf
>                    value(<http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=NP_000537>)))
> 
> Class(P53Transcript partial
>      intersectionOf(mRNA restriction(derivesFrom someValuesFrom(P53Gene))))
> 
> Which says that it is necessary and sufficient for x to be a 
> P53Gene, for example, if someone has stated or it has been inferred that
> 
> Individual(x value(isPrimarySubjectOf <http://www.expasy.org/uniprot/P04637>))
> 
> and that a P53 transcript, among other things, is an mRNA that 
> derivesFrom some P53Gene.
> 
> (there will be more complicated definitions too :)
> 
> [sameAs, equivalentClass, equivalentProperty will be a necessity, I 
> think, BTW]
> 
> As for the social process, I look forward to the discussion on Monday :)
> 
> Regards,
> Alan
> 
> 
> http://www.w3.org/TR/uri-clarification/
> Uniform Resource Identifier (URI): Generic Syntax - http://tools.ietf.org/html/3986
> Relations in biomedical ontologies - http://genomebiology.com/2005/6/5/R46
> http://en.wikipedia.org/wiki/Uniform_Resource_Identifier
> http://en.wikipedia.org/wiki/URL
>
Received on Friday, 16 June 2006 14:39:57 UTC