Re: Fw: Use of LSIDs in RDF (fwd)

Hi Greg,

GT>I am not aware of a way to programmatically identify a persistent HTTP
GT>URI. Making URIs persistent is largely a function of who is responsible
GT>for maintaining that URI's authority.

and

GT> This "contract" is a social contract, persistence based on a social
GT> contract can also be true of HTTP URIs.

That is one of the main difficulties with the HTTP URI used on its own for 
our purpose - that it has to mix up unique naming, with the current 
location & access [and to only a single copy of the object]. If this were 
not problem enough, it has historically provided access to both unique 
objects and objects that change.  Given any random URL, you have no idea 
how you can actually treat it in your program - there is just no way to 
tell that it is actually a name of something rather than just the network 
location of an object or concept that has perhaps a dynamic expression. 
There is also no way to determine which particular "social contract" 
applies to this particular HTTP URI. This seems to me to be partly because 
URL's have been used (abused?) in so many different ways in the past and 
partly because they try to do at least two or three things at once. I am 
not sure how one either could undo this past or provide the extra 
facilities required without introducing something new. 

In contrast, any LSID starts out unambiguously with both social and 
technical contracts that provide certainty both on what sort of thing you 
are getting and the multiple ways in which you might get it. You can 
actually code a program around its use and recognize it for what it is - a 
unique name for something - at first sight without accessing it. In most 
cases it maps down to HTTP for actual access, but this does not always 
need to be the case. It is important that first and foremost, the LSID is 
unambiguously a unique name for something that might have many copies 
stored around the network. I believe this is why a URN spec. was chosen. 
Resolution was secondary. The technical contract of the LSID primarily 
gives you a standard method to enquire for where those places to obtain an 
exact copy of this particular object are (and there could be many, 
including some place local to your organization) and secondly a standard 
method to enquire widely (at the original authority, at other trusted 
authorities, at your organization, or an organization you collaborate 
with) about those places where information about this object and its 
relationships to other objects can be retrieved. 
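
As a rough illustration of "recognize it for what it is at first sight", here is a minimal Python sketch that identifies an LSID purely from its syntax, without any network access. It assumes the general shape urn:lsid:authority:namespace:object with an optional trailing revision; the names LSID, parse_lsid and the regular expression are hypothetical and only for illustration, not part of any LSID toolkit.

```python
import re
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    """Components of an LSID (illustrative type, not from any LSID library)."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


# Assumed shape: urn:lsid:<authority>:<namespace>:<object>[:<revision>]
_LSID_RE = re.compile(
    r"^urn:lsid:([^:\s]+):([^:\s]+):([^:\s]+)(?::([^:\s]+))?$",
    re.IGNORECASE,
)


def parse_lsid(text: str) -> Optional[LSID]:
    """Return the parsed LSID, or None if the string is not LSID-shaped."""
    match = _LSID_RE.match(text.strip())
    if match is None:
        return None
    authority, namespace, object_id, revision = match.groups()
    return LSID(authority, namespace, object_id, revision)


if __name__ == "__main__":
    for candidate in (
        "urn:lsid:ncbi.nlm.nih.gov.lsid.i3c.org:pubmed:12225585",
        "http://www.biomedcentral.com/pubmed/12225585",
    ):
        print(candidate, "->", parse_lsid(candidate))
```

A program can branch on the result: a parsed LSID is known to be a name, whereas for the HTTP URL nothing in the string itself says whether it is a name or just a location.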

If there ever was any reason at all to create a URN, I believe that 
uniquely identifying life science information is a reasonable one. The 
earliest web standards define how URN's should be named and the RFC's also 
provide guidance on creation of methods for dereferencing them. There are 
also a whole bunch of more recent standards like SOAP and WSDL that seem 
to have gained wide acceptance. Why not use them? The use cases fit. The 
fact that this LS URN and its specification is backwardly compatible with 
the web, using its name resolution and access protocols as well as a 
future semantic web seems all the better to me. The alternative is to 
attempt to shoehorn the problem into a protocol that was not designed to 
meet the needs.

GT>People are using URLs (HTTP URIs for naming), for example:
GT>http://www.biomedcentral.com/pubmed/12225585

Is this really a name for something or just a convenient link to 
something? In the context that you give it, it appears to me to be more 
like a link. NCBI has an entirely different name (or names) for this thing. A 
third party providing a similar convenient link would have created a third 
name. If all places had used the LSID 
(urn:lsid:ncbi.org:pubmed:12225585:1) we (and our software) would know 
they were all talking about the same thing without having to do a thing. 
Now if we actually want a copy of it, we dereference it to fetch one via 
any of those three HTTP URI's. Similarly, if we want to know more about 
it, we ask places that may have metadata for it. It seems to me that the 
LSID getAvailable method might be updated to make use of URIQA 
protocol URL's as one possible way of implementing the getMetadata 
method (URIQA style HTTP URI links could be provided as the port types in 
the returned WSDL).
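
To make the "same thing without having to do a thing" point concrete, here is a small Python sketch of comparing two LSIDs by name alone. It assumes that the "urn:lsid:" prefix and the authority part compare case-insensitively while the namespace, object and revision parts are case-sensitive; that rule, and the helper names lsid_key and same_name, are assumptions made for illustration rather than something quoted from the spec.

```python
from typing import Tuple


def lsid_key(lsid: str) -> Tuple[str, ...]:
    """Build a comparison key for an LSID string (illustrative helper).

    Assumption: prefix and authority are case-insensitive, the remaining
    parts (namespace, object, optional revision) are case-sensitive.
    """
    parts = lsid.strip().split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {lsid!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {lsid!r}")
    if any(part == "" for part in parts[2:]):
        raise ValueError(f"empty component in LSID: {lsid!r}")
    return (parts[2].lower(),) + tuple(parts[3:])


def same_name(a: str, b: str) -> bool:
    """True if both strings are the same LSID name under the assumed rule."""
    return lsid_key(a) == lsid_key(b)


if __name__ == "__main__":
    first = "urn:lsid:ncbi.org:pubmed:12225585:1"
    second = "URN:LSID:NCBI.org:pubmed:12225585:1"
    print(same_name(first, second))
```

No lookup is needed for the comparison; the two HTTP URLs in the example above, by contrast, can only be related by actually following the redirect.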

GT> However HTTP has the 3XX status codes to provide redirection etc.

Which HTTP URI is the unique name now? The original or the new location? 
Furthermore, how can I tell these two names are equivalent and reference 
the same object, especially as some folks discover and link to the newer 
name?

GT>One aspect of this that bothers me though, is
GT>partitioning of the semantic web into domains based on their metadata
GT>access interfaces. Access to metadata based on URIs alone only makes
GT>sense to me if the mechanisms to get the metadata are general for the web.

Good point. The metadata access mechanisms for LSID's are mapped down onto 
exactly the same HTTP URI's everyone uses today. No point in reinventing a 
wheel.

GT>It is true that only the widespread adoption of LSID will make it
GT>useful to the semantic web. I am guessing by default (laziness ?)
GT>HTTP URIs will be used as resources identifier if a LSID is not
GT>*easily* usable i.e. tools, tools, tools...

Yup :-) but don't forget that LS information on the web today is not 
generally in any kind of semantic web right now - it is just plain old 
web. Many people in the industry perceive the need for the LSID but have 
no particular interest in semantic web (yet!). However if adopting LSID 
means they now become part of a semantic web [because that's what the LSID 
spec. says to do], the semantic web folks might benefit. That is why it is 
important we get this right.

PDB is definitely down at the moment. We are working with them to bring up 
the authority there again with both the latest code (their previous 
version was using an out of date spec.) as well as extensive meta-data - 
previously it was a bare bones amount. Unfortunately they are snowed under 
right now with a major release of new code for their web site and 
database. I am not sure why the NCBI authority is down for you. For 
examples of NCBI data, try LSID's like:

urn:lsid:ncbi.nlm.nih.gov.lsid.i3c.org:genbank_gi:30350027
urn:lsid:ncbi.nlm.nih.gov.lsid.i3c.org:pubmed:12225585
urn:lsid:ncbi.nlm.nih.gov.lsid.i3c.org:genbank:bm872070

OMIM
urn:lsid:ncbi.nlm.nih.gov.lsid.i3c.org:omim:605956   (omimuser/omimpass)


Kindest regards, Sean

--
Sean Martin
IBM Corp.
 




Greg Tyrelle <greg@tyrelle.net> 
04/29/2004 02:53 AM

To
Sean Martin/Cambridge/IBM@IBMUS
cc
public-semweb-lifesci@w3.org
Subject
Re: Fw: Use of LSIDs in RDF (fwd)






*** Sean Martin wrote:
| BG>This leads me to a question about "persistent" URI's and URL's
| BG>(PURLS's): How do you ensure that two URI's are pointing at the same
| BG>object (bytes)?
|
|My question is how does one programmatically identify a persistent HTTP
|URI, as opposed to one that will retrieve tomorrow's weather or perhaps
|retrieve a file from a P2P network or one that returns dynamically
|changing content? Apologies in advance if there is an obvious answer to
|this question.

I am not aware of a way to programmatically identify a persistent HTTP
URI. Making URIs persistent is largely a function of who is responsible
for maintaining that URI's authority.

If I understand correctly the question you are asking is "tell me
something about the resource being identified by this URI ?". There
are a number of approaches to this. In the case of LSID this would be
the getMetaData interfaces. For HTTP URIs my current favourite is
URIQA (MGET HTTP method extension i.e. metadata get) [1]. RDDL [2] is
intended for this purpose but mainly for namespaces.

|HTTP URI's as probably the primary method of retrieval of the data object
|or meta-data about that object - after all much of the public LS data is
|actually out there on the web already retrievable by HTTP URI. If HTTP
|URI's were sufficient today, we would not have need of the LSID. So
|perhaps the question you should ask yourself is why are people not
|already widely using URL's for LS naming?

People are using URLs (HTTP URIs for naming), for example:

http://www.biomedcentral.com/pubmed/12225585


is a 302 redirect to the NCBI URL

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=12225585&dopt=Abstract&holding=f1000


Again, I believe it is how HTTP URIs are used or managed which is the
problem, not that they are broken or insufficient technology for the
purpose of naming.

|For me the main points are:
|Location independence of the object named - the extra layer of indirection
|makes this flexibility possible - there is a starting assumption that
|users will make/exchange local copies of the objects and also that
|authority entities will at some point want to transfer the authority over
|a LSID to another authority entity - while potentially maintaining control
|of their domain name, sometimes the same data is served from more than one
|"official" place on the web (e.g. Swiss-Prot - Marja, how does Annotea deal
|with this situation?), having the option of not using domain names in the
|identifier at all;

Good points. However HTTP has the 3XX status codes to provide
redirection etc. why invent a new protocol when these already exist ?

|Providing/using LSID's for one's data establishes a "contract" in which
|certain properties can be assumed (beyond those of the HTTP URI
|"contract") of an LSID named object:
|defines what can safely be assumed about multiple copies of objects which
|have the same LSID name - i.e. that they are identical; clear definition
|of what persistence means [both availability and never modifying a named
|object];

This "contract" is a social contract, persistence based on a social
contract can also be true of HTTP URIs.

|a formal mechanism for retrieving data [never ever changes] over multiple
|protocols and discovering and retrieving meta-data [which can change]
|about that object and its relationship to other objects [from the original
|source of the object or from a third-party who has something to add of
|their own] all using a single globally unique name.

I think the selling point of LSID (for me) is a standard interface for
life sciences metadata. One aspect of this that bothers me though, is
partitioning of the semantic web into domains based on their metadata
access interfaces. Access to metadata based on URIs alone only makes
sense to me if the mechanisms to get the metadata are general for the web.

|One parting thought.. widespread adoption of LSID spec. across the
|industry will at the same time create a very large semantic web.

It is true that only the widespread adoption of LSID will make it
useful to the semantic web. I am guessing by default (laziness ?)
HTTP URIs will be used as resources identifier if a LSID is not
*easily* usable i.e. tools, tools, tools...

In my limited testing of the Perl LSID client implementation, the only
LSIDs I was able to resolve were from the North Temperate Lakes [3]
authority. Both the PDB and NCBI authority URLs were not working (or I
couldn't get them to work with the Perl client).

_greg

[1] http://sw.nokia.com/uriqa/URIQA.html
[2] http://www.rddl.org/
[3] http://lsid.limnology.wisc.edu/

--
Greg Tyrelle

Received on Thursday, 29 April 2004 14:39:13 UTC