
Re: Fw: Use of LSIDs in RDF (fwd)

From: Sean Martin <sjmm@us.ibm.com>
Date: Thu, 29 Apr 2004 14:38:30 -0400
To: Greg Tyrelle <greg@tyrelle.net>
Cc: public-semweb-lifesci@w3.org
Message-ID: <OF17605AC4.AADF6173-ON85256E85.00412BD7-85256E85.00666777@us.ibm.com>
Hi Greg,

GT>I am not aware of a way to programmatically identify a persistent HTTP
GT>URI. Making URIs persistent is largely a function of who is responsible
GT>for maintaining that URI's authority.

and

GT> This "contract" is a social contract, persistence based on a social
GT> contract can also be true of HTTP URIs.

That is one of the main difficulties with the HTTP URI used on its own for 
our purpose - that it has to mix up unique naming with the current 
location & access [and access to only a single copy of the object]. As if 
this were not problem enough, it has historically provided access both to 
unique objects and to objects that change. Given any random URL, you have 
no idea how you can actually treat it in your program - there is just no 
way to tell whether it is actually a name for something, rather than just 
the network location of an object or concept that perhaps has a dynamic 
expression. There is also no way to determine which particular "social 
contract" applies to a particular HTTP URI. This seems to me to be partly 
because URLs have been used (abused?) in so many different ways in the 
past, and partly because they try to do at least two or three things at 
once. I am not sure how one could either undo this past or provide the 
extra facilities required without introducing something new. 

In contrast, any LSID starts out unambiguously with both social and 
technical contracts that provide certainty both about what sort of thing 
you are getting and about the multiple ways in which you might get it. You 
can actually code a program around its use and recognize it for what it is 
- a unique name for something - at first sight, without accessing it. In 
most cases it maps down to HTTP for actual access, but this does not 
always need to be the case. It is important that, first and foremost, the 
LSID is unambiguously a unique name for something that might have many 
copies stored around the network. I believe this is why a URN spec. was 
chosen; resolution was secondary. The technical contract of the LSID 
primarily gives you a standard method to ask where exact copies of this 
particular object can be obtained (and there could be many places, 
including some local to your organization), and secondly a standard 
method to enquire widely (at the original authority, at other trusted 
authorities, at your organization, or at an organization you collaborate 
with) about the places where information about this object and its 
relationships to other objects can be retrieved. 
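As a toy illustration of that "recognize it at first sight" property (this
is not part of any official LSID toolkit; the field layout simply follows
the urn:lsid:<authority>:<namespace>:<object>[:<revision>] form seen in the
examples later in this message):

```python
# Toy sketch: recognizing and decomposing an LSID without dereferencing it.
# Follows the urn:lsid:<authority>:<namespace>:<object>[:<revision>] layout.

def parse_lsid(uri: str):
    """Return the LSID components, or None if uri is not an LSID."""
    parts = uri.split(":")
    if len(parts) < 5 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        return None  # just a URL or some other URI -- no naming contract implied
    authority, namespace, obj = parts[2], parts[3], parts[4]
    revision = parts[5] if len(parts) > 5 else None
    return {"authority": authority, "namespace": namespace,
            "object": obj, "revision": revision}

print(parse_lsid("urn:lsid:ncbi.org:pubmed:12225585:1"))
# A plain URL gives a program no such signal:
print(parse_lsid("http://www.biomedcentral.com/pubmed/12225585"))  # None
```

The point is that the syntax alone tells a program "this is a name", before
any network access happens; a URL conveys no such contract.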

If there ever was any reason at all to create a URN, I believe that 
uniquely identifying life science information is a reasonable one. The 
earliest web standards define how URNs should be named, and the RFCs also 
provide guidance on creating methods for dereferencing them. There are 
also a whole bunch of more recent standards, like SOAP and WSDL, that seem 
to have gained wide acceptance. Why not use them? The use cases fit. The 
fact that this LS URN and its specification are backward compatible with 
the web, using its name resolution and access protocols, as well as with a 
future semantic web, seems all the better to me. The alternative is to 
attempt to shoehorn the problem into a protocol that was not designed to 
meet those needs.

GT>People are using URLs (HTTP URIs for naming), for example:
GT>http://www.biomedcentral.com/pubmed/12225585

Is this really a name for something or just a convenient link to 
something? In the context that you give it, it appears to me to be more 
like a link. NCBI has an entirely different name (or names) for this 
thing. A third party providing a similar convenient link would have 
created a third name. If all places had used the LSID 
(urn:lsid:ncbi.org:pubmed:12225585:1) we (and our software) would know 
they were all talking about the same thing without having to do a thing. 
Now if we actually want a copy of it, we dereference it to fetch one via 
any of those three HTTP URIs. Similarly, if we want to know more about 
it, we ask places that may have metadata for it. It seems to me that the 
LSID getAvailable method might be updated to make use of URIQA protocol 
URLs as one possible way of implementing the getMetadata method 
(URIQA-style HTTP URI links could be provided as the port types in 
the returned WSDL).
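A minimal sketch of what that buys software. The location table below is
entirely hypothetical - a real client would obtain the locations from the
authority's getAvailable/getData response, and the mirror URL is made up -
but it shows how one name fronts many interchangeable copies, and how the
"copies are byte-identical" contract can be checked:

```python
import hashlib

# Hypothetical sketch: one LSID naming several interchangeable copies.
# A real LSID client would learn these locations from the resolution
# service (getAvailable); here they are hard-coded for illustration.
COPIES = {
    "urn:lsid:ncbi.org:pubmed:12225585:1": [
        "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=12225585",
        "http://www.biomedcentral.com/pubmed/12225585",
        "http://mirror.example.org/pubmed/12225585",  # hypothetical local mirror
    ],
}

def locations_for(lsid: str):
    """All places an exact copy may be fetched from; any one will do."""
    return COPIES.get(lsid, [])

def verified_identical(copy_a: bytes, copy_b: bytes) -> bool:
    """The contract says copies under one LSID are byte-identical; check it."""
    return hashlib.sha256(copy_a).digest() == hashlib.sha256(copy_b).digest()
```

With plain URLs the three locations would be three unrelated names; here
they are explicitly three ways to fetch the one named thing.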

GT> However HTTP has the 3XX status codes to provide redirection etc.

Which HTTP URI is the unique name now? The original or the new location? 
Furthermore, how can I tell these two names are equivalent and reference 
the same object, especially as some folks discover and link to the newer 
name?
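A toy model of that ambiguity, with a made-up redirect table standing in
for real Location: headers (the hosts are illustrative):

```python
# Toy model of following HTTP 3XX redirects. Nothing in the chain states
# that the first and last URIs name the same object -- a client only
# learns the equivalence by walking the redirects itself, and anyone who
# links directly to the new location has effectively minted another name.
REDIRECTS = {
    "http://a.example/pubmed/12225585": "http://b.example/record?id=12225585",
}

def follow(uri: str, limit: int = 10):
    """Chase redirects, returning every URI seen along the way."""
    chain = [uri]
    while uri in REDIRECTS and len(chain) <= limit:
        uri = REDIRECTS[uri]
        chain.append(uri)
    return chain
```

After `follow()` the client holds two URIs for one object with no
machine-readable assertion that they are equivalent.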

GT>One aspect of this that bothers me though, is
GT>partitioning of the semantic web into domains based on their metadata
GT>access interfaces. Access to metadata based on URIs alone only makes
GT>sense to me if the mechanism to get the metadata is general for the web.

Good point. The metadata access mechanisms for LSIDs are mapped down onto 
exactly the same HTTP URIs everyone uses today. No point in reinventing 
the wheel.

GT>It is true that only the widespread adoption of LSID will make it
GT>useful to the semantic web. I am guessing that by default (laziness?)
GT>HTTP URIs will be used as resource identifiers if an LSID is not
GT>*easily* usable i.e. tools, tools, tools...

Yup :-) but don't forget that LS information on the web today is 
generally not part of any kind of semantic web right now - it is just the 
plain old web. Many people in the industry perceive the need for the LSID 
but have no particular interest in the semantic web (yet!). However, if 
adopting the LSID means they now become part of a semantic web [because 
that's what the LSID spec. says to do], the semantic web folks might 
benefit. That is why it is important we get this right.

PDB is definitely down at the moment. We are working with them to bring 
the authority up again there with both the latest code (their previous 
version was using an out-of-date spec.) as well as extensive metadata - 
previously it was a bare-bones amount. Unfortunately they are snowed under 
right now with a major release of new code for their web site and 
database. I am not sure why the NCBI authority is down for you. For 
examples of NCBI data, try LSIDs like 

urn:lsid:ncbi.nlm.nih.gov.lsid.i3c.org:genbank_gi:30350027
urn:lsid:ncbi.nlm.nih.gov.lsid.i3c.org:pubmed:12225585
urn:lsid:ncbi.nlm.nih.gov.lsid.i3c.org:genbank:bm872070

Omim
urn:lsid:ncbi.nlm.nih.gov.lsid.i3c.org:omim:605956   (omimuser/omimpass)


Kindest regards, Sean

--
Sean Martin
IBM Corp.
 




Greg Tyrelle <greg@tyrelle.net> 
04/29/2004 02:53 AM

To
Sean Martin/Cambridge/IBM@IBMUS
cc
public-semweb-lifesci@w3.org
Subject
Re: Fw: Use of LSIDs in RDF (fwd)






*** Sean Martin wrote:
|BG>This leads me to a question about "persistent" URI's and URL's
|BG>(PURL's): How do you ensure that two URI's are pointing at the same
|BG>object (bytes)?
|
|My question is how does one programmatically identify a persistent HTTP
|URI, as opposed to one that will retrieve tomorrow's weather or perhaps
|retrieve a file from a P2P network or one that returns dynamically
|changing content? Apologies in advance if there is an obvious answer to
|this question.

I am not aware of a way to programmatically identify a persistent HTTP
URI. Making URIs persistent is largely a function of who is responsible
for maintaining that URI's authority.

If I understand correctly, the question you are asking is "tell me
something about the resource identified by this URI?". There are a
number of approaches to this. In the case of LSID this would be the
getMetadata interfaces. For HTTP URIs my current favourite is URIQA
(the MGET HTTP method extension, i.e. metadata get) [1]. RDDL [2] is
intended for this purpose, but mainly for namespaces.
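A rough sketch of what a URIQA-style request looks like on the wire,
assuming only the MGET method the URIQA note describes (the host and path
here are illustrative, not real endpoints):

```python
# Sketch of a URIQA-style metadata request: the same URI a browser would
# GET, but with the MGET method asking the server for a metadata
# description of the resource instead of the resource itself. Only the
# MGET method comes from the URIQA proposal; host/path are made up.
def build_mget(host: str, path: str) -> str:
    """Build a raw URIQA MGET request line plus minimal headers."""
    return (f"MGET {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            f"Connection: close\r\n"
            f"\r\n")

request = build_mget("www.example.org", "/pubmed/12225585")
```

The appeal is that no separate metadata URL has to be minted or
discovered: the resource's own URI doubles as the metadata query target.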

|HTTP URIs as probably the primary method of retrieval of the data object
|or metadata about that object - after all, much of the public LS data is
|actually out there on the web already, retrievable by HTTP URI. If HTTP
|URIs were sufficient today, we would not have need of the LSID. So
|perhaps the question you should ask yourself is why people are not
|already widely using URLs for LS naming?

People are using URLs (HTTP URIs for naming), for example:

http://www.biomedcentral.com/pubmed/12225585


is a 302 redirect to the NCBI URL

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=12225585&dopt=Abstract&holding=f1000


Again, I believe it is how HTTP URIs are used or managed which is the
problem, not that they are broken or insufficient technology for the
purpose of naming.

|For me the main points are:
|Location independence of the object named - the extra layer of indirection
|makes this flexibility possible - there is a starting assumption that
|users will make/exchange local copies of the objects and also that
|authority entities will at some point want to transfer the authority over
|a LSID to another authority entity - while potentially maintaining control
|of their domain name; sometimes the same data is served from more than one
|"official" place on the web (e.g. Swiss-Prot - Marja, how does Annotea deal
|with this situation?); having the option of not using domain names in the
|identifier at all;

Good points. However, HTTP has the 3XX status codes to provide
redirection etc. Why invent a new protocol when these already exist?

|Providing/using LSID's for one's data establishes a "contract" in which
|certain properties can be assumed (beyond those of the HTTP URI
|"contract") of an LSID named object:
|defines what can safely be assumed about multiple copies of objects which
|have the same LSID name - i.e. that they are identical; clear definition
|of what persistence means [both availability and never modifying a named
|object];

This "contract" is a social contract; persistence based on a social
contract can also be true of HTTP URIs.

|a formal mechanism for retrieving data [never ever changes] over multiple
|protocols and discovering and retrieving metadata [which can change]
|about that object and its relationship to other objects [from the original
|source of the object or from a third-party who has something to add of
|their own] all using a single globally unique name.

I think the selling point of LSID (for me) is a standard interface for
life sciences metadata. One aspect of this that bothers me, though, is
the partitioning of the semantic web into domains based on their metadata
access interfaces. Access to metadata based on URIs alone only makes
sense to me if the mechanism to get the metadata is general for the web.

|One parting thought.. widespread adoption of LSID spec. across the
|industry will at the same time create a very large semantic web.

It is true that only the widespread adoption of LSID will make it
useful to the semantic web. I am guessing that by default (laziness?)
HTTP URIs will be used as resource identifiers if an LSID is not
*easily* usable i.e. tools, tools, tools...

In my limited testing of the perl LSID client implementation, the only
LSIDs I was able to resolve were from the North Temperate Lakes [3]
authority. Both the PDB and NCBI authority URLs were not working (or I
couldn't get them to work with the perl client).

_greg

[1] http://sw.nokia.com/uriqa/URIQA.html
[2] http://www.rddl.org/
[3] http://lsid.limnology.wisc.edu/

--
Greg Tyrelle
Received on Thursday, 29 April 2004 14:39:13 UTC

This archive was generated by hypermail 2.4.0 : Friday, 17 January 2020 17:20:07 UTC