[BioRDF] All about the LSID URI/URN

Hello All,
On last week's BioRDF call, Eric Neumann asked me to post information here 
to help everyone follow the discussion of LSIDs (Life Science Identifiers) 
as URIs. So here goes. 

First, let me point you to two articles I co-authored, with the 
suggestion that you read them in the order listed. 

Clark T, Martin S, Liefeld T. [1]
Globally distributed object identification for biological knowledgebases.
Brief Bioinform. 2004 Mar;5(1):59-70. 
PMID: 15153306

Martin S, Hohman MM, Liefeld T. [2]
The impact of Life Science Identifier on informatics data.
Drug Discov Today. 2005 Nov 15;10(22):1566-72. 
PMID: 16257380

Together these provide nearly the entire LSID story to date. They include 
the motivation for the creation of the LSID standard; an explanation of 
the syntax; a description of how the underlying protocol actually works; 
and a discussion of how LSID naming can be retroactively applied, without 
much difficulty, to information sources already online as well as to new 
ones. The second article talks about early adopters of the specification 
and what they are actually doing with LSIDs. It tackles some of the common 
misconceptions and concerns, and concludes with a list of problems and 
some suggestions for improvements to the current specification. 

For those that enjoy such activities, the full specification of the Life 
Science Identifier is publicly available for your reading pleasure from 
the Object Management Group[3], but if you have read the above two 
articles, you are already well enough armed to enter the debate. As I 
mentioned in an earlier posting, there is quite a useful description in 
Section 13.3, Page 26 (page 32 in the pdf file) of the spec that walks 
through, in a relatively human-readable step-by-step example, how the 
resolution protocol actually works to decouple the LSID name from the 
network location of the digital object named.

Next, assuming you have by now read the two articles listed above, let me 
try to add a little information which they do not cover around the issues 
concerning URLs as Life Science URIs that directly led to the creation of 
the LSID URN. 

It is certainly true that the DNS system and the semantics of file system 
paths ensure that by using a URL as a URI you get an easy means to produce 
a globally unique name. The problems begin if you either want to do more 
than create just a name in the "ether" and actually use the URL to 
uniquely name existing binary data objects, or if you are tempted to do 
what is natural with a URL, which is to put it into a web browser and 
dereference it to something you or your program can look at. 

Obviously a URL can do some measure of both of these things and at first 
glance it might seem that they are perhaps even the same thing. But this 
is where it starts to get tricky. The root of the problem is that the URL 
contains in it more than just a name. It also contains the network 
location where the only copy of the named object can be found (this is the 
hostname or ip address) as well as the only means by which one may 
retrieve it (the protocol, usually http, https or ftp). The first question 
to ask yourself here is that when you are uniquely naming (in all of space 
and time!) a file/digital object which will be usefully copied far and 
wide, does it make sense to include as an integral part of that name the 
only protocol by which it can ever be accessed and the only place where 
one can find that copy? Furthermore, does it make sense to use as part of 
the name a DNS hostname which may easily be transferred to a new owner if 
the underlying DNS domain name changes hands? In a system where the 
resolution of the URI to a copy of the object it names has no layers of 
indirection, one becomes entirely reliant on the issuer of that name or 
their successors in interest to continue to provide service for it. One 
has only to have observed the web for a few years to understand how 
brittle it is, both in terms of objects moving or being taken offline (the 
dreaded 404 HTTP error) and of a domain name being passed to a new concern 
that has a different set of objectives. Schemes like PURL[4] exist to 
combat both these problems to some extent. For a general discussion of how 
successful this is I refer you to the DOI Handbook, section 3.10 [5], which 
details a number of arguments comparing the DOI scheme to PURL, many of 
which would equally apply to the LSID scheme. 
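
To make the point about what a URL carries concrete, here is a minimal 
Python sketch that splits a URL into the parts discussed above. The URL 
itself is hypothetical and used only for illustration; the helper name 
describe_url is my own. 

```python
from urllib.parse import urlsplit


def describe_url(url: str) -> dict:
    """Split a URL into the three things it bundles into one 'name'."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,    # the one access method the name admits
        "location": parts.hostname,  # the one host expected to serve the object
        "path": parts.path,          # the part that is local to that host
    }


# Hypothetical URL, not a real data source.
example = describe_url("http://data.example.org/images/gel-0042.png")
print(example)
```

Only the "path" entry is really a name; the other two are statements about 
where and how to fetch, and both can change while the object stays the same. 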

To add a little to those discussions and to make it more specific to Life 
Sciences please consider the following. One problem that the PURL scheme 
does not overcome is the requirement that it should be possible for the 
named digital object to be available from multiple locations that copy the 
original, possibly long after the original is no longer available from the 
original source. In fact the LSID scheme goes one further: it provides a 
standard method for getting not only a copy of the named object from 
multiple locations using multiple protocols, but also for retrieving and 
combining metadata from _multiple_ different sources, all keyed off the 
same URI. Another problem with PURLs is that they cannot be applied 
retroactively. You need to use PURL identifiers up front. This is a 
problem when you want to identify objects internally in private for a 
while, but then after a year or two would like to expose them externally 
without changing their names. Does it make sense to create an external 
redirection reference for every image your research produces? To be 
consistent you would then need to use and dereference that image via the 
PURL service every time you referred to it, unless you are happy to deal 
with the complexities of having both an internal and an external 
permanent name for the same object. This leads me to the 
potential issues of scale and reliability in the face of the 
extraordinarily large number of identifiers that will exist in the Life 
Sciences domain alone. Given the extremely successful, highly distributed 
nature of the WWW, when does it make sense to use a scheme which relies 
entirely on a single centralized redirection service both for registration 
& resolution? 

The notion that the name should be independent of the means used to get a 
copy of the object is also important. As more sophisticated transport 
protocols are introduced, they can easily be included as alternative or 
even primary protocols for access to the data. For example, a community 
might want to take advantage of the relatively recently introduced 
Bittorrent[6] scheme, a P2P protocol optimized for quickly sharing large 
binary objects across a network.
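
The idea of one name with several access points can be sketched in a few 
lines of Python. This is only an illustration of the decoupling, not the 
LSID resolution protocol as the specification defines it. The LSID, the 
locations and the stand-in fetchers below are all hypothetical, and nothing 
here touches the network. 

```python
from typing import Callable, Dict, List, Optional, Tuple

# An access point is a (protocol, location) pair; both are illustrative.
AccessPoint = Tuple[str, str]
Fetcher = Callable[[str], Optional[bytes]]


def retrieve(name: str,
             access_points: Dict[str, List[AccessPoint]],
             fetchers: Dict[str, Fetcher]) -> Optional[bytes]:
    """Return the first copy of `name` that any known access point supplies."""
    for protocol, location in access_points.get(name, []):
        fetch = fetchers.get(protocol)
        if fetch is None:
            continue  # no client for this protocol; try the next access point
        data = fetch(location)
        if data is not None:
            return data
    return None  # no access point could supply a copy


# Hypothetical name and locations, for illustration only.
name = "urn:lsid:example.org:images:gel-0042"
access_points = {
    name: [
        ("http", "http://old.example.org/gel-0042.png"),
        ("ftp", "ftp://mirror.example.net/pub/gel-0042.png"),
    ]
}
# Stand-in fetchers: the original source is "offline", the mirror answers.
fetchers = {
    "http": lambda location: None,
    "ftp": lambda location: b"image bytes from mirror",
}

copy = retrieve(name, access_points, fetchers)
print(copy)
```

Adding a new transport is then a matter of registering one more fetcher 
and one more access point; the name itself never changes. 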

Another serious concern regarding using URLs to name digital objects is 
the question of "what is actually named?" In the Life Science research 
process it is frequently necessary to reproduce the results of another 
group's experiments. To do this successfully one needs to be certain one is 
using exactly the same inputs used by the original experimenter. Similarly, 
when basing new research on the work of another it is important to know 
one is actually using the exact outputs of the earlier research. "In 
silico" experimentation requires absolute precision. Unfortunately, when it 
comes to URLs there is no way to know, simply by looking at the URL 
string, that what is served one day will be served the next. There is no 
social convention or technical contract to support the behavior that would 
be required. Indeed the URL concept has been so extremely successful 
precisely because it was allowed to conflate the original document access 
methods with remote procedure calls (RPCs) when the CGI interface [7] was 
first introduced in 1993. The introductions of XMLRPC and Web Services have 
cemented this confusion. One type of URL response may be happily cached, 
perhaps forever; the other type probably should not be. But to a machine 
program the URL looks the same, and without recourse to an error-prone set 
of heuristics it is extremely difficult, perhaps impossible, to 
programmatically tell the difference. Given that what we are designing is 
meant to be a machine readable web, it is vital to know which URLs behave 
the way we want and which do not. Can one programmatically tell the 
difference without actually accessing them? Can one programmatically tell 
the difference even after accessing them? A serious follow-up problem with 
using URLs as names is that they have no inherent versioning scheme which 
makes it hard for machines (and people!) to know when revisions are made 
to previously named data. 

The general URN [8] scheme was devised for naming resources and there is a 
process for registering new URN schemes. As you will know by now if you 
have read the first listed article, there are also standards in place for 
dereferencing URNs which avoid the issues related to URLs discussed above. 
The Life Science Identifier (LSID) specification came about because a 
consortium of interested parties in the Life Science domain chose to take 
advantage of these pre-existing standards and specifications to create an 
identifier that met their needs. URNs are URIs. 
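
For readers who have not yet looked at the syntax, here is a minimal Python 
sketch that splits an LSID into its parts. It assumes the layout 
urn:lsid:authority:namespace:object with an optional trailing revision, 
which is my reading of the specification [3]; the class and function names 
are my own, and a real implementation should follow the spec rather than 
this sketch. 

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]  # None when the LSID carries no revision


def parse_lsid(text: str) -> LSID:
    """Split an LSID string into its colon-separated components."""
    parts = text.split(":")
    prefix = [p.lower() for p in parts[:2]]
    if len(parts) not in (5, 6) or prefix != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"empty LSID component in: {text!r}")
    # A sixth component, if present and non-empty, is taken as the revision.
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(authority, namespace, object_id, revision)


# The LSID for one of the articles recommended at the start of this post.
article = parse_lsid(
    "urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:pubmed:15153306")
print(article)
```

Note that nothing in the parsed name says which protocol or which host to 
use to fetch the object. 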

As I mentioned at the end of the last conference call, a one-size-fits-all 
approach may be too limited, and it is my belief that we could well end up 
both needing and having to accommodate more types of standards-based URIs 
than we currently know about. Some will be URL based like Dublin Core [9], 
some will have a URL representation like DOIs and some will be URN schemes 
like the LSID. Each will arise to serve a particular set of needs for a 
particular community and will have its own social and technical 
"contracts" that make more tractable the problems of making the 
data/concepts named both machine accessible and readable. The more 
successful of these will have more software written to support their 
specific peculiarities and features; some will persist and some will be a 
passing phase.

Lee Feigenbaum recently pointed out to me that the job of Semantic Web 
groups like ours should be to both expect and embrace this diversity in 
our decisions and recommendations, as it will help the Semantic Web grow in 
size and usefulness. One thing that might be done to accommodate this 
proliferation of URI types is to work towards a common interface to them 
through URL gateways, and to reach consensus on a "least/lowest common 
denominator" set of properties that one should expect from these gateways. 
This would mean that important common tools like web/data-web browsers and 
distributed SPARQL query engines would work across as wide a set of base 
information as possible. Both the DOI [10] and LSID [11] schemes have such 
gateways. For example, here [12] is an LSID-based URL link to the RDF 
metadata of one of the articles I recommended at the start of this post, 
using an LSID web gateway: 
http://lsid.biopathways.org/resolver/urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:pubmed:15153306
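
The gateway link above is simply the gateway address [11] followed by the 
LSID. A minimal Python sketch of that construction follows; the function 
name is my own, and I am assuming other gateways would follow the same 
prefix-plus-LSID pattern, which each gateway is free not to do. 

```python
# Gateway address taken from reference [11] in this post.
GATEWAY = "http://lsid.biopathways.org/resolver/"


def gateway_url(lsid: str, gateway: str = GATEWAY) -> str:
    """Build a plain URL for an LSID by prefixing it with a gateway address."""
    if not lsid.lower().startswith("urn:lsid:"):
        raise ValueError(f"not an LSID: {lsid!r}")
    # Normalise the trailing slash so both forms of the gateway address work.
    return gateway.rstrip("/") + "/" + lsid


link = gateway_url(
    "urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:pubmed:15153306")
print(link)
```

The LSID stays the name; the gateway URL is just one convenient way of 
handing it to tools that only understand URLs. 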

No doubt I have left out much that I should have mentioned here, so I 
reserve the right to a follow-up post or two as I remember or people 
remind me about areas that require your consideration.


[1] Globally distributed object identification for biological 
knowledgebases.
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&list_uids=15153306&itool=pubmed_abstractplus&dopt=abstract&dr=abstract
[2] The impact of Life Science Identifier on informatics data. 
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=abstract&list_uids=16257380&query_hl=9&itool=pubmed_docsum
[3] http://www.omg.org/cgi-bin/doc?dtc/04-05-01
[4]  http://purl.oclc.org/
[5] http://www.doi.org/handbook_2000/resolution.html#3.10
[6] http://en.wikipedia.org/wiki/Bittorrent
[7] http://en.wikipedia.org/wiki/Common_Gateway_Interface
[8] http://en.wikipedia.org/wiki/Uniform_Resource_Name
[9] http://dublincore.org/schemas/rdfs/
[10] http://www.doi.org/doi_proxy/index.html
[11] http://lsid.biopathways.org/resolver/
[12] http://lsid.biopathways.org/resolver/urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:pubmed:15153306



Have a great (and for some long) weekend.
Kindest regards, Sean

--
Sean Martin
IBM Corp
Cambridge, MA

Received on Friday, 30 June 2006 14:19:27 UTC