Re: [BioRDF] All about the LSID URI/URN

As background, it might help to know that one of the earliest requirements 
of the I3C's LSID group was that bio/pharma companies be able to copy 
large chunks of public databases internally so as to be able to access 
them repeatedly in private. They also wanted to be certain of exactly what 
version of the information they were using as it changed frequently. The 
LSID URN scheme includes the means to allow them to do both of these 
seamlessly. Scientists are able to use LSIDs to name or identify objects 
received via mail from a colleague, resolved over the web, retrieved from 
a copy archive or simply copied out of a file system, without ambiguity 
because the location of the object is unimportant and the naming syntax 
provides for versioning information. There can be no doubts, and no further 
complex namespace rules from the provider of that data need to be 
consulted. In short, a machine can easily be made to do it all. Perhaps it 
might help to think about LSIDs more in the sense of taking a couple of 
copies here and there than in the sense of web proxy caching. 
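
To make the naming syntax concrete, here is a minimal Python sketch that splits an LSID into its parts. It assumes the layout urn:lsid:authority:namespace:object with an optional trailing revision. The example identifier and the names LSID and parse_lsid are made up for illustration and do not come from any LSID library.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]  # None when the LSID carries no revision


def parse_lsid(text: str) -> LSID:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into parts."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    # The 'urn' and 'lsid' prefixes are compared case-insensitively.
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"LSID has an empty component: {text!r}")
    # Treat a missing or empty sixth component as "no revision".
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(authority, namespace, object_id, revision)


# Hypothetical identifier, used only to show the shape of the result.
example = parse_lsid("urn:lsid:example.org:proteins:P12345:2")
```

Nothing in the parsed result says where the object lives or how to fetch it; the name carries only who issued it, what it is and which revision is meant.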

Anyway, a couple of embedded comments are included below.

wrote on 07/07/2006 08:57:33 AM:

> > The root of the problem is that the URL 
> > contains in it more than just a name. It also contains the network 
> > location where the only copy of the named object can be found (this is 
> > hostname or ip address) 
> Which URL is that? It's not true of all URLs. Take, for example,
> That URL does not contain the network location where the only
> copy can be found; there are several copies on mirrors around the
> globe.
> $ host
> has address
> has address
> has address
> has address
> has address

Can you explain this a little further please Dan? If perhaps you mean that 
the W3C has mirrors and a DNS server that responds with the appropriate IP 
addresses depending on where you are coming from or which servers are 
available or have less load, I agree that the URL points to multiple 
copies of the object, but any single access path to that object is always 
determined by the URL issuing authority. I actually wrote the original 
code to do just this for IBM sports web sites in the mid-90s! I am sure, 
though, that you will appreciate that this is not at all the same thing as 
being able to actively source the named object from multiple places, where 
the client side chooses both the location and the means of access 
(protocol), and where this can still be done if the original issuing 
authority for the object goes away. From the client's point of view, with 
a URL the protocol and the location are fixed, and if either disappears the 
client has no way to ask anyone else for a copy. In my original post I had 
the second of these meanings in mind, as the first has obviously been in 
practice for over a decade now. Sorry for not being explicit.
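
The second meaning can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each source is a plain callable standing in for some location and protocol (an authority's web service, an ftp mirror, a local archive), and the names fetch_from_any, dead_authority and from_archive are hypothetical, as is the example identifier.

```python
from typing import Callable, Iterable, Optional

# A source takes an LSID and returns the bytes, or None if it has no copy.
Fetcher = Callable[[str], Optional[bytes]]


def fetch_from_any(lsid: str, sources: Iterable[Fetcher]) -> bytes:
    """Try each source in the client's preferred order; return the first copy."""
    for source in sources:
        try:
            data = source(lsid)
        except OSError:
            continue  # this location or protocol is unavailable; try the next
        if data is not None:
            return data
    raise LookupError(f"no source could supply {lsid}")


EXAMPLE_LSID = "urn:lsid:example.org:proteins:P12345:2"  # hypothetical
local_archive = {EXAMPLE_LSID: b"example bytes"}


def dead_authority(lsid: str) -> Optional[bytes]:
    # Stands in for an issuing authority that has gone away.
    raise ConnectionError("authority is unreachable")


def from_archive(lsid: str) -> Optional[bytes]:
    # Stands in for a private copy taken earlier.
    return local_archive.get(lsid)


data = fetch_from_any(EXAMPLE_LSID, [dead_authority, from_archive])
```

The client, not the issuing authority, decides the order of the sources, and the lookup still succeeds after the authority has disappeared because the name itself does not mention any of them.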

> FYI, the TAG is working on a finding on URNs, Namespaces, and 
> the current draft has a brief treatment of this issue of location 
> (in)dependence...
> > as well as the only means by which one may 
> > retrieve it (the protocol, usually http, https or ftp). The first 
> > question to ask yourself here is that when you are uniquely naming (in 
> > all of space and time!) a file/digital object which will be usefully 
> > copied far and wide, does it make sense to include as an integral part 
> > of that name the only protocol by which it can ever be accessed and the 
> > only place where one can find that copy?
> If a better protocol comes along, odds are good that it will be usable
> with names starting with http: 

I am not sure I understand how this can be possible. Sure, for evolved HTTP 
perhaps, but for protocols that have not yet been conceived I am not so 
sure.

> See section 2.3 Protocol Independence

Hmm, I am not sure I can buy the argument at the above link yet. Is this 
even an argument? It seems to say that because myURIs always map to http 
anyway, it is the same as if it were http, so why bother?

The main difference as far as I can see is that the mapping provides a 
level of indirection. This seems quite a significant difference and may be 
the point of having a myURI in the first place. The intention no doubt 
being to leave room for other protocols as they emerge and not tie a name 
to a single one as well as provide flexibility for actual deployment.  In 
my experience indirection is the great friend of people doing actual 
deployments. Also in this case protocol includes not just the technical 
TCP socket connection and GET headers etc, but also has to include the 
issues surrounding domain ownership too which are part of the resolution 
process.  While we may be reasonably certain about the technical issues, 
the uncertainties of tying one's long-term identifier to a hostname (even a 
virtual one like ) are considerable and in the face of this a 
layer of indirection begins to look quite prudent. 

Also note that this is not just about pure naming, since retrieval is 
explicitly intended for both data and metadata from multiple sources. 
LSIDs are already mapped to multiple protocols (which would not be 
possible if you did not have indirection). This certainly includes http 
URLs, but also ftp and file:// URLs for wide area file systems, as well as 
SOAP (which itself supports multiple transport protocols).  The LSID spec 
explicitly allows for the client to accumulate metadata from multiple 
metadata stores using multiple protocols without duplication using just 
the single URN.
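
As a rough illustration of accumulating metadata without duplication, the Python sketch below merges statements about one LSID gathered from several stores. It assumes the metadata can be treated as (subject, predicate, object) statements; the store contents, the predicate names and the function name accumulate_metadata are all hypothetical.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

LSID_NAME = "urn:lsid:example.org:proteins:P12345:2"  # hypothetical


def accumulate_metadata(stores: Iterable[Iterable[Triple]]) -> Set[Triple]:
    """Union the statements from every store; identical statements collapse."""
    merged: Set[Triple] = set()
    for store in stores:
        merged.update(store)
    return merged


# Two made-up stores, imagined as reached over different protocols.
store_over_http = [
    (LSID_NAME, "organism", "Homo sapiens"),
    (LSID_NAME, "length", "393"),
]
store_over_ftp = [
    (LSID_NAME, "length", "393"),  # same statement as above
    (LSID_NAME, "curatedBy", "example curator"),
]

merged = accumulate_metadata([store_over_http, store_over_ftp])
```

Because every store keys its statements on the same URN, the merge needs no per-source mapping of names; the repeated statement simply collapses into one.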
> > Unfortunately when it 
> > comes to URL's there is no way to know that what is served one day 
> > will be served out the next simply by looking at the URL string. There 
> > is no social convention or technical contract to support the behavior 
> > that would be required.
> Again, that's not true for all URLs. There are social and technical
> means to establish that
> can be cached for a long time.

Yes, but which URLs? My original post went on to say:

`One type of URL response may be happily cached, perhaps for ever, the 
other type probably should not, but to a machine program the URL looks the 
same and without recourse to an error prone set of heuristics it is 
extremely difficult perhaps impossible to programmatically tell the 
difference. Given that what we are designing is meant to be a machine 
readable web, it is vital to know which URLs behave the way we want and 
which do not. Can one programmatically tell the difference without 
actually accessing them? Can one programmatically tell the difference even 
after accessing them?`

Perhaps I should have written that `it is vital to know which URIs behave 
the way we want and which do not`. You fairly responded that HTTP has an 
expires header and that the social conventions around how to split up a 
URL [and what meaning the various parts of the substrings have and what 
this implies for their shelf lives or versioning] can be written and 
published - perhaps even in a machine readable form.  But for automation 
one would need to dereference even this at some point and load it into the 
machine's understanding stack somehow. For the time being one would need to 
program in the different heuristics for every individual data source.  A 
long row to hoe I think, and one that would likely defeat the objective 
of having a simple stack that can fetch the data given an identifier. We 
would be kidding ourselves if we did not acknowledge serious adoption 
problems.  Note that I was optimistic on the `cached, perhaps for ever` 
statement, because as you note the HTTP Expires header only supports 
caching for a year. Does this mean that the object named could change 
after a year? (Who knows.) This would be a problem for this community for 
both scientific and 
legal reasons. Of course this is one area that has some reasonably easy 

Given both the mixed up social contract baggage surrounding HTTP URLs (is 
it an RPC transport or a permanent [!?] document name, how does one 
version it, and who owns the hostname now) as well as a number of 
technical shortcomings given the community's requirements, it is not hard 
to see how the idea of a new URN with its own specialized technical and 
social contracts provided a fresh start and yet still mapped down onto 
existing internet infrastructure.

Kindest regards, Sean

Sean Martin
IBM Corp

Received on Friday, 7 July 2006 16:42:34 UTC