Re: [BioRDF] All about the LSID URI/URN from Sean Martin on 2006-07-12 (public-semweb-lifesci@w3.org from July 2006)

From: Sean Martin <sjmm@us.ibm.com>
Date: Wed, 12 Jul 2006 09:16:26 -0400
To: public-semweb-lifesci@w3.org, Alan Ruttenberg <alanruttenberg@gmail.com>
Cc: Dan Connolly <connolly@w3.org>
Message-ID: <OF6821AE02.E735BE73-ON852571A9.0040FAF5-852571A9.0048F942@us.ibm.com>
Hello Alan,
The short answer is that only some parts of what the LSID scheme does 
could be done using the means you suggest. The reason for this is that 
what you outline is more or less part of the LSID resolution process under 
the covers. However, in the end it would not meet a number of the original 
requirements and would require new infrastructural mechanisms that 
somewhat defeat the purpose of sticking with http://. 

Let me respond with comments embedded below:

public-semweb-lifesci-request@w3.org wrote on 07/07/2006 04:52:01 PM:

> 
> Sean, couldn't what LSID achieves be done, for instance, by having a 
> convention that if someone dereferences, for example,
> 
> http://bla.com/path/to/document/foo.lsid
>

As you initially start with a URL, you obviously have the initial location 
and protocol dependency issues raised but not addressed in earlier posts. 
In summary it is my experience that when one names existing objects with 
long persistence that are intended for wide area distributions it is both 
prudent and practical to separate that name from the mechanism for 
resolution. 

Also because you use a URL you are forced to always dereference it to 
understand its current contract. One cannot programmatically tell the 
difference in contracts between http://bla.com/path/to/document/foo.lsid 
and http://www.cnn.com/index.html without dereferencing both of them and 
locally storing and then comparing details of their particular contracts. 
This means that one cannot safely assume that the name string 
http://www.cnn.com/index.html names the same thing today that 
http://www.cnn.com/index.html will name a day later, nor that the object 
someone sent me named http://bla.com/path/to/document/foo.lsid is the same as the 
object I can retrieve if I dereference 
http://bla.com/path/to/document/foo.lsid right now. 
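
To illustrate the point above, here is a minimal Python sketch of the bookkeeping a URL forces on a client. It assumes the client keeps a SHA-256 fingerprint of whatever bytes it originally received; the function names and the sample byte strings are hypothetical and purely illustrative.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 hex digest of the bytes that were retrieved."""
    return hashlib.sha256(content).hexdigest()


def still_same_object(stored_fingerprint: str, fetched_now: bytes) -> bool:
    """Compare a stored fingerprint with freshly dereferenced bytes.

    The URL string is identical in both cases, so the name alone says
    nothing; only the comparison of content can answer the question.
    """
    return stored_fingerprint == fingerprint(fetched_now)


# Hypothetical content retrieved from the same URL on two different days.
received_earlier = b"<html>headlines on Monday</html>"
fetched_later = b"<html>headlines on Tuesday</html>"

stored = fingerprint(received_earlier)
print(still_same_object(stored, received_earlier))
print(still_same_object(stored, fetched_later))
```

The extra state (the stored fingerprint) and the extra round trip (the fresh dereference) are exactly what a name with a fixed persistence contract would make unnecessary.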

Should the person who sent me an object also send me a copy of the 
persistence policy perhaps? How often does one go back in one's email and 
click on URL links that are now broken? How many binary attachments do you 
have in your email that you cannot figure out what their data source was 
without opening them up and doing some human level heuristics or perhaps 
doing a Google string match? These are the sorts of problems that the LSID 
addresses but of course not just for email.

> 
> it is understood to obey a protocol, namely to return a snippet of 
> rdf that says, here's a handle to my metadata, here's a handle to my 
> data, here's my machine readable persistence policy.  Or instead of 
> returning rdf, the link response mentioned in http://www.w3.org/2001/ 
> tag/doc/URNsAndRegistries-50.html could be used to point to the 
> auxillary information.

This is similar to the LSID scheme, except that LSID resolution uses a 
WSDL document to communicate the possible data and metadata service 
end-points. Since the LSID scheme only has one contract regarding 
persistence (the hard rule that the LSID may never be reused to name any 
other bytes), there is no need to pass persistence information. This means 
that the LSID string alone can be used to compare for equality between 
objects. For the caching of metadata (which can change over time) the LSID 
scheme defers to the transport mechanism over which that metadata was 
obtained for an indication of the length of time the metadata should be 
considered valid. This is one area where I believe the LSID standard 
should be improved so as to formally address both persistent and 
non-persistent metadata and is something the caBIG folks wanted. 
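
As a rough illustration of comparing by name alone, the following Python sketch parses the general LSID form urn:lsid:authority:namespace:object[:revision] and treats two names as naming the same bytes when their components match. The example.org identifiers, the Lsid tuple and the helper names are hypothetical. The sketch treats only the leading "urn:lsid" label as case-insensitive and compares everything else exactly; that is a simplifying assumption, not a statement of the normalization rules in the LSID specification.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(name: str) -> Lsid:
    """Split an LSID string into its components, or raise ValueError."""
    parts = name.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {name!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {name!r}")
    if any(part == "" for part in parts[2:5]):
        raise ValueError(f"empty LSID component in {name!r}")
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], revision)


def names_same_bytes(a: str, b: str) -> bool:
    """Decide equality from the names alone, with no dereference.

    This relies on the single persistence contract described above:
    an LSID is never reused to name any other bytes.
    """
    return parse_lsid(a) == parse_lsid(b)


print(names_same_bytes("urn:lsid:example.org:genes:12345:2",
                       "URN:LSID:example.org:genes:12345:2"))
print(names_same_bytes("urn:lsid:example.org:genes:12345:1",
                       "urn:lsid:example.org:genes:12345:2"))
```

No network access, stored policy or content comparison is involved; the decision is made entirely from the two strings.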

> 
> And if that persistence policy says that the data is immutable, then 
> you can comfortably store it, and use this URI for as a handle for 
> resolving, in the same way an LSID can be resolved by an http service
> 
> http://lsid.company.net/resolver/http://bla.com/path/to/document/ 
> foo.lsid
> 
> The resolved could pull back whatever information you have locally, 
> return source information, or redirect to the id, like a click through.

Note that this would require new proxy and browser client infrastructure 
that can understand how to interpret and act on these policies and this 
higher level protocol. The behavior on existing infrastructure would 
likely be broken as one could not just put one of these URIs into a 
browser or proxy server and have it do the right things. This weakens the 
argument that the reason we are so keen to only use http:// URLs as URIs 
is that all the deployed existing infrastructure makes adoption 
easier. The part of the web infrastructure that would just work today is 
also the exact same part of the infrastructure that the LSID resolution 
scheme uses.

> 
> This seems to satisfy the requirement that you can tell what sort of 
> thing it is from looking at it, as well as the desired ability to 

I am not sure what you mean by `looking` at it here. Do you mean without 
dereference or after inspection by dereference?  For LSIDs the contract is 
clear without dereference, but for URLs I cannot see how that can be true.
 
> cache and indirect.
> 
> More generally any social convention that we use can accomplish the 
> same thing - a provider could say (in a robots.txt-like file, or as a 
> published policy) that certain paths in its tree have this sort of 
> metadata available and should be treated like an lsid would.

Again this requires standards & infrastructure to interpret and apply the 
different contracts, particularly if this must be machine-readable. The 
more sophisticated and/or `wooly` it is, the less likely we are to see 
adoption. Each time one retrieves an object one would need to check (and 
perhaps store) the contract/policy too. One simply cannot compare the 
URI name strings alone to determine equality of the objects named. 

Finally, I would like to add an unrelated comment about one of the 
practical aspects of LSIDs that we find useful here. This is in the area 
of local/distributed/offline vs. online/centralized naming and access. 
Because the LSID named object is not tied to any particular place or 
protocol, objects can be created and accessed locally on one's own machine 
(perhaps offline) using exactly the same name that they will be accessed 
with when they are uploaded and made public to a wider group or to the 
internet as a whole. Software we write for locally creating or accessing 
LSID data can be the same as that for accessing LSIDs across the network 
and it makes no difference whether the LSIDs have been uploaded or not. 
One has none of the worries of maintaining relative link structures or 
hard coding and then having to recode URL absolute references or even 
finding one now has to use a new (longer) name once the object is uploaded 
to a distribution service end-point.
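
The local versus networked access pattern described above might be sketched in Python as follows. This is only an illustration under stated assumptions: the LsidStore class, its methods and the remote_fetch callable are hypothetical stand-ins, and the "network" here is just a dictionary lookup rather than a real LSID resolution client.

```python
from typing import Callable, Dict, Optional


class LsidStore:
    """Hypothetical store that answers by LSID, locally first."""

    def __init__(self, remote_fetch: Optional[Callable[[str], bytes]] = None):
        self._local: Dict[str, bytes] = {}
        self._remote_fetch = remote_fetch  # None means we are offline

    def create(self, lsid: str, data: bytes) -> None:
        """Name some bytes locally; the same name is used after upload."""
        existing = self._local.get(lsid)
        if existing is not None and existing != data:
            raise ValueError("an LSID may never be reused to name other bytes")
        self._local[lsid] = data

    def get(self, lsid: str) -> bytes:
        """Return the named bytes from local storage or, failing that, remotely."""
        if lsid in self._local:
            return self._local[lsid]
        if self._remote_fetch is None:
            raise KeyError(lsid)
        data = self._remote_fetch(lsid)
        self._local[lsid] = data  # caching is safe: the named bytes never change
        return data


name = "urn:lsid:example.org:results:run42"  # hypothetical identifier

# Offline: create and read the object on one's own machine.
offline = LsidStore()
offline.create(name, b"local experiment output")

# Later, another machine reaches the uploaded copy under the same name.
uploaded = {name: offline.get(name)}
online = LsidStore(remote_fetch=uploaded.__getitem__)
print(online.get(name) == offline.get(name))
```

The calling code is the same in both cases; only the presence or absence of a remote fetcher differs, and the name never changes.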

Kindest regards, Sean

PS, I was amused to recently realize the irony of you playing the 
(extremely useful) part of devil's advocate on this topic. I don't know if 
you realize it, but my understanding is that the original LSID was based 
on work at Millennium. ;-)


--

Sean Martin
IBM Corp.
Received on Wednesday, 12 July 2006 13:16:50 UTC