Re: [BioRDF] All about the LSID URI/URN from Sean Martin on 2006-07-07 (public-semweb-lifesci@w3.org from July 2006)

From: Sean Martin <sjmm@us.ibm.com>
Date: Fri, 7 Jul 2006 12:42:13 -0400
To: public-semweb-lifesci@w3.org
Cc: Dan Connolly <connolly@w3.org>
Message-ID: <OFE412C5EA.A1882AE6-ON852571A4.0048054A-852571A4.005BAE4F@us.ibm.com>
As background, it might help to know that one of the earliest requirements 
of the I3C's LSID group was that bio/pharma companies be able to copy 
large chunks of public databases internally so as to be able to access 
them repeatedly in private. They also wanted to be certain of exactly what 
version of the information they were using as it changed frequently. The 
LSID URN scheme includes the means to allow them to do both of these 
seamlessly. Scientists are able to use LSIDs to name or identify objects 
received via mail from a colleague, resolved over the web, retrieved from 
a copy archive or simply copied out of a file system, without ambiguity 
because the location of the object is unimportant and the naming syntax 
provides for versioning information. There can be no doubt, and no further 
complex namespace rules from the provider of that data need to be 
consulted. In short, a machine can easily be made to do it all. It 
might help to think of LSIDs in the sense of taking a couple of copies 
here and there, rather than in the web proxy caching sense. 
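
To make the point about the naming syntax concrete, here is a minimal 
sketch in Python of how a program might pull the version out of an LSID by 
looking at the name alone. It assumes the general form 
urn:lsid:authority:namespace:object[:revision]; the function name, the 
class name and the example identifier are made up for illustration and are 
not taken from any LSID toolkit.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]  # None when the name carries no revision


def parse_lsid(urn: str) -> LSID:
    """Split a name of the assumed form urn:lsid:authority:namespace:object[:revision]."""
    parts = urn.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {urn!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {urn!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"empty component in LSID: {urn!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(authority, namespace, object_id, revision)


# Hypothetical identifier, used only to show the shape of the name.
example = parse_lsid("urn:lsid:example.org:proteins:P12345:2")
```

Nothing here touches the network: the authority, the object and its 
revision are all recovered from the string itself, wherever the copy came 
from.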


Anyway, a couple of embedded comments are included below:

public-semweb-lifesci-request@w3.org wrote on 07/07/2006 08:57:33 AM:

> 
> 
> http://lists.w3.org/Archives/Public/public-semweb-lifesci/2006Jun/0210.html
> 
> > The root of the problem is that the URL 
> > contains in it more than just a name. It also contains the network 
> > location where the only copy of the named object can be found (this is the
> > hostname or ip address) 
> 
> Which URL is that? It's not true of all URLs. Take, for example,
>   http://www.w3.org/TR/2006/WD-wsdl20-rdf-20060518/
> 
> That URL does not contain the network location where the only
> copy can be found; there are several copies on mirrors around the
> globe.
> 
> $ host www.w3.org
> www.w3.org has address 128.30.52.46
> www.w3.org has address 193.51.208.69
> www.w3.org has address 193.51.208.70
> www.w3.org has address 128.30.52.31
> www.w3.org has address 128.30.52.45
>

Can you explain this a little further please, Dan? If you mean that 
the W3C has mirrors and a DNS server that responds with the appropriate IP 
addresses depending on where you are coming from, or on which servers are 
available or less loaded, then I agree that the URL points to multiple 
copies of the object. But any single access path to that object is always 
determined by the URL issuing authority. I actually wrote the original 
code to do just this for IBM sports web sites in the mid-90s! I am sure, 
though, that you will appreciate that this is not at all the same thing as 
being able to actively source the named object from multiple places, where 
the client side chooses both the location and the means of access 
(protocol), and where this can still be done if the original issuing 
authority for the object goes away. From the client's point of view, with 
a URL the protocol and the location are fixed, and if either disappears the 
client has no way to ask anyone else for a copy. In my original post I 
had the second of these meanings in mind, as the first has obviously been 
in practice for over a decade now. Sorry for not being explicit 
earlier.

> 
> 
> FYI, the TAG is working on a finding on URNs, Namespaces, and Registries;
> the current draft has a brief treatment of this issue of location 
> (in)dependence...
> http://www.w3.org/2001/tag/doc/URNsAndRegistries-50.html#loc_independent
> 
> 
> > as well as the only means by which one may 
> > retrieve it (the protocol, usually http, https or ftp). The first question
> > to ask yourself here is that when you are uniquely naming (in all of space
> > and time!) a file/digital object which will be usefully copied far and
> > wide, does it make sense to include as an integral part of that name the
> > only protocol by which it can ever be accessed and the only place where
> > one can find that copy?
> 
> If a better protocol comes along, odds are good that it will be usable
> with names starting with http: 

I am not sure I understand how this can be possible. Sure, for an evolved 
HTTP perhaps, but for protocols that have not yet been conceived I am not 
so sanguine.

> 
> See section 2.3 Protocol Independence
> 
> http://www.w3.org/2001/tag/doc/URNsAndRegistries-50.html#protocol_independent
> 

Hmm, I am not sure I can buy the argument at the above link yet. Is this 
even an argument? It seems to say that because myURIs always map to http 
anyway, it is the same as if they were http, so why bother?

The main difference as far as I can see is that the mapping provides a 
level of indirection. This seems quite a significant difference and may be 
the point of having a myURI in the first place. The intention, no doubt, 
is to leave room for other protocols as they emerge, rather than tie a name 
to a single one, as well as to provide flexibility for actual deployment. 
In my experience indirection is the great friend of people doing actual 
deployments. Also, in this case "protocol" covers not just the technical 
details (TCP socket connection, GET headers, etc.) but also the 
issues surrounding domain ownership, which are part of the resolution 
process. While we may be reasonably certain about the technical issues, 
the uncertainties of tying one's long-term identifier to a hostname (even a 
virtual one like www.w3.org) are considerable, and in the face of this a 
layer of indirection begins to look quite prudent. 

Also note that this is not just about pure naming, since retrieval is 
explicitly intended for both data and metadata from multiple sources. 
LSIDs are already mapped to multiple protocols (which would not be 
possible without indirection). This certainly includes http 
URLs, but also ftp and file:// URLs for wide area file systems, as well as 
SOAP (which itself supports multiple transport protocols). The LSID spec 
explicitly allows the client to accumulate metadata from multiple 
metadata stores using multiple protocols, without duplication, using just 
the single URN.
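
As a rough illustration of what the indirection buys the client, here is 
a small Python sketch in which the client, not the issuing authority, 
picks the protocol and the location. It is not the resolution mechanism 
defined by the LSID specification; the resolve() function, the endpoint 
table and the fetcher callables are all hypothetical and exist only to 
show the shape of the idea.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# (protocol, location) pairs known for a name; values are illustrative only.
Endpoint = Tuple[str, str]
# A fetcher returns the bytes of the object, or None if that copy is unavailable.
Fetcher = Callable[[str], Optional[bytes]]


def resolve(lsid: str,
            endpoints: Dict[str, List[Endpoint]],
            fetchers: Dict[str, Fetcher],
            preferred: Iterable[str]) -> Optional[bytes]:
    """Try the client's preferred protocols in order; return the first copy found."""
    candidates = endpoints.get(lsid, [])
    for protocol in preferred:
        fetch = fetchers.get(protocol)
        if fetch is None:
            continue  # the client has no implementation of this protocol
        for proto, location in candidates:
            if proto != protocol:
                continue
            data = fetch(location)
            if data is not None:
                return data
    return None


# Hypothetical setup: the original http source has gone away, but a local
# file copy of the same named object still exists.
name = "urn:lsid:example.org:proteins:P12345:2"
table = {name: [("http", "http://example.org/P12345"),
                ("file", "/archive/P12345.v2")]}
fetchers = {"http": lambda location: None,
            "file": lambda location: b"copy from local archive"}
result = resolve(name, table, fetchers, preferred=["http", "file"])
```

The name stays the same whichever copy answers; only the table of 
endpoints and the client's preference order change over time.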
> 
> > Unfortunately when it 
> > comes to URL's there is no way to know that what is served one day will be
> > served out the next simply by looking at the URL string. There is no 
> > social convention or technical contract to support the behavior that would
> > be required.
> 
> Again, that's not true for all URLs. There are social and technical
> means to establish that
> 
>   http://www.w3.org/TR/2006/WD-wsdl20-rdf-20060518/
> 
> can be cached for a long time.

Yes, but which URLs? My original post went on to say:

--
`One type of URL response may be happily cached, perhaps for ever, the 
other type probably should not, but to a machine program the URL looks the 
same and without recourse to an error prone set of heuristics it is 
extremely difficult perhaps impossible to programmatically tell the 
difference. Given that what we are designing is meant to be a machine 
readable web, it is vital to know which URLs behave the way we want and 
which do not. Can one programmatically tell the difference without 
actually accessing them? Can one programmatically tell the difference even 
after accessing them?`
--

Perhaps I should have written that `it is vital to know which URIs behave 
the way we want and which do not`. You fairly responded that HTTP has an 
Expires header, and that the social conventions around how to split up a 
URL [what meaning the various substrings have, and what 
this implies for their shelf lives or versioning] can be written down and 
published, perhaps even in a machine readable form. But for automation 
one would need to dereference even this at some point and load it into the 
machine's understanding stack somehow. For the time being one would need to 
program in the different heuristics for every individual data source. A 
long row to hoe I think, and one that would likely defeat the objective 
of having a simple stack that can fetch the data given an identifier. We 
would be kidding ourselves if we did not acknowledge serious adoption 
problems. Note that I was optimistic in the `cached, perhaps for ever` 
statement because, as you note, the HTTP Expires header only supports 
caching for a year. Does this mean that the object named could change 
after a year? Who knows. This would be a problem for this community for 
both scientific and legal reasons. Of course this is one area that has 
some reasonably easy fixes. 

Given both the mixed-up social contract baggage surrounding HTTP URLs (is 
it an RPC transport or a permanent [!?] document name, how does one version 
it, and who owns the hostname now) and a number of technical shortcomings 
given the community's requirements, it is not hard to see how the idea of 
a new URN with its own specialized technical and social contracts provided 
a fresh start and yet still mapped down onto existing internet 
infrastructure.


Kindest regards, Sean

--
Sean Martin
IBM Corp
Received on Friday, 7 July 2006 16:42:34 UTC