Re: [BioRDF] All about the LSID URI/URN from Dan Connolly on 2006-07-27 (public-semweb-lifesci@w3.org from July 2006)

From: Dan Connolly <connolly@w3.org>
Date: Thu, 27 Jul 2006 10:12:45 -0500
To: Sean Martin <sjmm@us.ibm.com>
Cc: public-semweb-lifesci@w3.org
Message-Id: <1154013166.8748.620.camel@dirk.w3.org>
On Fri, 2006-07-07 at 12:42 -0400, Sean Martin wrote:
> 
> As background, it might help to know that one of the earliest
> requirements of the I3C's LSID group was that bio/pharma companies be
> able to copy large chunks of public databases internally so as to be
> able to access them repeatedly in private. They also wanted to be
> certain of exactly what version of the information they were using as
> it changed frequently.

Thanks for continuing to explain the requirements. I haven't seen
LSID requirements that can't be met with http/DNS yet, but that
doesn't mean they're not there.

As to these two, I do both of those with http URIs pretty routinely.
I tend to do it in a somewhat ad hoc fashion, but the OASIS
catalog spec is one fairly standardized mechanism for dereferencing
arbitrary URIs (including http URIs) from offline copies.
http://www.oasis-open.org/committees/entity/spec.html
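
To make the catalog idea concrete, here is a minimal sketch of
prefix-based URI rewriting, loosely modeled on the rewriteURI entry of
the OASIS catalog spec. The catalog contents, the local mirror path,
and the function name resolve_offline are assumptions made up for
illustration; a real catalog would be an XML file read by a catalog
resolver.

```python
from typing import Dict, Optional

# Hypothetical catalog: URI prefix -> local directory holding a mirrored copy.
CATALOG: Dict[str, str] = {
    "http://www.w3.org/TR/": "/var/mirror/w3c-tr/",
}


def resolve_offline(uri: str, catalog: Dict[str, str] = CATALOG) -> Optional[str]:
    """Map a URI to a local path using the longest matching catalog prefix.

    Returns None when no prefix matches, in which case the caller would
    fall back to fetching from the origin server.
    """
    best = None
    for prefix in catalog:
        if uri.startswith(prefix) and (best is None or len(prefix) > len(best)):
            best = prefix
    if best is None:
        return None
    # Keep the part of the URI after the prefix and graft it onto the mirror.
    return catalog[best] + uri[len(best):]
```

The URI itself is unchanged; only the place the bytes come from
differs, which is the sense in which an internal copy acts like a
cache.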

As to versioning, I gave the example of W3C tech reports that
are guaranteed not to change once published. The W3C process document
is the main mechanism by which we distribute the versioning policy;
the http expires header is consistent with the "never changes" policy,
but HTTP can only express "doesn't change this year". Mailing
list archives are another example. Lots of sites publish mailing
list archives with an implicit (but effectively communicated)
policy that they won't change the contents of the message once
published.
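
As a rough illustration of the "doesn't change this year" limit, the
sketch below builds the Expires header an origin server might send for
a document it never intends to change. HTTP/1.1 asks servers not to
send Expires dates more than about one year ahead, so one year is the
most the header can say. The function name and the choice of 365 days
are assumptions for illustration only.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

# HTTP/1.1 treats "about one year from now" as the way to say "never expires".
ONE_YEAR = timedelta(days=365)


def never_changes_headers(now: datetime) -> dict:
    """Return the Expires header for a resource published as immutable.

    `now` must be a timezone-aware UTC datetime.
    """
    expires = now + ONE_YEAR
    # usegmt=True yields the "GMT" suffix that HTTP dates use.
    return {"Expires": format_datetime(expires, usegmt=True)}
```

The stronger "never changes" promise still has to be carried by a
published policy, since the header alone cannot express it.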



>  The LSID URN scheme includes the means to allow them to do both of
> these seamlessly.

>  Scientists are able to use LSIDs to name or identify objects received
> via mail from a colleague, resolved over the web, retrieved from a
> copy archive or simply copied out of a file system,

Where you say "resolved over the web", I would say "resolved
using HTTP from the origin server"; mail from a colleague counts
as "resolved over the web" to me, as long as the mail is authorized
by (or at least consistent with the intent of) the party issuing
the URI. Likewise copy archive, file system, etc.

>  without ambiguity because the location of the object is unimportant
> and the naming syntax provides for versioning information. There can
> be no doubts, no further complex namespace rules from the provider of
> that data that need to be consulted. In short, a machine can easily be
> made to do it all. Perhaps it might help to think about LSIDs more in
> the taking a couple of copies here and there sense rather than the web
> proxy caching sense.  

If you don't mean to use LSIDs in the web proxy caching sense, then
it's not clear to me why you used URIs at all. Either these things
are in URI space, in which case the associated data is quite
analogous to a cached http response (or ftp response...), or they're
in some other space altogether.


> Anyway a couple of embedded comments included below: 
> 
> public-semweb-lifesci-request@w3.org wrote on 07/07/2006 08:57:33 AM: 
> 
> >  
> > http://lists.w3.org/Archives/Public/public-semweb-lifesci/2006Jun/0210.html
> >  
> > > The root of the problem is that the URL contains in it more than
> > > just a name. It also contains the network location where the only
> > > copy of the named object can be found (this is the hostname or ip
> > > address)
> >  
> > Which URL is that? It's not true of all URLs. Take, for example, 
> >   http://www.w3.org/TR/2006/WD-wsdl20-rdf-20060518/ 
> >  
> > That URL does not contain the network location where the only 
> > copy can be found; there are several copies on mirrors around the 
> > globe. 
> >  
> > $ host www.w3.org 
> > www.w3.org has address 128.30.52.46 
> > www.w3.org has address 193.51.208.69 
> > www.w3.org has address 193.51.208.70 
> > www.w3.org has address 128.30.52.31 
> > www.w3.org has address 128.30.52.45 
> > 
> 
> Can you explain this a little further please Dan? If perhaps you mean
> that the W3C has mirrors and a DNS server that responds with the
> appropriate IP addresses depending on where you are coming from or
> which servers are available or have less load,

Yes.

>  I agree that the URL points to multiple copies of the object but any
> single access path to that object is always determined by the URL
> issuing authority.

Yes.

Can LSIDs be dereferenced in ways not authorized by
the issuing authority?

>  I actually wrote the original code to do just this for IBM sports web
> sites in the mid 90's! I am sure though that you will appreciate that
> this is not at all the same thing as being able to actively source the
> named object from multiple places, where the client side chooses
> both the location and the means of access (protocol) and that this can
> still be done if the original issuing authority for the object goes
> away.

I spent a few days thinking about it, and no, I don't see any
particularly relevant differences.

>  From the client’s point of view, with a URL the protocol and the
> location are fixed and if either disappears the client has no way to
> ask anyone else for a copy.

No? If you "copy large chunks of public databases internally" then your
client can access those chunks a la caching.

If you don't have any access to any data that was ever provided
by the issuing authority, then yes, you lose. But that seems
by design. The only alternative is that the issuing authority
has no privileged claim on what the correct data is, and anybody
who claims to have data for some LSID (whether for malicious
reasons or otherwise) is just as correct as anybody else.

>  In my original post my thoughts were for the second of these meanings
> as the first has been obviously in practice for over a decade now.
> Sorry for not being  explicit earlier. 
> 
> >  
> >  
> > FYI, the TAG is working on a finding on URNs, Namespaces, and
> > Registries; the current draft has a brief treatment of this issue
> > of location (in)dependence...
> > http://www.w3.org/2001/tag/doc/URNsAndRegistries-50.html#loc_independent
> >  
> >  
> > > as well as the only means by which one may retrieve it (the
> > > protocol, usually http, https or ftp). The first question to ask
> > > yourself here is that when you are uniquely naming (in all of
> > > space and time!) a file/digital object which will be usefully
> > > copied far and wide, does it make sense to include as an integral
> > > part of that name the only protocol by which it can ever be
> > > accessed and the only place where one can find that copy?
> >  
> > If a better protocol comes along, odds are good that it will be
> > usable with names starting with http:
> 
> I am not sure I understand how this can be possible. Sure for evolved
> HTTP perhaps, but for protocols that have not yet been conceived I am
> not so sanguine.

How is "evolved HTTP" different from protocols that have not been
conceived? HTTP 1.1 had not been conceived when the http URI scheme
was originally deployed. Actually, HTTP 1.0 had not even been conceived.
In the original HTTP protocol, the response had no RFC822 style
headers but only SGML style <> stuff. The binding between the http:
scheme name and the protocol used on the wire has evolved considerably,
independent of all the http: links that are in place.

This is not to mention the variety of less standard practices, like
the squid cache protocol, akamai's DNS tricks, etc.


> >  
> > See section 2.3 Protocol Independence 
> > http://www.w3.org/2001/tag/doc/URNsAndRegistries-50.html#protocol_independent
> >  
> 
> hmm I am not sure I can buy the argument at the above link yet. Is
> this even an argument?  ....because myURIs always map to http anyway
> it is the same as if it were http, so why bother..? 

The point is that if new protocols emerge that have the same pattern
of an administrative hierarchy (a la DNS) followed by a string/path,
then they can be used with http: URIs as well as with any other URIs.

For example, DDNS works just fine with http: URIs.

> The main difference as far as I can see is that the mapping provides a
> level of indirection. This seems quite a significant difference and
> may be the point of having a myURI in the first place. The intention
> no doubt being to leave room for other protocols as they emerge and
> not tie a name to a single one as well as provide flexibility for
> actual deployment.  In my experience indirection is the great friend
> of people doing actual deployments. Also in this case protocol
> includes not just the technical TCP socket connection and GET headers
> etc, but also has to include the issues surrounding domain ownership
> too which are part of the resolution process.  While we may be
> reasonably certain about the technical issues, the uncertainties of
> tying one's long-term identifier to a hostname (even a virtual one like
> www.w3.org ) are considerable and in the face of this a layer of
> indirection begins to look quite prudent. 

I spent some time reading the LSIDs spec (I presume
http://www.omg.org/cgi-bin/doc?dtc/04-05-01 is the right one),
and I don't see anything novel about the social structure. The
examples I found all had DNS domain names in there somewhere.
Could you help me understand what's novel about the social structure
around LSIDs?


> Also note that this is not about just pure naming since retrieval is
> explicitly intended for both data and metadata from multiple sources.
> LSIDs are already mapped to multiple protocols (which would not be
> possible if you did not have indirection),

It is possible. HTTP URIs are already mapped to multiple protocols
too: squid caching, OASIS catalog lookup, etc.

The http: naming scheme has sufficient indirection, as long
as the naming system consists of an administrative hierarchy followed
by a path/string.

>  certainly this includes http URLs but also ftp & file:// URL's for
> wide area file systems as well as SOAP (which itself supports multiple
> transport protocols).  The LSID spec explicitly allows for the client
> to accumulate metadata from multiple metadata stores using multiple
> protocols without duplication using just the single URN. 
> >  
> > > Unfortunately when it comes to URL's there is no way to know that
> > > what is served one day will be served out the next simply by
> > > looking at the URL string. There is no social convention or
> > > technical contract to support the behavior that would be required.
> >  
> > Again, that's not true for all URLs. There are social and technical 
> > means to establish that 
> >  
> >   http://www.w3.org/TR/2006/WD-wsdl20-rdf-20060518/ 
> >  
> > can be cached for a long time. 
> 
> Yes, but which URLs? My original post went on to say: 
> 
> -- 
> `One type of URL response may be happily cached, perhaps for ever, the
> other type probably should not, but to a machine program the URL looks
> the same and without recourse to an error prone set of heuristics

Why must the mechanism for determining the lifetime of the binding
between a URI and the data be "an error prone set of heuristics"?
Why can it not be a well-designed community standard?
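
A minimal sketch of what such a community standard could look like in
machine-readable form: a published table of URI patterns, each tagged
with whether the data behind matching URIs may change. The table
entries, the regular expression for dated W3C tech report URIs, and
the function name is_immutable are all hypothetical, chosen only to
show that a client could decide without fetching anything.

```python
import re
from typing import List, Tuple

# Hypothetical published policy: (URI pattern, True if the content never changes).
# The first pattern is a guess at the shape of dated W3C TR URIs.
POLICY: List[Tuple[str, bool]] = [
    (r"^http://www\.w3\.org/TR/\d{4}/[A-Z]+-.+-\d{8}/?$", True),
    (r"^http://www\.w3\.org/TR/", False),
]


def is_immutable(uri: str, policy: List[Tuple[str, bool]] = POLICY) -> bool:
    """Return True if the first matching policy entry marks the URI immutable.

    URIs that match no entry are treated as mutable, the conservative default.
    """
    for pattern, immutable in policy:
        if re.search(pattern, uri):
            return immutable
    return False
```

A client holding this table can tell a dated, frozen document from a
"latest version" URI by inspecting the string alone.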

>  it is extremely difficult perhaps impossible to programmatically tell
> the difference. Given that what we are designing is meant to be a
> machine readable web, it is vital to know which URLs behave the way we
> want and which do not. Can one programmatically tell the difference
> without actually accessing them? Can one programmatically tell the
> difference even after accessing them?` 
> -- 
> 
> Perhaps I should have written that `it is vital to know which URIs
> behave the way we want and which do not`. You fairly responded that
> HTTP has an expires header and that the social conventions around how
> to split up a URL [and what meaning the various parts of the
> substrings have and what this implies for their shelf lives or
> versioning] can be written and published - perhaps even in a machine
> readable form.  But for automation one would need to dereference even
> this at some point and load it into the machine's understanding stack
> somehow. For the time being one would need to program in the different
> heuristics for every individual data source.  A long row to hoe I
> think, and one that would likely defeat the objective of having a
> simple stack that can fetch the data given an identifier. We would be
> kidding ourselves if we did not acknowledge serious adoption problems.

Any approach you take is going to have adoption problems. It's not
like the existing LSID approach was trivial to deploy, was it?

But when considering best practices for how this should be done next
time, let's please consider whether something that interoperates
with existing http/DNS clients is no more challenging to deploy
and gives considerably more benefit.

>   Note that I was optimistic on the `cached, perhaps for ever`
> statement, because as you note http expires only supports caching for
> a year. Does this mean that the object named could change after a
> year? (Who knows). This would be a problem for this community for both
> scientific and legal reasons. Of course this is one area that has some
> reasonably easy fixes.   
> 
> Given both the mixed-up social contract baggage surrounding HTTP URLs
> (is it an RPC transport or a permanent [!?] document name, how does
> one version it, and who owns the hostname now) as well as a number of
> technical shortcomings given the community's requirements, it is not
> hard to see how the idea of a new URN with its own specialized
> technical and social contracts provided a fresh start and yet still
> mapped down onto existing internet infrastructure.

Yes, it's easy to see how starting fresh simplified some things.
But I am not convinced that starting fresh is the only option,
nor that working within the constraints of http/DNS won't give
a lot more benefit for approximately the same investment.


-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/
D3C2 887B 0F92 6005 C541  0875 0F91 96DE 6E52 C29E
Received on Thursday, 27 July 2006 15:13:01 UTC