Fw: Use of LSIDs in RDF (fwd)

Hi everyone,
As yet another person who worked on the LSID spec (dereferencing scheme 
and the addition of RDF meta-data discovery & retrieval), I have a 
supplement to Brian's question.

BG>This leads me to a question about "persistent" URI's and URL's
BG>(PURLs): How do you ensure that two URI's are pointing at the same
BG>object (bytes)?

My question is how does one programmatically identify a persistent HTTP 
URI, as opposed to one that will retrieve tomorrow's weather or perhaps 
retrieve a file from a P2P network or one that returns dynamically 
changing content? Apologies in advance if there is an obvious answer to 
this question. 

As to the question Greg originally asked, which is why invent anything new 
since we already have HTTP URI's, my short answer is that they did not 
seem to be sufficient in themselves to address the problems that the LSID 
scheme was designed for. The scheme devised does of course lean heavily on 
HTTP URI's as probably the primary method of retrieval of the data object 
or meta-data about that object. After all, much of the public LS data is 
actually out there on the web, already retrievable by HTTP URI. If HTTP 
URI's were sufficient today, we would not have need of the LSID. So 
perhaps the question you should ask yourself is why are people not 
already widely using URL's for LS naming?

For me the main points are: 

Location independence of the object named. The extra layer of indirection 
makes this flexibility possible. There is a starting assumption that 
users will make and exchange local copies of the objects, and also that 
authority entities will at some point want to transfer the authority over 
an LSID to another authority entity while potentially maintaining control 
of their domain name. Sometimes the same data is served from more than one 
"official" place on the web (e.g. Swiss-Prot; Marja, how does Annotea deal 
with this situation?). There is also the option of not using domain names 
in the identifier at all.

Providing/using LSID's for one's data establishes a "contract" in which 
certain properties can be assumed of an LSID-named object (beyond those 
of the HTTP URI "contract"). The contract provides:
a definition of what can safely be assumed about multiple copies of objects 
which have the same LSID name, i.e. that they are identical; 
a clear definition of what persistence means [both availability and never 
modifying a named object]; 
a formal mechanism for retrieving data [never ever changes] over multiple 
protocols and for discovering and retrieving meta-data [which can change] 
about that object and its relationship to other objects [from the original 
source of the object or from a third-party who has something to add of 
their own], all using a single globally unique name.
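
To make the contract described above a little more concrete, here is a 
minimal sketch in Python. It is only an illustration and not part of the 
LSID spec: the class name LsidStore, its methods, and the example LSID are 
all hypothetical. It shows the two halves of the contract: the data bytes 
stored under a name may never change, while meta-data about that name may 
grow over time and may come from more than one source.

```python
from typing import Dict, List, Tuple


class LsidStore:
    """Hypothetical in-memory store illustrating the contract:
    data under a name is immutable, meta-data is not."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}
        self._metadata: Dict[str, List[Tuple[str, str]]] = {}

    def put_data(self, lsid: str, data: bytes) -> None:
        # Storing identical bytes again is harmless (another copy of the
        # same object); storing different bytes breaks the contract.
        existing = self._data.get(lsid)
        if existing is not None and existing != data:
            raise ValueError("data for an existing LSID must never change")
        self._data[lsid] = data

    def get_data(self, lsid: str) -> bytes:
        return self._data[lsid]

    def add_metadata(self, lsid: str, source: str, statement: str) -> None:
        # Meta-data can change and can come from the original source
        # or from a third party.
        self._metadata.setdefault(lsid, []).append((source, statement))

    def get_metadata(self, lsid: str) -> List[Tuple[str, str]]:
        return list(self._metadata.get(lsid, []))


store = LsidStore()
name = "urn:lsid:example.org:proteins:P12345:1"  # made-up example name
store.put_data(name, b"MKVLAAGI")
store.add_metadata(name, "example.org", "original annotation")
store.add_metadata(name, "third-party.example", "additional comment")
```

Changed data would have to be published under a new name (for example a 
new version), which is what lets every copy carrying the same LSID be 
treated as identical.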

One parting thought: widespread adoption of the LSID spec across the 
industry will at the same time create a very large semantic web. 

Kindest regards, Sean

--
Sean Martin
IBM Corp.

---------- Forwarded message ----------
Date: Mon, 19 Apr 2004 12:42:46 -0400
From: Brian Gilman <gilmanb@pantherinformatics.com>
To: Greg Tyrelle <greg@tyrelle.net>
Cc: Martin Senger <senger@ebi.ac.uk>, public-semweb-lifesci@w3.org,
Marja-Riitta Koivunen <marja@annotea.org>
Subject: Re: Use of LSIDs in RDF

Hello Everyone,

I'm not an expert on URI's, but I am an author on the LSID
specification and would like to clarify some issues.

1) URI's are a nightmare in the life sciences, particularly when used
to encode semantic information about a particular entity that exists on
the web. For example (from the DAS 1.0 specification):

'/wormbase/das/elegans/features?segment=CHROMOSOME_I:1000,2000'

This leads the programmer and biologist to certain conclusions about
query semantics, i.e. what this URI encodes and (perhaps) what the
programmer meant when using a certain encoding scheme. People infer
meaning from a URI and learn these semantics very quickly. Some would
argue that this is a good thing; however, once biologists train
themselves on this type of system, the developers of these systems are
forever locked into this scheme of identification. This will forever
be the identifier for this entity. In the case noted above, this is
particularly cumbersome: if a researcher has started to annotate this
region of the chromosome with metadata and the underlying data changes,
the URI no longer identifies the data that was annotated.
As with any scientific data, there must be a way to reasonably
reproduce the evidence that led to a particular result or hypothesis.
By encoding things with URI's we do not guard against the fact that the
underlying data may change.

This leads me to a question about "persistent" URI's and URL's
(PURLs): How do you ensure that two URI's are pointing at the same
object (bytes)? If we can collectively answer this question we can
encode an LSID any way we please, as long as we keep in mind that this
information must persist as long as a journal or other well-vetted
scientific medium.
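
One possible way to check the "same bytes" part of this question, once
the content behind each URI has been retrieved, is to compare
cryptographic digests. The sketch below is only an illustration in
Python: the function names are hypothetical, the retrieval step is left
out, and nothing here is taken from the LSID specification.

```python
import hashlib


def content_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of the retrieved bytes."""
    return hashlib.sha256(data).hexdigest()


def same_object(copy_a: bytes, copy_b: bytes) -> bool:
    """True when two retrieved copies have the same digest,
    i.e. are (for practical purposes) byte-for-byte identical."""
    return content_digest(copy_a) == content_digest(copy_b)


# Example with made-up sequence data standing in for two retrievals.
first_copy = b"ACGTACGTAC"
second_copy = b"ACGTACGTAC"
print(same_object(first_copy, second_copy))
```

Recording such a digest alongside an annotation would also let a later
reader detect that the data behind a URI has changed since the
annotation was made. It does not say anything about whether a URI is
meant to be persistent in the first place.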

2) (sorry to be repetitive) Scientists typically perform research on
the web as a supplemental exercise. By this, I mean that most
researchers use data gathered from the web to enhance their knowledge
about a certain gene, protein, transcript, chemical etc. This data is
not typically referenced in a journal article. If we want to allow
for the incorporation and dissemination of scientific information and
knowledge across the internet as a common means of communication, we
need to ensure two things:

a) Persistence
b) Provenance

Science requires that an experiment be reproducible by other
researchers and that the discoverer/institution get credit for the
discovery made or technique used to make the discovery. We must pay
particular attention to this as we craft the LSID specification.

3) Browsers, HTTP semantics of query, RESTful interfaces, etc. are
secondary to how data is used in the industry. Having a resolver to get
at a particular piece of information should not be a huge barrier to
the LSID specification's adoption. Case in point: IBM's implementation
of LSID utilizes a COM plugin to allow users to perform LSID queries
from a web browser, i.e.
lsid://<authority>:<namespace>:<identifier>:<version>
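
As a rough illustration of how a client might split the browser form
shown above into its parts, here is a minimal Python sketch. The names
Lsid and parse_lsid and the example value are hypothetical, and treating
the version as optional is an assumption of this sketch. It handles only
the lsid:// form written above; the specification itself writes LSIDs as
URNs with a urn:lsid: prefix.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    identifier: str
    version: Optional[str]  # assumed optional in this sketch


def parse_lsid(text: str) -> Lsid:
    """Split lsid://<authority>:<namespace>:<identifier>:<version>."""
    prefix = "lsid://"
    if not text.lower().startswith(prefix):
        raise ValueError("expected a string starting with 'lsid://'")
    parts = text[len(prefix):].split(":")
    # Three mandatory parts, plus an optional fourth (the version).
    if len(parts) not in (3, 4) or any(not part for part in parts[:3]):
        raise ValueError("expected authority:namespace:identifier[:version]")
    version = parts[3] if len(parts) == 4 and parts[3] else None
    return Lsid(parts[0], parts[1], parts[2], version)


# Made-up example value, not a real LSID.
print(parse_lsid("lsid://example.org:proteins:P12345:1"))
```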


I hope this helps. I'll be posting specific examples of LSID in RDF in
the next few weeks which I hope will help clarify this issue further.

Best,

-B


--
Brian Gilman
President Panther Informatics Inc.
9 Acadia Park #2
Somerville, MA 02143
Phone 617-335-8276
E-Mail: gilmanb@pantherinformatics.com
gilmanb@jforge.net
AIM: gilmanb1

01000010 01101001 01101111
01001001 01101110 01100110
01101111 01110010 01101101
01100001 01110100 01101001
01100011 01101001 01100001
01101110

On Apr 15, 2004, at 2:27 AM, Greg Tyrelle wrote:

>
> *** Marja-Riitta Koivunen wrote:
>   |>   I am not sure how to answer this ultimate question. Perhaps I need to
>   |>understand more about HTTP URIs in order to give a comparison with the URN
>   |>used in the LSID spec. To be honest I have tried to find more and I gave up
>   |>after reading a very nice article about HTTP URIs by Tim Berners-Lee
>   |>(http://www.w3.org/DesignIssues/HTTP-URI.html) that gave me the feeling
>   |>that I am out of my league :-(
>
> I am by no means an expert on this either :)
>
> How URIs are used in the web architecture and the semantic web
> architecture are contentious issues to say the least. Given the
> importance of standardisation for the life sciences e.g. MAGE, I am
> simply trying to understand how identifier schemes such as LSID fit
> into the current thinking about the semantic web and URIs.
>
>   |I think the question is mainly why reinvent a wheel that already exists.
>
> Precisely.
>
>   |Using persistent HTTP URIs is a good goal because it is standard and there
>   |exists a lot of HTTP based applications e.g. browsers that understand HTTP
>   |URIs and can provide information of the resource on the Web without
>   |anything extra.
>
> This is exactly the context in which I was trying to raise the
> issue of using LSIDs in RDF. Technically speaking there is nothing
> wrong with the current LSID specification IMO. However if I want to
> allow other users to dereference LSIDs that my authority mints, it
> requires me to maintain an LSID resolver. Clients must also have the
> necessary libraries to dereference the LSID.
>
> Having this infrastructure in place adds a practical burden to using
> LSIDs which would go away if LSIDs were to be specified in terms of
> HTTP URIs. If we assume persistence is an organisational issue,
> then HTTP URIs are just as good as URNs. While the RDF spec says
> nothing about URIs being dereferenced to provide representations, my
> practical expectation is that they should be, i.e. it makes that resource
> more useful if I can get either a representation of that resource or
> metadata about that resource.
>
> The LSID spec does specify interfaces for retrieving metadata about an LSID
> which is a good thing. However I'll leave the "how to get metadata
> about a resource" question for later...
>
> This leads of course to the thorny issue of whether HTTP URIs are
> names or locations or both. My simple view on this is that an HTTP URI
> is a name in the same sense that the LSIDs are names, however it *may*
> also be dereferenced to provide a representation of the resource that
> it is naming (using the existing web infrastructure).
>
>   |When somebody defines a URN they can usually as well reserve a HTTP URI
>   |e.g. http://www.lsid.org/ and define URIs under it e.g.
>   |http://www.lsid.org/path.../enzyme1.
>
> This may be one possible solution; however, the LSID spec says nothing
> about this.
>
>   |Why we are now ALSO discussing using UUID URNs is because we have
>   |problems with local file: type URIs (file: URIs e.g. file//marja/myfile123)
>   |and we want to make them unambiguous when a user cannot do HTTP URIs maybe
>   |because a user does not own a http:// domain name where to publish it or
>   |for some other reason.
>
> I can see this as an issue for individuals creating local stores of
> annotated bookmarks. In the case of the life sciences it would be
> easier to control as any authority using HTTP URI based LSIDs would
> need to have a http:// domain name to participate.
>
>   |After publishing we mostly want to use the HTTP URIs to be able to benefit
>   |from the common Web standards. But it is also possible for some application
>   |to benefit from the URN bit of the information when so wished.
>
> Maybe there is a best-of-both-worlds approach?
>
> _greg
>
> --
> Greg Tyrelle
>

Received on Wednesday, 21 April 2004 22:12:26 UTC