Re: URLs/LSID/RDF etc. from Sean Martin on 2004-04-22 (public-semweb-lifesci@w3.org from April 2004)

From: Sean Martin <sjmm@us.ibm.com>
Date: Thu, 22 Apr 2004 09:03:06 -0400
To: public-semweb-lifesci@w3.org
Cc: Greg Tyrelle <greg@tyrelle.net>
Message-ID: <OF5FA3CB60.89833217-ON85256E7E.00471C85-85256E7E.0047B24E@us.ibm.com>
GT>Encoding LSID as a HTTP URIs seems to be a way forward. Maybe some
GT>kind of mapping:
GT>
GT>URN:LSID:example.com:12345:1
GT>
GT>http://example.com.lsid.org/12345/1

Actually, under the covers at a current LSID authority, this is what 
most often happens. The LSID resolver spec and software stacks were 
designed exactly to allow those providing data to make this mapping. The 
spec allows for other protocols as well as HTTP (file://, ftp://, SOAP, etc.) 
and for specifying multiple sources for the data (and metadata). 
The difference, though, is that the data provider controls the mapping; it 
is not a convention applied by someone using the LSID locally. This means 
the data provider can remake the map any time they like and nobody gets 
broken.
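
A minimal Python sketch of the distinction described above. It is illustrative only and is not the LSID resolution protocol or any real software stack: the class ToyAuthority, its publish/resolve methods, and the ftp://mirror.example.com location are hypothetical names invented for this example. The point it tries to show is that the client asks the provider where the data lives, so the provider can change the answer without changing the LSID.

```python
from typing import Dict, List


class ToyAuthority:
    """Hypothetical stand-in for a data provider that owns the LSID-to-location map."""

    def __init__(self) -> None:
        self._locations: Dict[str, List[str]] = {}

    def publish(self, lsid: str, locations: List[str]) -> None:
        # The provider may call this again at any time to remake the map.
        self._locations[lsid] = list(locations)

    def resolve(self, lsid: str) -> List[str]:
        # Clients ask the authority each time instead of deriving a URL themselves.
        return list(self._locations.get(lsid, []))


authority = ToyAuthority()
lsid = "URN:LSID:example.com:12345:1"

# Initial mapping: a single HTTP location.
authority.publish(lsid, ["http://example.com.lsid.org/12345/1"])
first = authority.resolve(lsid)

# The provider later adds a second source; the LSID itself is unchanged.
authority.publish(lsid, [
    "ftp://mirror.example.com/12345/1",
    "http://example.com.lsid.org/12345/1",
])
second = authority.resolve(lsid)
```

A client that stored only the LSID keeps working after the second publish call, whereas a client that had hard-coded a URL derived by local convention would depend on that URL staying valid.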

Kindest regards, Sean

--
Sean Martin
IBM Corp




Greg Tyrelle <greg@tyrelle.net>
Sent by: public-semweb-lifesci-request@w3.org
04/22/2004 02:07 AM

To: Brian Gilman <gilmanb@jforge.net>
cc: public-semweb-lifesci@w3.org
Subject: Re: URLs/LSID/RDF etc.

*** Brian Gilman wrote:
|       1) URI's are a nightmare in the lifesciences. Particularly when
|used to encode semantic information about a particular entity that
|exists on the web. For example (from the DAS 1.0 specification):
|
|       '/wormbase/das/elegans/features?segment=CHROMOSOME_I:1000,2000'
|
|
|       This leads the programmer and biologist to certain conclusions
|about query semantics ie. what this URI encodes and  (perhaps) what the
|programmer meant when using a certain encoding scheme. People infer
|meaning from a URI and learn this semantic very quickly.  Some would
|argue that this is a good thing however, once the biologist trains
|themselves on this type of system, the developers of these systems are
|forever locked into this scheme of identification.  This will forever
|become the identifier for this entity. In the case noted above, this is
|particularly cumbersome: If a researcher has started to annotate this
|region of the chromosome with metadata and the underlying data changes

Both LSIDs and URLs are URIs, and as such they are intended to be
opaque identifiers. You are not meant to infer anything about the
resource from the URI. I believe this is a case of using URIs
incorrectly [1], not that HTTP URIs are broken.

|As with any scientific data, there must be a way to reasonably
|reproduce the evidence that led to a particular result or hypothesis.

My main issue is the use of LSID to identify biological objects or
concepts in RDF (the example is borrowed from one of Eric Neumann's
previous emails to the list):

:A9311 a :Annotation ;
    dc:creator "Jonathan Smythe" ;
    affx:hypothesizes {
        :Gene5 a affx:Gene ;
            affx:hasVariant [
                affx:representedBy :gi9887088 ;
                affx:process "GO:0006306" ;
                affx:associatedWith "omim:209920"
            ] .
    } .

If my "agent" is to add this hypothesis to its KB, I might instruct
it to find more information about the processes involved (assuming I
don't already have this knowledge). If the GO terms and GIs were HTTP
URIs I can dereference them to (hopefully) retrieve some useful
information about those resources. However with LSID I must have the
necessary infrastructure in place (resolvers, clients etc.).

I have ignored the issue of retrieving the "object" vs. a description
of the "object" in this case. But either way I have a better chance with
HTTP. True, the LSID spec defines mechanisms for getting descriptions of
LSIDs via web services etc. I should investigate the implementations...

|By encoding things with URI's we do not guard against the fact that the
|underlying data may change.

Why do we need to guard against the underlying data changing ?

|       This leads me to a question about "persistent" URI's and URL's
|(PURLS's): How do you ensure that two URI's are pointing at the same
|object (bytes)? If we can collectively answer this question we can
|encode an LSID any way we please as long as we keep in mind that this
|information must persist as long as a journal or other well vetted
|scientific medium.

You will not be able to "technically" ensure two URIs are not pointing
at the same object using LSID or HTTP URIs IMO. Also if LSIDs are
going to be used to identify "concepts", what is to say that two
authorities will have LSIDs for the concept p53 ? This is especially
important considering their use in RDF to identify "resources".

Encoding LSIDs as HTTP URIs seems to be a way forward. Maybe some
kind of mapping:

URN:LSID:example.com:12345:1

http://example.com.lsid.org/12345/1
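
The mapping above could be sketched in Python roughly as follows. This is only one possible reading of the convention, not anything defined by the LSID spec: the function name lsid_to_http, the gateway parameter, and the handling of an optional sixth (revision) field are assumptions made for illustration.

```python
def lsid_to_http(lsid: str, gateway: str = "lsid.org") -> str:
    """Map urn:lsid:<authority>:<fields...> to http://<authority>.<gateway>/<fields...>.

    Hypothetical helper: the gateway suffix and the path layout follow the
    example in the message, not any published standard.
    """
    parts = lsid.split(":")
    # Accept five fields, or six when a trailing revision is present (assumption).
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected number of fields in LSID: {lsid!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID URN: {lsid!r}")
    if not all(parts[2:]):
        raise ValueError(f"empty field in LSID: {lsid!r}")
    authority = parts[2]
    path = "/".join(parts[3:])
    return f"http://{authority}.{gateway}/{path}"


example = lsid_to_http("URN:LSID:example.com:12345:1")
# example == "http://example.com.lsid.org/12345/1"
```

Because the rule is purely syntactic, anyone can apply it without a resolver, but it also fixes the URL shape for every authority that is mapped this way.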

|       2) (sorry to be repetitive) Scientists typically perform
|research on the web as a supplemental exercise. By this, I mean that
|most researchers use data gathered from the web to enhance their
|knowledge about a certain gene, protein, transcript, chemical etc. This
|data is not typically referenced in a journal article etc. If we want
|to allow for the incorporation and dissemination of  scientific
|information and knowledge across the internet as a common means of
|communication we need to ensure two things:
|
|                       a) Persistence
|                       b) Provenance
|
|       Science requires that an experiment be reproducible by other
|researchers and that the discoverer/institution get credit for the
|discovery made or technique used to make the discovery. We must pay
|particular attention to this as we craft the LSID specification.

I agree that scientists typically "Perform research on the web" as a
supplement to "real experiments". This is edging towards "systems
biology" i.e. how to best utilise our existing knowledge to better
target our experimental approaches (my understanding of it). That
being said, LSID URNs are not part of the web; they don't use web
infrastructure.

I maintain that persistence is an organisational issue and not
necessarily a technical one. There are both social and trust components
to persistence, e.g. I trust that the NCBI will maintain persistent URIs
for their records (maybe not :)), or an LSID authority for that matter.

|       3) Browsers, HTTP semantics of query, RESTful interfaces, etc.
|are secondary to how data is used in the industry. Having a resolver to
|get at a particular piece of information should not be a huge barrier
|to the LSID specification's adoption. Case in point, IBM's
|implementation of LSID utilizes a COM plugin to allow users to perform
|LSID queries from a web browser. ie.
|lsid://<authority>:<namespace>:<identifier>:<version>

I'm not sure what to make of this; points two and three seem to
contradict each other. Scientists performing "research on the web" use
browsers, HTTP, and RESTful interfaces. However, the web (browsers,
HTTP, REST) is "secondary to how data is used in industry" ?

Again, a life sciences identifier scheme that is not part of the web
is less useful to me. But maybe it is useful to industry ?

My intention here is not to be harshly critical of the LSID spec or
the points you have made. I am very sympathetic to the view that
persistent identifiers be widely adopted in the life sciences. I am
just trying to figure out best practice use for LSIDs in RDF etc.

|       I hope this helps. I'll be posting specific examples of LSID in
|RDF in the next few weeks which I hope will help clarify this issue
|further.

Some examples and further discussion on identifiers (LSID) for life
sciences as they pertain to the semantic web would be great.

_greg

[1] http://www.w3.org/Provider/Style/URI


--
Greg Tyrelle
Received on Thursday, 22 April 2004 09:04:49 UTC