URLs/LSID/RDF etc.

Hello Everyone,

	I'm not an expert on URIs, but I am an author of the LSID 
specification and would like to clarify some issues.

	1) URIs are a nightmare in the life sciences, particularly when they 
are used to encode semantic information about an entity that exists on 
the web. For example (from the DAS 1.0 specification):

	'/wormbase/das/elegans/features?segment=CHROMOSOME_I:1000,2000'

	This leads the programmer and the biologist to certain conclusions 
about query semantics, i.e. what this URI encodes and (perhaps) what the 
programmer meant when choosing a certain encoding scheme. People infer 
meaning from a URI and learn these semantics very quickly. Some would 
argue that this is a good thing. However, once biologists have trained 
themselves on this type of system, its developers are locked into this 
scheme of identification: the URI becomes the permanent identifier for 
this entity. In the case noted above, this is particularly cumbersome. 
If a researcher has started to annotate this region of the chromosome 
with metadata and the underlying data changes, the annotation may no 
longer match what the URI returns. As with any scientific data, there 
must be a way to reasonably reproduce the evidence that led to a 
particular result or hypothesis. By encoding things with URIs we do not 
guard against the fact that the underlying data may change.
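
	To make the point concrete, here is a minimal Python sketch of what a 
reader infers from the DAS URI above. The path layout 
(/<server>/das/<source>/<command>) and the function name are assumptions 
made for illustration only; this is not taken from a DAS client library.

```python
from urllib.parse import urlsplit, parse_qs


def parse_das_features_uri(uri):
    """Split a DAS-style features URI into the parts a reader infers from it.

    Assumed layout (illustrative): /<server>/das/<source>/<command>?segment=<ref>:<start>,<stop>
    """
    parts = urlsplit(uri)
    # Leading "/" yields an empty first element, so five pieces are expected.
    _, server, _das, source, command = parts.path.split("/")
    segment = parse_qs(parts.query)["segment"][0]
    reference, _, coords = segment.partition(":")
    start, stop = (int(value) for value in coords.split(","))
    return {
        "server": server,
        "source": source,
        "command": command,
        "reference": reference,
        "start": start,
        "stop": stop,
    }


parsed = parse_das_features_uri(
    "/wormbase/das/elegans/features?segment=CHROMOSOME_I:1000,2000"
)
print(parsed)
```

	Every field recovered here is semantic (organism, chromosome, 
coordinates), yet none of them says which release of the underlying 
data the coordinates refer to.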

	This leads me to a question about "persistent" URIs and URLs 
(PURLs): how do you ensure that two URIs are pointing at the same 
object (bytes)? If we can collectively answer this question, we can 
encode an LSID any way we please, as long as we keep in mind that this 
information must persist as long as a journal or other well-vetted 
scientific medium.
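
	One possible way to check "same bytes" is to compare a digest of 
whatever each URI returned. The sketch below is only an illustration: 
the choice of SHA-256, the function names, and the sample payloads are 
assumptions, and nothing here is prescribed by the LSID specification.

```python
import hashlib


def content_digest(data: bytes) -> str:
    """Return a hex digest that identifies the exact bytes, not their location."""
    return hashlib.sha256(data).hexdigest()


def same_object(first: bytes, second: bytes) -> bool:
    """True when two retrieved payloads are byte-for-byte identical."""
    return content_digest(first) == content_digest(second)


# Hypothetical payloads standing in for what different URIs returned.
original = b">CHROMOSOME_I\nGCCTAAGCCTAAGCCTAAGC\n"
mirror = b">CHROMOSOME_I\nGCCTAAGCCTAAGCCTAAGC\n"
revised = b">CHROMOSOME_I\nGCCTAAGCCTAAGCCTAAGG\n"

print(same_object(original, mirror))   # identical bytes
print(same_object(original, revised))  # one base differs
```

	A digest recorded alongside an annotation would let a later reader 
detect that the data behind a URI has changed, even when the URI itself 
has not.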

	2) (sorry to be repetitive) Scientists typically perform research on 
the web as a supplemental exercise. By this I mean that most 
researchers use data gathered from the web to enhance their knowledge 
about a certain gene, protein, transcript, chemical, etc. This data is 
not typically referenced in a journal article. If we want to allow 
for the incorporation and dissemination of scientific information and 
knowledge across the internet as a common means of communication, we 
need to ensure two things:

			a) Persistence
			b) Provenance

	Science requires that an experiment be reproducible by other 
researchers and that the discoverer/institution get credit for the 
discovery made or technique used to make the discovery. We must pay 
particular attention to this as we craft the LSID specification.

	3) Browsers, HTTP query semantics, RESTful interfaces, etc. are 
secondary to how data is used in the industry. Having to use a resolver 
to get at a particular piece of information should not be a huge barrier 
to the LSID specification's adoption. Case in point: IBM's implementation 
of LSID uses a COM plugin to allow users to perform LSID queries 
from a web browser, i.e. 
lsid://<authority>:<namespace>:<identifier>:<version>
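
	The four fields above can be pulled apart with a few lines of code. 
This is a rough sketch, assuming the URN spelling 
urn:lsid:<authority>:<namespace>:<identifier>[:<version>] with an 
optional version; the class name, function name, and the example.org 
identifier are hypothetical and do not come from any LSID toolkit.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    identifier: str
    version: Optional[str]


def parse_lsid(text: str) -> LSID:
    """Split an LSID written as a URN into its colon-separated fields."""
    prefix = "urn:lsid:"
    if not text.lower().startswith(prefix):
        raise ValueError(f"not an LSID URN: {text!r}")
    fields = text[len(prefix):].split(":")
    # Three mandatory fields, plus an optional version (assumption stated above).
    if len(fields) not in (3, 4) or not all(fields[:3]):
        raise ValueError(f"malformed LSID: {text!r}")
    version = fields[3] if len(fields) == 4 and fields[3] else None
    return LSID(fields[0], fields[1], fields[2], version)


# Hypothetical identifier used purely for illustration.
lsid = parse_lsid("urn:lsid:example.org:genes:abc123:2")
print(lsid)
```

	Note that, unlike the DAS URI in point 1, the version is part of the 
name itself, so a change to the underlying data can be given a new 
identifier.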


	I hope this helps. I'll be posting specific examples of LSID in RDF in 
the next few weeks which I hope will help clarify this issue further.

									Best,

												-B
	

-- 
Brian Gilman
President Panther Informatics Inc.
9 Acadia Park #2
Somerville, MA 02143
Phone 617-335-8276
E-Mail: gilmanb@pantherinformatics.com
         gilmanb@jforge.net
AIM: gilmanb1

01000010 01101001 01101111
01001001 01101110 01100110
01101111 01110010 01101101
01100001 01110100 01101001
01100011 01101001 01100001
01101110

On Apr 15, 2004, at 2:27 AM, Greg Tyrelle wrote:

>
> *** Marja-Riitta Koivunen wrote:
>   |>   I am not sure how to answer this ultimate question. Perhaps I need to
>   |>understand more about HTTP URIs in order to give comparison with the URN
>   |>used in LSID spec. To be honest I have tried to find more and I gave up
>   |>after reading very nice article about HTTP URIs by Tim Berners-Lee
>   |>(http://www.w3.org/DesignIssues/HTTP-URI.html) that gave me feeling that I
>   |>am out of the league :-(
>
> I am by no means an expert on this either :)
>
> How URIs are used in the web architecture and the semantic web
> architecture are contentious issues to say the least. Given the
> importance of standardisation for the life sciences e.g. MAGE, I am
> simply trying to understand how identifier schemes such as LSID fit
> into the current thinking about the semantic web and URIs.
>
>   |I think the question is mainly why reinvent a wheel that already exists.
>
> Precisely.
>
>   |Using persistent HTTP URIs is a good goal because it is standard and there
>   |exists a lot of HTTP based applications e.g. browsers that understand HTTP
>   |URIs and can provide information of the resource on the Web without
>   |anything extra.
>
> This is exactly the context in which I was trying to raise the
> issue of using LSIDs in RDF. Technically speaking there is nothing
> wrong with the current LSID specification IMO. However if I want to
> allow other users to dereference LSIDs that my authority mints, it
> requires me to maintain an LSID resolver. Clients must also have the
> necessary libraries to dereference the LSID.
>
> Having this infrastructure in place adds a practical burden to using
> LSIDs which would go away if LSIDs were to be specified in terms of
> HTTP URIs. If we assume persistence is an organisational issue,
> then HTTP URIs are just as good as URNs. While the RDF spec says
> nothing about URIs being dereferenced to provide representations, my
> practical expectation is that they should i.e. it makes that resource
> more useful if I can get either a representation of that resource or
> metadata about that resource.
>
> The LSID does specify interfaces for retrieving metadata about a LSID
> which is a good thing. However I'll leave the "how to get metadata
> about a resource" question for later...
>
> This leads of course to the thorny issue of whether HTTP URIs are
> names or locations or both. My simple view on this is that a HTTP URI
> is a name in the same sense that the LSIDs are names, however it *may*
> also be dereferenced to provide a representation of the resource that
> it is naming (using the existing web infrastructure).
>
>   |When somebody defines a URN they can usually as well reserve a HTTP URI
>   |e.g. http://www.lsid.org/ and define URIs under it e.g.
>   |http://www.lsid.org/path.../enzyme1.
>
> This may be one possible solution, however the LSID spec says nothing
> about this.
>
>   |Why we are now ALSO discussing of using UUID URNs is because we have
>   |problems with local file: type URIs (file: URIs e.g. file//marja/myfile123)
>   |and we want to make them unambiguous when a user cannot do HTTP URIs maybe
>   |because a user does not own a http:// domain name where to publish it or
>   |for some other reason.
>
> I can see this as an issue for individuals creating local stores of
> annotated bookmarks. In the case of the life sciences it would be
> easier to control as any authority using HTTP URI based LSIDs would
> need to have a http:// domain name to participate.
>
>   |After publishing we mostly want to use the HTTP URIs to be able to benefit
>   |from the common Web standards. But it is also possible for some application
>   |to benefit from the URN bit of the information when so wished.
>
> Maybe there is a best of both worlds approach ?
>
> _greg
>
> -- 
> Greg Tyrelle
>

Received on Tuesday, 20 April 2004 18:58:12 UTC