- From: Greg Tyrelle <greg@tyrelle.net>
- Date: Thu, 22 Apr 2004 16:07:35 +1000
- To: Brian Gilman <gilmanb@jforge.net>
- Cc: public-semweb-lifesci@w3.org
*** Brian Gilman wrote:
| 1) URI's are a nightmare in the lifesciences. Particularly when
| used to encode semantic information about a particular entity that
| exists on the web. For example (from the DAS 1.0 specification):
|
| '/wormbase/das/elegans/features?segment=CHROMOSOME_I:1000,2000'
|
| This leads the programmer and biologist to certain conclusions
| about query semantics ie. what this URI encodes and (perhaps) what the
| programmer meant when using a certain encoding scheme. People infer
| meaning from a URI and learn this semantic very quickly. Some would
| argue that this is a good thing however, once the biologist trains
| themselves on this type of system, the developers of these systems are
| forever locked into this scheme of identification. This will forever
| become the identifier for this entity. In the case noted above, this is
| particularly cumbersome: If a researcher has started to annotate this
| region of the chromosome with metadata and the underlying data changes

Both LSIDs and URLs are URIs, in which case they are intended to be
opaque identifiers. You are not meant to infer anything about the
resource from the URI? I believe this is a case of using URIs
incorrectly [1], not that HTTP URIs are broken.

| As with any scientific data, there must be a way to reasonably
| reproduce the evidence that lead to a particular result or hypothesis.

My main issue is the use of LSID to identify biological objects or
concepts in RDF (the example is borrowed from one of Eric Neumann's
previous emails to the list):

  :A9311 a :Annotation ;
    dc:creator "Jonathan Smythe" ;
    affx:hypothesizes {
      :Gene5 a affx:Gene ;
        affx:hasVariant [
          affx:representedBy :gi9887088;
          affx:process "GO:0006306";
          affx:associatedWith "omim:209920"
        ] .
    } .

If my "agent" is to add this hypothesis to its KB, I might instruct it
to find more information about the processes involved (assuming I don't
already have this knowledge).
If the GO terms and GIs were HTTP URIs I can dereference them to
(hopefully) retrieve some useful information about those resources.
However with LSID I must have the necessary infrastructure in place
(resolvers, clients etc.). I have ignored the issue of retrieving the
"object" vs. a description of the "object" in this case. But either way
I have a better chance with HTTP. True, the LSID spec defines mechanisms
for getting descriptions of LSIDs via web services etc. I should
investigate the implementations...

| By encoding things with URI's we do not guard against the fact that the
| underlying data may change.

Why do we need to guard against the underlying data changing?

| This leads me to a question about "persistent" URI's and URL's
| (PURLS's): How do you ensure that two URI's are pointing at the same
| object (bytes)? If we can collectively answer this question we can
| encode an LSID any way we please as long as we keep in mind that this
| information must persist as long as a journal or other well vetted
| scientific medium.

You will not be able to "technically" ensure two URIs are not pointing
at the same object using LSID or HTTP URIs IMO. Also if LSIDs are going
to be used to identify "concepts", what is to say that two authorities
will have LSIDs for the concept p53? This is especially important
considering their use in RDF to identify "resources".

Encoding LSIDs as HTTP URIs seems to be a way forward. Maybe some kind
of mapping:

  URN:LSID:example.com:12345:1
  http://example.com.lsid.org/12345/1

| 2) (sorry to be repetitive) Scientists typically perform
| research on the web as a supplemental exercise. By this, I mean that
| most researchers use data gathered from the web to enhance their
| knowledge about a certain gene, protein, transcript, chemical etc. This
| data is not typically referenced in a journal article etc.
| If we want to allow for the incorporation and dissemination of
| scientific information and knowledge across the internet as a common
| means of communication we need to ensure two things:
|
|   a) Persistence
|   b) Provenance
|
| Science requires that an experiment be reproducible by other
| researchers and that the discoverer/institution get credit for the
| discovery made or technique used to make the discovery. We must pay
| particular attention to this as we craft the LSID specification.

I agree that scientists typically "perform research on the web" as a
supplement to "real experiments". This is edging towards "systems
biology" i.e. how to best utilise our existing knowledge to better
target our experimental approaches (my understanding of it). That being
said, LSID URNs are not part of the web; they don't use web
infrastructure.

I maintain that persistence is an organisational issue and not
necessarily technical. There is both a social and trust component of
persistence e.g. I trust that the NCBI will maintain persistent URIs
for their records (maybe not :)) or an LSID authority for that matter.

| 3) Browsers, HTTP semantics of query, RESTful interfaces, etc.
| are secondary to how data is used in the industry. Having a resolver to
| get at a particular piece of information should not be a huge barrier
| to the LSID specification's adoption. Case in point, IBM's
| implementation of LSID utilizes a COM plugin to allow users to perform
| LSID queries from a web browser. ie.
| lsid://<authority>:<namespace>:<identifier>:<version>

I'm not sure what to make of this, points two and three seem to
contradict each other. Scientists performing "research on the web" use
browsers, HTTP, and RESTful interfaces. However the web (browsers, HTTP,
REST) is "secondary to how data is used in industry"? Again, a life
sciences identifier scheme that is not part of the web is less useful to
me. But maybe to industry perhaps?
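To make the URN-to-HTTP mapping floated earlier in this message concrete, here is a minimal Python sketch. It assumes the LSID layout quoted above (authority, namespace, identifier, then a version that is treated here as optional) and reuses the purely hypothetical `<authority>.lsid.org` host pattern from the example. The names `Lsid`, `parse_lsid`, and `lsid_to_http` are illustrative only and do not come from the LSID spec or any implementation.

```python
from typing import NamedTuple, Optional
from urllib.parse import quote


class Lsid(NamedTuple):
    authority: str
    namespace: str
    identifier: str
    version: Optional[str]  # assumption: the version part may be absent


def parse_lsid(urn: str) -> Lsid:
    """Split 'urn:lsid:<authority>:<namespace>:<identifier>[:<version>]'."""
    parts = urn.split(":")
    if (
        len(parts) not in (5, 6)
        or parts[0].lower() != "urn"
        or parts[1].lower() != "lsid"
    ):
        raise ValueError(f"not an LSID URN: {urn!r}")
    authority, namespace, identifier = parts[2:5]
    version = parts[5] if len(parts) == 6 else None
    return Lsid(authority, namespace, identifier, version)


def lsid_to_http(urn: str, suffix: str = "lsid.org") -> str:
    """Map an LSID URN to an HTTP URI of the form suggested in this message.

    The 'lsid.org' suffix is hypothetical, taken from the example only.
    """
    lsid = parse_lsid(urn)
    segments = [lsid.namespace, lsid.identifier]
    if lsid.version is not None:
        segments.append(lsid.version)
    # Percent-encode each path segment so odd characters stay inside it.
    path = "/".join(quote(segment, safe="") for segment in segments)
    return f"http://{lsid.authority}.{suffix}/{path}"


print(lsid_to_http("URN:LSID:example.com:12345:1"))
```

The mapping is purely syntactic: it says nothing about whether anything actually answers at the resulting HTTP URI, which remains the organisational question of persistence discussed above.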
My intention here is not to be harshly critical of the LSID spec or the
points you have made. I am very sympathetic to the view that persistent
identifiers be widely adopted in the life sciences. I am just trying to
figure out best practice use for LSIDs in RDF etc.

| I hope this helps. I'll be posting specific examples of LSID in
| RDF in the next few weeks which I hope will help clarify this issue
| further.

Some examples and further discussion on identifiers (LSID) for life
sciences as they pertain to the semantic web would be great.

_greg

[1] http://www.w3.org/Provider/Style/URI

--
Greg Tyrelle
Received on Thursday, 22 April 2004 02:18:40 UTC