- From: Andrew Dalke <dalke@acm.org>
- Date: Sun, 24 Jun 2001 15:57:53 +0900
- To: uri@w3.org
Hello,

I'm working with others to try to come up with a naming scheme for bioinformatics data types. (Bioinformatics, broadly speaking, is the use of computers to understand biology but usually applies to cellular mechanisms, with an emphasis on DNA, RNA and protein sequences.) I'm running into problems with how to treat different versions of the same record. But first, some background.

This project is just starting up. We've picked what I hope to be the easiest topic, which is naming records in a sequence database in a fashion that allows people and software to use the same name to refer to the same record. This implies coming up with some sort of URN or URI scheme.

There are hundreds of sequence and sequence-related databases of which about 20 are commonly used in daily research. Each record in a specific release of a database has a unique key. Each database has a unique name. For demonstration purposes, assume one of the databases is named "swissprot" and a record in that database is named "100K_RAT". One possible naming scheme to use is

    bio:swissprot/100K_RAT

Another one would be to stick this in the urn: namespace, which if I read the RFC correctly would be written

    urn:x-bio:swissprot/100K_RAT

This is easy to understand but there is a problem we're running into, and I'm hoping for advice from people here. How should we handle versions?

There are two types of versions - one is the database release version, so "SWISS-PROT Release 38" or "PIR Release 104.2", and the other is the version of the record, which might be written "100K_RAT.1", "100K_RAT.2", etc. (The record versioning is usually done for databases with no clear release date - as for records available through the web and continuously updated on the back-end. In some cases you can ask the database for different versions of the same records, which is useful if you want to compare current records with historical ones.)

Upon a close read of RFC 2396 I noticed section 3.3 on "Path Component", which mentions a "param" part of a path segment. That suggests a possible URI naming scheme like

    bio:swissprot;38/100K_RAT

where the ";38" means this is swissprot release 38. Similarly,

    bio:swissprot/100K_RAT;2

could be used to specify version "2" of the 100K_RAT record.

The problem is that I can't find anything that suggests that this is a valid use of the param field - indeed, I can't find anything which actually uses it, outside of some mentions in the RFC itself. For that matter, I can't find anything which describes how to handle versions in URLs.

Consider RFC 1737 ("Functional Requirements for Uniform Resource Names"), which includes:

  o Global scope: A URN is a name with global scope which does not imply a location. It has the same meaning everywhere.

  o Global uniqueness: The same URN will never be assigned to two different resources.

Suppose I ask for "bio:swissprot/100K_RAT", that is, an unversioned record. This is a useful name even without the version because in most cases people mean it to refer to the most recent version of that record. Now I try to resolve it. My resolver happens to know about two different swissprot releases, versions 37 and 38. Since version 38 is more recent, it returns "bio:swissprot;38/100K_RAT;2" instead of "bio:swissprot;37/100K_RAT;1". And it just happens that there was a mistake in the sequencer which got caught, so the sequence in release 38 is different than that in 37.

The original name, "bio:swissprot/100K_RAT", therefore is not a URN, because it doesn't have global scope (it depends on which databases the resolver knows about) and doesn't have global uniqueness (there was a choice of returning records from release 37 or from release 38). But note that "bio:swissprot;38/100K_RAT;2" and "bio:swissprot;37/100K_RAT;1" are URNs.
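
To make the resolution example above concrete, here is a minimal sketch in Python of how such names might be parsed and matched. It is only an illustration under stated assumptions: the "bio:" scheme and the ";release" / ";version" params are the proposal described above, not an established convention, and the helper names (BioName, parse_bio, candidates, resolve_latest) are hypothetical. It also assumes release labels are dotted numbers such as "38" or "104.2" so that they can be ordered.

```python
from typing import List, NamedTuple, Optional, Tuple


class BioName(NamedTuple):
    """Parts of a proposed bio: name (hypothetical structure)."""
    database: str
    release: Optional[str]   # database release, e.g. "38"; None if unversioned
    record: str
    version: Optional[str]   # record version, e.g. "2"; None if unversioned


def parse_bio(uri: str) -> BioName:
    """Split bio:<db>[;<release>]/<record>[;<version>] into its parts."""
    scheme, sep, path = uri.partition(":")
    if sep != ":" or scheme != "bio":
        raise ValueError(f"not a bio: name: {uri!r}")
    segments = path.split("/")
    if len(segments) != 2:
        raise ValueError(f"expected <database>/<record>: {uri!r}")
    database, _, release = segments[0].partition(";")
    record, _, version = segments[1].partition(";")
    if not database or not record:
        raise ValueError(f"empty database or record name: {uri!r}")
    return BioName(database, release or None, record, version or None)


def matches(partial: BioName, full: BioName) -> bool:
    """True if the fully qualified name satisfies the partial one."""
    return (partial.database == full.database
            and partial.record == full.record
            and partial.release in (None, full.release)
            and partial.version in (None, full.version))


def candidates(uri: str, known: List[str]) -> List[str]:
    """All known fully qualified names that a partial name could refer to."""
    partial = parse_bio(uri)
    return [name for name in known if matches(partial, parse_bio(name))]


def release_key(name: str) -> Tuple[int, ...]:
    """Order releases numerically (assumes dotted numbers like '104.2')."""
    release = parse_bio(name).release or "0"
    return tuple(int(part) for part in release.split("."))


def resolve_latest(uri: str, known: List[str]) -> Optional[str]:
    """Pick the candidate from the most recent release this resolver knows."""
    found = candidates(uri, known)
    return max(found, key=release_key) if found else None


if __name__ == "__main__":
    known = ["bio:swissprot;37/100K_RAT;1", "bio:swissprot;38/100K_RAT;2"]
    print(candidates("bio:swissprot/100K_RAT", known))
    print(resolve_latest("bio:swissprot/100K_RAT", known))
```

The answer from resolve_latest depends entirely on the contents of the "known" list, which is exactly the dependence on the resolver's knowledge described above. The candidates function returns every fully qualified name that an unversioned name could stand for.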

Since "bio:swissprot/100K_RAT" isn't a URN, I can't say "urn:bio:swissprot/100K_RAT", so it ends up being a generic URI scheme instead - and one which includes some fully qualified URIs which fit the functional definition of URNs. (BTW, I may want to look at historical trends, so there needs to be a way to get all the resources that can be accepted by a given partially qualified URI.)

So now that I've said all this, what's the accepted way to handle version information in the URI framework, and why isn't it better documented anywhere - or did I simply miss it?

Sincerely,

Andrew
dalke@acm.org

P.S. For what we're doing, I would like to use path_segments like

    <name>;<version>/<name>;<version>

because it maps well to the NamingService Identifier used in CORBA, which looks like

    name.version/name.version

P.P.S. I am not subscribed to uri@w3.org, so while I will read the on-line archives to make sure I don't miss anything, I would appreciate an emailed copy of any follow-ups.

Received on Sunday, 24 June 2001 03:25:31 UTC