versions in URIs from Andrew Dalke on 2001-06-24 (uri@w3.org from June 2001)

From: Andrew Dalke <dalke@acm.org>
Date: Sun, 24 Jun 2001 15:57:53 +0900
To: uri@w3.org
Message-Id: <4.2.0.58.J.20010624155747.03d8ded0@sh.w3.mag.keio.ac.jp>

Hello,

   I'm working with others to try to come up with a naming scheme
for bioinformatics data types.  (Bioinformatics, broadly speaking,
is the use of computers to understand biology but usually applies
to cellular mechanisms, with an emphasis on DNA, RNA and protein
sequences.)   I'm running into problems with how to treat different
versions of the same record.  But first, some background.

   This project is just starting up.  We've picked what I hope will
be the easiest topic, which is naming records in a sequence database
in a fashion that allows people and software to use the same name
to refer to the same record.  This implies coming up with some sort
of URN or URI scheme.

   There are hundreds of sequence and sequence-related databases of
which about 20 are commonly used in daily research.  Each record
in a specific release of a database has a unique key.  Each
database has a unique name.

   For demonstration purposes, assume one of the databases is
named "swissprot" and a record in that database is named "100K_RAT".
One possible naming scheme to use is

    bio:swissprot/100K_RAT

Another one would be to stick this in the urn: namespace, which if
I read the RFC correctly would be written

    urn:x-bio:swissprot/100K_RAT

This is easy to understand but there is a problem we're running
into, and I'm hoping for advice from people here.

How should we handle versions?

There are two types of versions - one is the database release version,
so "SWISS-PROT Release 38" or "PIR Release 104.2", and the other is
the version of the record, which might be written "100K_RAT.1",
"100K_RAT.2", etc.

(The record versioning is usually done for databases with no clear
release date - as for records available through the web and
continuously updated on the back-end.  In some cases you can ask
the database for different versions of the same records, which is
useful if you want to compare current records with historical ones.)

Upon a close reading of RFC 2396 I noticed section 3.3 on "Path
Component", which mentions a "param" part of each path segment.
That suggests a possible URI naming scheme like

    bio:swissprot;38/100K_RAT

where the ";38" means this is swissprot release 38.  Similarly,

    bio:swissprot/100K_RAT;2

could be used to specify version "2" of the 100K_RAT record.
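To make the proposed syntax concrete, here is a minimal Python sketch that splits such a name into its four parts. The `bio:` scheme and the ";" parameter convention are the ones proposed above; the names `BioName` and `parse_bio_name` are hypothetical, and the two-segment layout is an assumption taken from the examples.

```python
from typing import NamedTuple, Optional, Tuple


class BioName(NamedTuple):
    database: str
    release: Optional[str]  # e.g. "38", or None when no release is given
    record: str
    version: Optional[str]  # e.g. "2", or None when no version is given


def _split_param(segment: str) -> Tuple[str, Optional[str]]:
    """Split "name;param" into (name, param); param is None if absent."""
    name, sep, param = segment.partition(";")
    return name, (param if sep else None)


def parse_bio_name(uri: str) -> BioName:
    """Parse a name of the assumed form bio:<database>[;<release>]/<record>[;<version>]."""
    scheme, sep, path = uri.partition(":")
    if not sep or scheme != "bio":
        raise ValueError(f"not a bio: name: {uri!r}")
    segments = path.split("/")
    if len(segments) != 2:
        raise ValueError(f"expected <database>/<record>, got {path!r}")
    database, release = _split_param(segments[0])
    record, version = _split_param(segments[1])
    return BioName(database, release, record, version)


print(parse_bio_name("bio:swissprot;38/100K_RAT"))
print(parse_bio_name("bio:swissprot/100K_RAT;2"))
```

A missing parameter is kept as `None` rather than an empty string, so an unversioned name stays distinguishable from one with an empty version.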

The problem is that I can't find anything that suggests that
this is valid use of the param field - indeed, I can't find
anything which actually uses that, outside of some mentions
in the RFC itself.

For that matter, I can't find anything which described how to
handle versions in URLs.  Consider RFC 1737 ("Functional
Requirements for Uniform Resource Names"), which includes:
   o Global scope: A URN is a name with global scope which does
     not imply a location.  It has the same meaning everywhere.

   o Global uniqueness: The same URN will never be assigned to
     two different resources.

Suppose I ask for "bio:swissprot/100K_RAT", that is, an
unversioned record.  This is a useful name even without the
version because in most cases people mean it to refer to the
most recent version of that record.

Now I try to resolve it.  My resolver happens to know about
two different swissprot releases, versions 37 and 38.  Since
version 38 is more recent, it returns "bio:swissprot;38/100K_RAT;2"
instead of "bio:swissprot;37/100K_RAT;1".  And it just happens
that there was a mistake in the sequencer which got caught, so
the sequence in release 38 is different than that in 37.

The original name, "bio:swissprot/100K_RAT", therefore is not
a URN, because it doesn't have global scope (it depends on
which databases the resolver knows about) and doesn't have
global uniqueness (there was a choice of returning records from
release 37 or from release 38).
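The dependence on the resolver's knowledge can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `resolve_latest` and the dictionary layout (release number mapped to record versions) are hypothetical, and the release and version numbers are the ones from the example above.

```python
from typing import Dict


def resolve_latest(database: str, record: str,
                   known_releases: Dict[int, Dict[str, int]]) -> str:
    """Return the fully qualified name from the newest release this resolver knows."""
    candidates = [rel for rel, records in known_releases.items() if record in records]
    if not candidates:
        raise LookupError(f"{record!r} not found in any known release")
    release = max(candidates)
    version = known_releases[release][record]
    return f"bio:{database};{release}/{record};{version}"


# A resolver that knows both releases, and one that only knows release 37.
both = {37: {"100K_RAT": 1}, 38: {"100K_RAT": 2}}
only_37 = {37: {"100K_RAT": 1}}

print(resolve_latest("swissprot", "100K_RAT", both))
print(resolve_latest("swissprot", "100K_RAT", only_37))
```

The same unversioned input yields two different fully qualified names depending on which releases the resolver holds, which is the scope problem described above.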

But note that "bio:swissprot;38/100K_RAT;2" and
"bio:swissprot;37/100K_RAT;1" are URNs.

Since "bio:swissprot/100K_RAT" isn't a urn, I can't say
"urn:bio:swissprot/100K_RAT" so it ends up being a generic
URI scheme instead - and one which includes some fully qualified
URIs which fit the functional definition of URNs.

(BTW, I may want to look at historical trends so there needs
to be a way to get all the resources that can be accepted by
a given partially qualified URI.)
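One way such a lookup might work, sketched in Python: a partially qualified name matches a fully qualified one when every segment name agrees and every parameter the partial name does specify agrees too. The function `matches` and the sample list are hypothetical; the ";" convention is the one proposed above.

```python
def matches(partial: str, full: str) -> bool:
    """True if `full` is consistent with every part that `partial` specifies."""
    p_segments = partial.split("/")
    f_segments = full.split("/")
    if len(p_segments) != len(f_segments):
        return False
    for p_seg, f_seg in zip(p_segments, f_segments):
        p_name, p_has_param, p_param = p_seg.partition(";")
        f_name, _, f_param = f_seg.partition(";")
        if p_name != f_name:
            return False
        # Only compare parameters that the partial name actually gives.
        if p_has_param and p_param != f_param:
            return False
    return True


known = [
    "bio:swissprot;37/100K_RAT;1",
    "bio:swissprot;38/100K_RAT;2",
]

# All historical versions of the record:
print([name for name in known if matches("bio:swissprot/100K_RAT", name)])
# Only those from release 38:
print([name for name in known if matches("bio:swissprot;38/100K_RAT", name)])
```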

So now that I've said all this, what's the accepted way to
handle version information in the URI framework, and why isn't
it better documented anywhere - or did I simply miss it?

Sincerely,
                     Andrew
                     dalke@acm.org

P.S.
   For what we're doing, I would like to use path_segments like
      <name>;<version>/<name>;<version>
because it maps well to the NamingService Identifier used in CORBA,
which looks like
      name.version/name.version
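As a rough illustration of that mapping, the following Python sketch rewrites a `bio:` name into the name.version/name.version form. The function `to_corba_style` is hypothetical, it simply drops the scheme, and it does not attempt any escaping of "." or "/" characters that might appear inside a name.

```python
def to_corba_style(name: str) -> str:
    """Rewrite <scheme>:<name>;<version>/... as <name>.<version>/..."""
    scheme, sep, path = name.partition(":")
    if not sep:
        raise ValueError(f"no scheme in {name!r}")
    parts = []
    for segment in path.split("/"):
        ident, has_param, version = segment.partition(";")
        # Segments without a version are passed through unchanged.
        parts.append(f"{ident}.{version}" if has_param else ident)
    return "/".join(parts)


print(to_corba_style("bio:swissprot;38/100K_RAT;2"))
```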

P.P.S.
   I am not subscribed to uri@w3.org so while I will read the
on-line archives to make sure I don't miss anything, I would
appreciate an emailed copy of any follow-ups.
Received on Sunday, 24 June 2001 03:25:31 UTC