Re: versions in URIs from Michael Mealling on 2001-06-24 (uri@w3.org from June 2001)

From: Michael Mealling <michaelm@neonym.net>
Date: Sun, 24 Jun 2001 11:48:50 -0400
To: Andrew Dalke <dalke@acm.org>
Cc: uri@w3.org
Message-ID: <20010624114849.T2188@bailey.dscga.com>

On Sun, Jun 24, 2001 at 03:57:53PM +0900, Andrew Dalke wrote:
>    I'm working with others to try to come up with a naming scheme
> for bioinformatics data types.  (Bioinformatics, broadly speaking,
> is the use of computers to understand biology but usually applies
> to cellular mechanisms, with an emphasis on DNA, RNA and protein
> sequences.)   I'm running into problems with how to treat different
> versions of the same record.  But first, some background.

Cool application. I've always thought the management issues surrounding
the human genome by itself have to be insanely hard....

>    There are hundreds of sequence and sequence-related databases of
> which about 20 are commonly used in daily research.  Each record
> in a specific release of a database has a unique key.  Each
> database has a unique name.

Just out of curiosity, what is the authority that maintains the registry
of database names?

>    For demonstration purposes, assume one of the databases is
> named "swissprot" and a record in that database is named "100K_RAT".
> One possible naming scheme to use is
> 
>     bio:swissprot/100K_RAT

Seems straightforward....

> Another one would be to stick this in the urn: namespace, which if
> I read the RFC correctly would be written
> 
>     urn:x-bio:swissprot/100K_RAT

The main question for it being a URN namespace is: do you need the 
persistent naming semantics of URNs?

> How should we handle versions?
> 
> There are two types of versions - one is the database release version,
> so "SWISS-PROT Release 38" or "PIR Release 104.2", and the other is
> the version of the record, which might be written "100K_RAT.1",
> "100K_RAT.2", etc.
> 
> (The record versioning is usually done for databases with no clear
> release date - as for records available through the web and
> continuously updated on the back-end.  In some cases you can ask
> the database for different versions of the same records, which is
> useful if you want to compare current records with historical ones.)

I would take that to mean that you really do want to make sure that
a record's name isn't re-used at some later date?

> Upon close read of RFC 2396 I noticed the section 3.3 on "Path
> Component", which mentions a "param" part of the segment in a
> path segment.  That suggests a possible URI naming scheme like
> 
>     bio:swissprot;38/100K_RAT
> 
> where the ";38" means this is swissprot release 38.  Similarly,
> 
>     bio:swissprot/100K_RAT;2
> 
> could be used to specify version "2" of the 100K_RAT record.

Yep.
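
A minimal sketch of how names in that shape could be taken apart,
assuming the hypothetical "bio:" scheme above with exactly two path
segments and at most one ";param" per segment. The function names are
made up for illustration and are not part of any registered scheme.

```python
def split_param(segment):
    """Split 'name;param' into (name, param); param is None if absent."""
    name, sep, param = segment.partition(";")
    return name, (param if sep else None)


def parse_bio_uri(uri):
    """Parse a hypothetical bio: URI into (database, release, record, version).

    Assumes the layout bio:<database>[;<release>]/<record>[;<version>].
    """
    scheme, sep, rest = uri.partition(":")
    if sep != ":" or scheme != "bio":
        raise ValueError("not a bio: URI: %r" % (uri,))
    segments = rest.split("/")
    if len(segments) != 2:
        raise ValueError("expected <database>/<record>: %r" % (uri,))
    database, release = split_param(segments[0])
    record, version = split_param(segments[1])
    return database, release, record, version


if __name__ == "__main__":
    print(parse_bio_uri("bio:swissprot;38/100K_RAT;2"))
    print(parse_bio_uri("bio:swissprot/100K_RAT"))
```

The release and version are kept as strings here, since a release label
such as "104.2" is not a plain integer.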

> The problem is that I can't find anything that suggests that
> this is valid use of the param field - indeed, I can't find
> anything which actually uses that, outside of some mentions
> in the RFC itself.

IMHO, it is a valid use since it's putting a version statement on the
authority. The tag: scheme recently discussed puts a timestamp
there as a way to ensure that a reference to a domain-name is
'permanent' in case the domain-name is re-used...

> For that matter, I can't find anything which described how to
> handle versions in URLs.  Consider RFC 1737 ("Functional
> Requirements for Uniform Resource Names") which includes:
>    o Global scope: A URN is a name with global scope which does
>      not imply a location.  It has the same meaning everywhere
> 
>    o Global uniqueness: The same URN will never be assigned to
>      two different resources.

When you read the word 'resources' here it is in the _abstract_
sense, not the 'sequence of network bits' sense....

> Suppose I ask for "bio:swissprot/100K_RAT", that is, an
> unversioned record.  This is a useful name even without the
> version because in most cases people mean it to refer to the
> most recent version of that record.

Which is valid in URN space. The 'resource' that URN is bound to
is the "current record". This is a classic example of the "weather map"
scenario. There is a weather map resource on the Internet that changes
every 10 minutes to show the current weather. You can assign two URNs
to this weather map. The first one is bound to the concept of "the
current weather map". The second one is bound to each version of the map.

> Now I try to resolve it.  My resolver happens to know about
> two different swissprot releases, version 37 and 38.  Since
> version 38 is more recent, it returns "bio:swissprot;38/100K_RAT;2"
> instead of "bio:swissprot;37/100K_RAT;1".  And it just happens
> that there was a mistake in the sequencer which got caught, so
> the sequence in release 38 is different than that in 37.
> 
> The original name, "bio:swissprot/100K_RAT" therefore is not
> a URN, because it doesn't have global scope (it depends on
> which databases the resolver knows about) 

Not necessarily. The URN is bound to the abstract concept "the most
current record you know about" (And these semantics are kind of up
to you anyway). It would be the same situation as if you were 
temporarily disconnected from the Internet and you attempted
to resolve any URI based on incomplete knowledge.
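
To make the "most current record you know about" reading concrete, here
is a rough sketch of a resolver working from purely local knowledge. The
KNOWN table and the function name are assumptions for illustration; the
release and version numbers are the ones from the example above.

```python
# Hypothetical local knowledge: (database, record) -> {release: record version}
KNOWN = {
    ("swissprot", "100K_RAT"): {37: 1, 38: 2},
}


def resolve_current(database, record, known=KNOWN):
    """Return the fully versioned name for the newest release known locally.

    The unversioned name stays bound to "the current record"; what changes
    from resolver to resolver is only how much each one knows.
    """
    releases = known.get((database, record))
    if not releases:
        raise LookupError("nothing known about %s/%s" % (database, record))
    latest = max(releases)  # the highest release number counts as most recent
    return "bio:%s;%d/%s;%d" % (database, latest, record, releases[latest])


if __name__ == "__main__":
    print(resolve_current("swissprot", "100K_RAT"))
```

A resolver that only knows release 37 would hand back the ";37" name for
the same input, just as a disconnected client would answer from whatever
it had cached.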

> and doesn't have
> global uniqueness (there was choice of returning records from
> release 37 or from release 38).

Again, in this case that is the thing the "abstract concept" talks
about, not the binding of the URN to that concept.

> But note that "bio:swissprot;38/100K_RAT;2" and
> "bio:swissprot;37/100K_RAT;1" are URNs.

Sure....

> Since "bio:swissprot/100K_RAT" isn't a urn, I can't say
> "urn:bio:swissprot/100K_RAT" so it ends up being a generic
> URI scheme instead - and one which includes some fully qualified
> URIs which fit the functional definition of URNs.

You could do that but IMHO it isn't required. 'bio:swissprot/100K_RAT'
is a valid URN....

> (BTW, I may want to look at historical trends so there needs
> to be a way to get all the resources that can be accepted by
> a given partially qualified URI.)

And in that case it is a particular metadata query for the URN that
is bound to the concept from which you want this historical information.

> So now that I've said all this, what's the accepted way to
> handle version information in the URI framework, and why isn't
> it better documented anywhere - or did I simply miss it?

The main reason 'version' isn't in the framework is that 'version'
means so many things to so many people. We had enough issues with
the simple [sic] concepts of names, addresses and identifiers that
tackling that one was simply too hard. We also figured we'd given
enough freedom at certain parts of the URI that versioning could
be used rationally and if someone had come up with a fairly good
solution we could start adopting it...

> P.S.
>    For what we're doing, I would like to use path_segments like
>       <name>;<version>/<name>;<version>
> because it maps well to the NamingService Identifier used in CORBA,
> which looks like
>       name.version/name.version

And that creates a slight issue for URNs since you have to % encode
'/' in a URN since the regular URI hierarchy semantics are deprecated
in URNs....
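
A small sketch of what that encoding might look like, assuming the
experimental "x-bio" namespace from earlier in this message and assuming
that ';' may stay literal in the namespace-specific string while '/' must
be escaped. The helper name is made up for illustration.

```python
from urllib.parse import quote


def to_urn(database_segment, record_segment, nid="x-bio"):
    """Build a URN whose namespace-specific string escapes '/' as %2F.

    Assumption: ';' is left as-is, so the ;version markers stay readable.
    """
    nss = "%s/%s" % (database_segment, record_segment)
    return "urn:%s:%s" % (nid, quote(nss, safe=";"))


if __name__ == "__main__":
    print(to_urn("swissprot;38", "100K_RAT;2"))
```

The result no longer looks hierarchical, which is the point: anything
consuming the URN should treat the part after the namespace as opaque.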

>    I am not subscribed to uri@w3.org so while I will read the
> on-line archives to make sure I don't miss anything, I would
> appreciate an emailed copy of any follow-ups.

Sure....

-MM

-- 
--------------------------------------------------------------------------------
Michael Mealling	|      Vote Libertarian!       | urn:pin:1
michael@neonym.net      |                              | http://www.neonym.net
                        |                              | go:Michael Mealling
Received on Sunday, 24 June 2001 11:53:21 UTC