proposal for standard NCBI database URI from Alan Ruttenberg on 2006-05-09 (public-semweb-lifesci@w3.org from May 2006)

From: Alan Ruttenberg <alanruttenberg@gmail.com>
Date: Tue, 9 May 2006 00:00:39 -0400
To: public-semweb-lifesci@w3.org
Message-Id: <F63926B6-3E18-422A-A47C-38F02E05F87B@gmail.com>
As far as I know there is no standard URI for a resource at NCBI. I  
would like to propose that there be one, since we will all need them  
to use when we refer to these resources  in our RDF. (and I need one  
*now*)

Following other styles I've seen, I propose the following:

1. http://www.ncbi.nlm.nih.gov/2006/entrez/<DATABASE_GOES_HERE>/<IDENTIFIER_GOES_HERE>

or

2. http://www.ncbi.nlm.nih.gov/2006/entrez/<DATABASE_GOES_HERE>#<IDENTIFIER_GOES_HERE>

The list of valid databases can be viewed, e.g. in the popup at
http://www.ncbi.nih.gov/Database/datamodel/index.html

I propose that the database names all be made lower case for use in
these URIs

e.g.

1: http://www.ncbi.nlm.nih.gov/2006/entrez/gene/596
2: http://www.ncbi.nlm.nih.gov/2006/entrez/gene#596
1: http://www.ncbi.nlm.nih.gov/2006/entrez/protein/NP_000624
2: http://www.ncbi.nlm.nih.gov/2006/entrez/protein#NP_000624
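
To make the two styles concrete, here is a minimal Python sketch that
builds the example URIs from a database name and an identifier. The
function name entrez_uri and its style argument are hypothetical, made
up for illustration; nothing at NCBI provides or serves these.

```python
from urllib.parse import quote

# Base of the proposed URI space (from the proposal; not served by NCBI).
BASE = "http://www.ncbi.nlm.nih.gov/2006/entrez/"


def entrez_uri(database: str, identifier: str, style: int = 1) -> str:
    """Build a proposed URI; style 1 joins with '/', style 2 with '#'."""
    if style not in (1, 2):
        raise ValueError("style must be 1 or 2")
    # Database names are lowercased, as proposed above.
    db = quote(database.strip().lower(), safe="")
    ident = quote(identifier.strip(), safe="")
    separator = "/" if style == 1 else "#"
    return f"{BASE}{db}{separator}{ident}"


print(entrez_uri("Gene", "596"))
print(entrez_uri("Protein", "NP_000624", style=2))
```

Percent-encoding the two parts is my own precaution, in case an
identifier ever contains a character that is reserved in URIs.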

I am partial to #1, because from a document service point of view it
doesn't have the implication that there is a big document full of
e.g. genes, and that you should find the one you are looking for at a
specific place in that document.

Some suggestions, meant to avoid various excuses why we might not  
make a decision about this promptly:

1. Initial proposal is that we don't have to choose among the
different identifiers, i.e. both of these are acceptable:

http://www.ncbi.nlm.nih.gov/2006/entrez/protein/NP_000624
and
http://www.ncbi.nlm.nih.gov/2006/entrez/protein/72198189

Rationale: We can use owl:sameAs to make them the same if we need to.
We can suggest a best practice if we want to preferentially use one
numbering system versus another. (I like the alphanumeric ones, myself)

2. Initial proposal is that we don't include version information in  
these identifiers

Rationale: We can later decide to also have those, and then add
relations to connect the versions to the abstract, unversioned URIs.
I will claim that for most of the work we are doing in this WG, the
versions don't matter.

3. This proposal is not meant to oppose using LSIDs. However, I will
note that there doesn't seem to be a working combination of a) a
specification of what these look like for NCBI, and b) a working
resolver for the few examples I've seen[*]. Thus implementing LSIDs
will require work = delay. However, there is no reason that an LSID
solution, when it comes online, can't be compatible with the choice
we make now: its metadata can mention these URIs, and vice versa once
documents start to be served from these addresses.

4. Just because no web page is available at these URLs currently
doesn't mean we shouldn't use them. There is a pressing need for
stable identifiers, and I would argue that while having a document at
the URL is polite, not having one shouldn't block us from having an
identifier solution. However, an easy thing to do would be to put up a
simple CGI that accepts all URLs below
http://www.ncbi.nlm.nih.gov/2006/entrez/, parses out the db and id,
and says something polite.

5. If we screw up we can always bump to

http://www.ncbi.nlm.nih.gov/2007/
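
Suggestions 1 and 4 above are mechanical enough to sketch in a few
lines of Python. This is only an illustration under stated
assumptions: the function names are hypothetical, the pattern assumes
database names use only lower-case letters, digits and underscores,
and the owl:sameAs line assumes the two protein identifiers really do
denote the same record.

```python
import re

BASE = "http://www.ncbi.nlm.nih.gov/2006/entrez/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Path layout from the proposal (style #1): /2006/entrez/<database>/<identifier>
# Assumption: database names are lower-case letters, digits or underscores.
_PATH = re.compile(r"/2006/entrez/(?P<db>[a-z0-9_]+)/(?P<id>[^/?#]+)")


def parse_entrez_path(path):
    """Return (database, identifier), or None if the path does not match."""
    match = _PATH.fullmatch(path)
    return (match.group("db"), match.group("id")) if match else None


def polite_reply(path):
    """Text a placeholder handler might return for a requested path."""
    parsed = parse_entrez_path(path)
    if parsed is None:
        return "Sorry, this does not look like an Entrez identifier URI."
    db, ident = parsed
    return (f"This URI identifies record {ident} in the Entrez {db} "
            "database. No document is served here yet.")


def same_as_triple(uri_a, uri_b):
    """Serialize an owl:sameAs statement as one N-Triples line."""
    return f"<{uri_a}> <{OWL_SAME_AS}> <{uri_b}> ."


print(polite_reply("/2006/entrez/gene/596"))
print(same_as_triple(BASE + "protein/NP_000624", BASE + "protein/72198189"))
```

A real handler would wrap polite_reply in whatever CGI or web
framework is at hand; the parsing step is the only part that depends
on the URI layout.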

-Alan

[*]

plug

urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:genbank_gi:54306556

suggested at

http://lsid.biopathways.org/authorities.shtml

into

http://linnaeus.zoology.gla.ac.uk/~rpage/lsid/tester/
Received on Tuesday, 9 May 2006 04:00:50 UTC