Re: proposal for standard NCBI database URI from Matt Halstead on 2006-05-10 (public-semweb-lifesci@w3.org from May 2006)

From: Matt Halstead <matt.halstead@auckland.ac.nz>
Date: Wed, 10 May 2006 14:14:38 +1200
To: Matthias Samwald <samwald@gmx.at>
Cc: <public-semweb-lifesci@w3.org>
Message-Id: <1B782EF4-95DB-452B-9E26-6768DB2970FF@auckland.ac.nz>
On 9/05/2006, at 8:46 PM, Matthias Samwald wrote:

>
> Hi Alan,
>
>>  As far as I know there is no standard URI for a resource at NCBI. I
>>  would like to propose that there be one, since we will all need
>>  them to use when we refer to these resources  in our RDF. (and I
>>  need one *now*)
>
> I think we should be aware that this could be a VERY important
> decision for the further development of RDF in the life sciences.
> The URI scheme we come up with during this project would probably
> become THE standard for referencing resources at the NCBI. I guess
> we should try to contact someone from the NCBI to make sure the
> solution we come up with is acceptable to them. Maybe they will
> soon realize the need for URIs themselves and start creating their
> own, conflicting URI scheme. The last thing the Semantic Web would
> need would be two different URIs for each of the many resources in
> the Entrez databases.
>
>
>>  Following other styles I've seen, I propose the following:
>>
>>
>>  1. http://www.ncbi.nlm.nih.gov/2006/entrez/<DATABASE_GOES_HERE>/
>>  <IDENTIFIER_GOES_HERE>
>>
>>  or
>>
>>
>>  2. http://www.ncbi.nlm.nih.gov/2006/entrez/
>>  <DATABASE_GOES_HERE>#<IDENTIFIER_GOES_HERE>

In my experience, leaving #identifier free for the
most fine-grained data resources gives the most readable URIs. I use
a composition rule:

If dataresource_a is composed of dataresource_b and dataresource_c,
and dataresource_b and dataresource_c cease to exist if
dataresource_a is destroyed, then the URI would be something like
<domainname>/<database>/dataresource_a#<identifier>, where <identifier>
would be dataresource_b or dataresource_c. The #identifier typically
appears in document-specific contexts, i.e. ids that are unique
within a particular document. But extending this to a database means
that these documents are likely to be quite dynamic, and the document
specificity of ids becomes blurred. That's why I'm trying composition
(not aggregation) as a rule for when something is a #id. Not too sure
on the results yet.
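
The composition rule above can be sketched roughly as follows. This is
only an illustration: the function name compose_uri, the domain
example.org and the database name "somedb" are made up for the example,
and percent-encoding each path segment is an assumption of the sketch,
not part of the proposal.

```python
from urllib.parse import quote, urldefrag


def compose_uri(base, database, whole_id, part_id=None):
    """Build a URI for a record, or for a part that exists only within it."""
    uri = f"{base.rstrip('/')}/{quote(database, safe='')}/{quote(whole_id, safe='')}"
    if part_id is not None:
        # Composition: the part cannot outlive the whole, so it is
        # addressed as a fragment of the whole's URI.
        uri += "#" + quote(part_id, safe="")
    return uri


BASE = "http://example.org"  # hypothetical domain, for illustration only

record = compose_uri(BASE, "somedb", "dataresource_a")
part_b = compose_uri(BASE, "somedb", "dataresource_a", "dataresource_b")

# Stripping the fragment recovers the URI of the containing resource.
whole, fragment = urldefrag(part_b)
```

An aggregated resource (one that survives the destruction of its
container) would instead get its own path-style URI via
compose_uri(BASE, database, its_own_id).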


>
> We should have a look at how applications (especially triplestores)
> handle this. Do they know how to split namespace from identifier in
> the first case? I remember that the current version of the
> triplestore Sesame has some performance problems when handling
> URNs, because it splits namespace and identifier in the wrong way
> (creating a new namespace for almost every resource). I know that,
> according to the RDF specification, the RDF ID is just an opaque
> string, but applications do handle that differently.
>
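
To make the splitting question concrete, here is a minimal sketch of
one possible heuristic: split after '#' if present, otherwise after the
last '/', otherwise after the last ':'. This is an assumption for
illustration and is not claimed to be what Sesame or any other
triplestore actually does. The database name "somedb" and the numeric
identifiers are placeholders.

```python
def split_uri(uri):
    """Split a URI into (namespace, local_name) using a simple heuristic."""
    # Assumed preference order: '#', then the last '/', then the last ':'.
    for sep in ("#", "/", ":"):
        idx = uri.rfind(sep)
        if idx != -1:
            return uri[: idx + 1], uri[idx + 1 :]
    return uri, ""


ENTREZ = "http://www.ncbi.nlm.nih.gov/2006/entrez/"
ids = ["101", "102", "103"]  # placeholder identifiers

# Form 1 (slash) and form 2 (hash) from the proposal.
slash_uris = [f"{ENTREZ}somedb/{i}" for i in ids]
hash_uris = [f"{ENTREZ}somedb#{i}" for i in ids]

# Count how many distinct namespaces each form produces.
slash_namespaces = {split_uri(u)[0] for u in slash_uris}
hash_namespaces = {split_uri(u)[0] for u in hash_uris}
```

Under this particular heuristic both forms yield a single namespace per
database; a splitter that only recognises '#' would treat each
slash-form URI as having no separable local name.
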
>>  Rationale: can use owl:sameAs to make them the same if we need to.
>>  We can suggest a best practice if we want to preferentially use one
>>  numbering system versus another. (I like the alphanumeric ones,
>>  myself)
>
> We would not be happy to have huge amounts of redundant resources  
> linked with owl:sameAs. owl:sameAs is nice when it only needs to be  
> used sparingly, but having two different naming schemes of a large  
> protein database linked through owl:sameAs would 'pollute' the  
> Semantic Web right from the beginning. We should seek to avoid this  
> when we are still in the position to do so.

I cannot see how this can be avoided. The bigger picture is that
different databases and groups associated with them will use
different URI schemes for describing the same thing. Also, things
that were once deemed not the same may later come to be thought of as
the same. It is also impossible to predict what URI naming schemes will
make sense further down the track, or what factors various engines
might play on (Swoogle for instance). What I think is needed
is a combination of careful thought and tools for URI normalisation:
yes, there may come a time when suddenly a sameAs property is
defined for every database record, but a tool can then be used by
anyone to normalise to a preferred URI. Sort of like an agent's own
cache victim, but for semantic web services where you may query a
service with one URI, and if that is not a currently active version,
the web service would say "but that URI is also the sameAs this
preferred one", and so the agent can agree to update its URI and
re-perform the query.
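
A minimal sketch of such a normalisation tool, assuming the sameAs
statements are available as plain (uri, uri) pairs and that "preferred"
simply means "starts with a prefix the user chooses". The function
names, the example URIs and the prefix rule are all made up for
illustration.

```python
def build_classes(same_as_pairs):
    """Group URIs into equivalence classes from (a, b) sameAs pairs."""
    parent = {}

    def find(u):
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for a, b in same_as_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    classes = {}
    for u in list(parent):
        classes.setdefault(find(u), set()).add(u)
    return list(classes.values())


def make_normaliser(same_as_pairs, preferred_prefix):
    """Return a function mapping any known URI to its preferred equivalent."""
    canonical = {}
    for members in build_classes(same_as_pairs):
        # Prefer a URI under the chosen prefix; break ties alphabetically
        # so the result is deterministic.
        best = min(members, key=lambda u: (not u.startswith(preferred_prefix), u))
        for u in members:
            canonical[u] = best
    # Unknown URIs are returned unchanged.
    return lambda uri: canonical.get(uri, uri)


# Hypothetical URIs from three different schemes for the same record.
pairs = [
    ("http://a.example/gene/1", "http://b.example/entrez/gene#1"),
    ("http://b.example/entrez/gene#1", "urn:example:gene:1"),
]
normalise = make_normaliser(pairs, "http://b.example/")
preferred = normalise("urn:example:gene:1")
```

An agent holding a stale URI would call normalise() on it and re-issue
its query with the result; a service could do the same on the server
side and return the preferred URI alongside its answer.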



>
>
> kind regards,
> Matthias Samwald
>
>
>
> http://neuroscientific.net
>
> Section on Medical Expert and Knowledge-Based Systems
> Core Unit for Medical Statistics and Informatics
> Medical University of Vienna/Austria
> http://www.meduniwien.ac.at/mes/home_en.html
>
>
Received on Wednesday, 10 May 2006 02:14:56 UTC