Re: Rules (was Re: Ambiguous names. was: Re: URL +1, LSID -1) from Xiaoshu Wang on 2007-07-18 (public-semweb-lifesci@w3.org from July 2007)

From: Xiaoshu Wang <wangxiao@musc.edu>
Date: Wed, 18 Jul 2007 11:02:48 +0100
To: Alan Ruttenberg <alanruttenberg@gmail.com>
CC: Eric Jain <Eric.Jain@isb-sib.ch>, Chris Mungall <cjm@fruitfly.org>, Bijan Parsia <bparsia@cs.man.ac.uk>, public-semweb-lifesci hcls <public-semweb-lifesci@w3.org>, Darren Natale <dan5@georgetown.edu>
Message-ID: <469DE548.8030602@musc.edu>

I agree with Alan but feel sympathy for Eric as well.  In the absence of 
a universally accepted ontology for describing biological entities, Eric 
has to develop something to start working on SW. 

But please note, just because "http://purl.uniprot.org/core/Protein" 
contains the string "Protein" does not make it the identifier for 
*Protein*, unless everyone else agrees to it.  In an open world 
environment, which RDF is in, everything makes sense as long as there is 
no contradiction.  The ambiguity problem will only arise when the term 
is to be aligned with other terms, which is not the case yet. The 
development of SW will be an evolving process because it is impossible 
to get things right at the very first try.  I think a best-practice 
guideline should encourage people to (1) try to reuse an existing 
ontology and (2) if no such ontology exists, build their own.  Eric's 
case obviously falls into the second category. If more users agree on 
the UniProt ontology, that is great, and it can gradually evolve into a 
standard.  If not, we can learn some lessons.
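
To make the open-world point concrete, here is a minimal sketch in plain 
Python (no RDF library). Everything except the UniProt URI is a made-up 
assumption for illustration: the example.org URIs, the predicate names, 
and the toy `contradictions` helper, which only does a single pass over 
equivalences rather than real OWL reasoning. It shows that the local 
statements are consistent on their own, and that a clash only appears 
once the term is aligned with somebody else's term.

```python
# Toy triples as (subject, predicate, object) tuples. Only the UniProt
# URI comes from the discussion; all other names are hypothetical.
UNIPROT_PROTEIN = "http://purl.uniprot.org/core/Protein"
OTHER_PROTEIN = "http://example.org/onto/Protein"        # assumed
VARIANT_SET = "http://example.org/onto/SpliceVariantSet"  # assumed
RECORD = "http://example.org/data/record1"                # assumed

TYPE, EQUIVALENT, DISJOINT = "type", "equivalentClass", "disjointWith"


def contradictions(triples):
    """Return individuals typed by two classes declared disjoint.

    Simplification: equivalences are applied in one pass, so chains of
    equivalent classes are not fully closed.
    """
    types = {}
    for s, p, o in triples:
        if p == TYPE:
            types.setdefault(s, set()).add(o)
    # An individual typed by one side of an equivalence gets the other.
    for s, p, o in triples:
        if p == EQUIVALENT:
            for classes in types.values():
                if s in classes or o in classes:
                    classes.update({s, o})
    clashes = set()
    for s, p, o in triples:
        if p == DISJOINT:
            for individual, classes in types.items():
                if s in classes and o in classes:
                    clashes.add(individual)
    return sorted(clashes)


# One publisher's statements: a record typed with its own "Protein" term.
local = [(RECORD, TYPE, UNIPROT_PROTEIN), (RECORD, TYPE, VARIANT_SET)]
# Another ontology's stricter notion of protein.
other = [(OTHER_PROTEIN, DISJOINT, VARIANT_SET)]
# The alignment that somebody might assert later.
alignment = [(UNIPROT_PROTEIN, EQUIVALENT, OTHER_PROTEIN)]

print(contradictions(local + other))              # no clash before alignment
print(contradictions(local + other + alignment))  # clash after alignment
```

The string "Protein" inside the URI plays no role in the result; only 
the asserted equivalence does.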

That's my two cents,

Xiaoshu

Alan Ruttenberg wrote:
> In that case, I would recommend  that it is unwise to use Uniprot ids 
> as identifiers of protein classes on the semantic web. Doing so would 
> encourage exactly the kind of ambiguity that we need to avoid in order 
> to write statements that will not confuse semantic web agents 
> (including people).
>
> I would suggest instead, that Uniprot not suggest that they represent 
> specific classes of proteins, and instead keep them being exactly what 
> they are, records containing information about diverse sets of 
> entities, which we all admit is very useful. If there is interest in 
> formalization for semantic web use at Uniprot, perhaps the focus can 
> be instead on the smaller entities on which these records collect 
> information.
>
> Let others who are more interested in providing formal definitions for 
> proteins work on making definitions that carve out specific classes. 
> They can do so in part by pointing at information in the Uniprot 
> records and other sources.
>
> -Alan
>
> On Jul 17, 2007, at 4:33 AM, Eric Jain wrote:
>
>> Alan Ruttenberg wrote:
>>> To clarify, no, I didn't mean this. I meant that the definition of 
>>> Uniprot records are already broad in the sense that sometimes 
>>> multiple splice variants are included in a single record, as are 
>>> population and disease-causing variants, according to Eric. 
>>> Basically I don't know what set of proteins people currently intend 
>>> to denote when they use a uniprot id as a protein, and I'm not 
>>> entirely certain what the curators intend. So step one would be an 
>>> english description of how to figure out what the curator's intent 
>>> is, and we could go on from there to define OWL definitions based on 
>>> that. I suspect that people currently using Uniprot ids may be using 
>>> them in both broader and narrow ways, but we could leave the 
>>> discovery of such cases to a reasoner once we had the basics in place.
>>
>> People do indeed use UniProtKB identifiers in both broad and narrow 
>> ways: The narrow way is to talk about the exact, main sequence that 
>> is shown...
>
>> In any case, I'm not too optimistic about being able to define our 
>> concepts in a strict, yet meaningful way, as often it's practical 
>> criteria that are used to decide, e.g. here's what one of our 
>> curators has to say on this:
>>
>> "[Usually] we have one entry per gene. We have several entries for a 
>> single gene when description of variations are too complicated to 
>> describe in FT lines (of course, this criteria depends on the 
>> annotator). For viruses, it is much more messy, due to ribosomal 
>> frameshifts."
>>
>> Formalize that! :-)
>

Received on Wednesday, 18 July 2007 10:04:03 UTC