protein entities (was Re: Rules (was Re: Ambiguous names. was: Re: URL +1, LSID -1) from Darren Natale on 2007-07-19 (public-semweb-lifesci@w3.org from July 2007)

From: Darren Natale <dan5@georgetown.edu>
Date: Thu, 19 Jul 2007 09:33:34 -0400
To: Alan Ruttenberg <alanruttenberg@gmail.com>
CC: Eric Jain <Eric.Jain@isb-sib.ch>, Chris Mungall <cjm@fruitfly.org>, Bijan Parsia <bparsia@cs.man.ac.uk>, public-semweb-lifesci hcls <public-semweb-lifesci@w3.org>
Message-ID: <469F682E.8080404@georgetown.edu>

Thank you Chris for including me on this thread.  I can well see why you 
did so!

We recently began a new Protein Ontology (PRO) effort geared precisely 
toward the formal definition of the "smaller entities" referred to by 
Alan.  By "we" I mean the PRO Consortium, comprising the PIs Cathy Wu of 
PIR (which is also a member organization of the UniProt Consortium), 
Barry Smith of SUNY Buffalo, and Judy Blake of Jackson Labs.  PRO is 
being developed within the framework of the OBO Foundry, and aims to 
specify protein entities at the level mentioned by Chris (accounting for 
splice variation and post-translational modification and cleavage). 
Where appropriate, PRO will indeed make reference to both other 
ontologies and to UniProt Knowledgebase (UniProtKB) records. 
Furthermore, we are also undertaking the "wildly ambitious" job of 
representing broader, more-inclusive classes of similar proteins based 
on evolutionary relatedness.

A further description of PRO (with examples and link to a paper) can be 
found at http://pir.georgetown.edu/pro

-Darren Natale


Alan Ruttenberg wrote:
> In that case, I would recommend  that it is unwise to use Uniprot ids as 
> identifiers of protein classes on the semantic web. Doing so would 
> encourage exactly the kind of ambiguity that we need to avoid in order 
> to write statements that will not confuse semantic web agents (including 
> people).
> 
> I would suggest instead, that Uniprot not suggest that they represent 
> specific classes of proteins, and instead keep them being exactly what 
> they are, records containing information about diverse sets of entities, 
> which we all admit is very useful. If there is interest in formalization 
> for semantic web use at Uniprot, perhaps the focus can be instead on the 
> smaller entities on which these records collect information.
> 
> Let others who are more interested in providing formal definitions for 
> proteins work on making definitions that carve out specific classes. 
> They can do so in part by pointing at information in the Uniprot records 
> and other sources.
> 
> -Alan
> 
> On Jul 17, 2007, at 4:33 AM, Eric Jain wrote:
> 
>> Alan Ruttenberg wrote:
>>> To clarify, no, I didn't mean this. I meant that the definitions of 
>>> Uniprot records are already broad in the sense that sometimes 
>>> multiple splice variants are included in a single record, as are 
>>> population and disease-causing variants, according to Eric. Basically 
>>> I don't know what set of proteins people currently intend to denote 
>>> when they use a uniprot id as a protein, and I'm not entirely certain 
>>> what the curators intend. So step one would be an English description 
>>> of how to figure out what the curators' intent is, and we could go on 
>>> from there to define OWL definitions based on that. I suspect that 
>>> people currently using Uniprot ids may be using them in both broader 
>>> and narrow ways, but we could leave the discovery of such cases to a 
>>> reasoner once we had the basics in place.
>>
>> People do indeed use UniProtKB identifiers in both broad and narrow 
>> ways: The narrow way is to talk about the exact, main sequence that is 
>> shown...
> 
>> In any case, I'm not too optimistic about being able to define our 
>> concepts in a strict, yet meaningful way, as often it's practical 
>> criteria that are used to decide, e.g. here's what one of our curators 
>> has to say on this:
>>
>> "[Usually] we have one entry per gene. We have several entries for a 
>> single gene when description of variations are too complicated to 
>> describe in FT lines (of course, this criteria depends on the 
>> annotator). For viruses, it is much more messy, due to ribosomal 
>> frameshifts."
>>
>> Formalize that! :-)
> 
>

Received on Thursday, 19 July 2007 19:14:02 UTC