Re: Rules (was Re: Ambiguous names. was: Re: URL +1, LSID -1) from Chris Mungall on 2007-07-17 (public-semweb-lifesci@w3.org from July 2007)

From: Chris Mungall <cjm@fruitfly.org>
Date: Mon, 16 Jul 2007 20:24:39 -0700
To: Eric Jain <Eric.Jain@isb-sib.ch>
Cc: Bijan Parsia <bparsia@cs.man.ac.uk>, public-semweb-lifesci hcls <public-semweb-lifesci@w3.org>, Darren Natale <dan5@georgetown.edu>
Message-Id: <9B04C94B-F19B-422B-A17C-210C32209375@fruitfly.org>

On Jul 16, 2007, at 10:29 AM, Eric Jain wrote:

>
> Bijan Parsia wrote:
>> Eric, I would be very much interested in some more details about  
>> the sort of rules used and how they are used. I personally tend to  
>> distinguish between the use of rules in modeling and the use of  
>> rules for data munging tasks. Obviously, where you draw this  
>> boundary can be a matter of taste and situation, but it seems to  
>> be a useful distinction. It's unclear to me where the rules you  
>> describe fall.
>> There is some effort coming out of OWLED 2007 to improve the  
>> infrastructure situation (from implementation to documentation)  
>> with regard to rules and OWL, so any information you can give on  
>> use patterns and needs would be very helpful. (Also, C&P has a  
>> summer intern working on rule support in Pellet and having real  
>> uses would be nicely motivating :)).
>
> See http://expasy.org/sprot/hamap/unirules.html.
>
> Example rule: http://expasy.org/unirules/MF_00344.

Implementing these kinds of rules using OWL, SWRL or some other logic
or non-logic based formalism would be a nice project, but I think
this is deviating somewhat from the original point. We seem to have
switched from definitions to rules. It's important to keep these
separate when we are talking about OWL, despite the apparent similarities.

We have also switched from talk of defining specific proteins to  
rules to automatically annotate protein records.

Alan:

> I'm not advocating that we build definitions around protein  
> sequences, just that we build definitions, period.
> And that we don't confuse a page of html with a definition.
>
> The uniprot curators are great! They know what they are looking for  
> and they are skilled at finding it. Let's put work into formalizing  
> whatever we can about what they know so that the fruits of their  
> labor can be used effectively on the SW too!
>
> We've got a SW language for making definitions - it's called OWL.  
> If we have class names and definitions even for broad classes of  
> proteins, then we can start to build new definitions by subclassing  
> them, for instance into specific classes of sequence and post- 
> translational variants. Lots of work goes on in the scientific  
> community to characterize specifics about these subclasses and we  
> need a place to anchor that knowledge in the SW.

Eric followed this with:

> One thing I can say here is that there is the trend that curators  
> create rules (and check the outcome) instead of adding data  
> themselves directly. Unfortunately OWL is insufficient for the kind  
> of ugly rules they need to create; maybe SWRL will allow us to  
> distribute at least part of the rules.
>
> Most of the rule-based annotation is done for microbial proteins at  
> the moment, simpler as you don't have to deal with alternative  
> splicing etc. Don't expect any neat rules that define what goes  
> together anytime soon!

(which led to Bijan's request above)

I think this is correct, but I don't think it quite follows from what  
Alan was talking about.

In a sibling node in the same thread DAG, Phil said:

> A uniprot record defines a class of proteins extensionally
...
> It would be more satisfying for us to know intentionally what we  
> mean by
> "protein". It would be good to have a clear set of definitions. But,
> ultimately, I think it would be mistaken. If we have the ability to  
> express
> "the class of protein molecules defined by the swissprot record  
> OPSD_HUMAN",
> then I think we have all we need.
>
> If we make our own definitions, all that we have done is duplicate  
> what the
> uniprot team are already doing. And we will, almost inevitably, do  
> it somewhat
> differently. All we would do is create confusion. The only way that  
> we ensure
> that we do the same thing as uniprot is say "yeah, what they said".
>
> Unsatisfying, maybe. Clear definitions are important. But  
> interoperability,
> and the lack of duplication are more so.

If I understand correctly, Alan is making two requests: one is
low-hanging fruit and the other is wildly ambitious.

The LHF first: Alan, being an optimist, would like OWL definitions  
of the entities in reality denoted by UniProt/SwissProt entries with  
names like OPSD_HUMAN. Phil, Newcastle Brown bottle half-empty,  
thinks we can do no better than the circular "the class of protein  
molecules defined by the swissprot record OPSD_HUMAN". I am with Alan  
and think we can do a little better than this. There *is* an implicit  
definition in UniProt entries that can be made explicit using a  
logical language such as OWL.

Phil says "A uniprot record defines a class of proteins
extensionally". If we are using intensional/extensional in the
set-theoretic sense, I don't believe this is true. If the implicit
definition in a UniProt record were extensional, then the UniProt entry
for OPSD_HUMAN would have to list every particular spatiotemporal
instance of this protein - this would be rather a long record.
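
A minimal Python sketch of the set-theoretic distinction, under loud
assumptions: the Molecule type, the toy sequence "MNGTA" and the
"human" label are all hypothetical stand-ins, not anything taken from
UniProt. An extensional class enumerates its members; an intensional
class states a membership condition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Molecule:
    """One particular protein molecule (hypothetical toy stand-in)."""
    molecule_id: int
    organism: str
    sequence: str


# Hypothetical toy sequence, NOT the real OPSD_HUMAN sequence.
TOY_SEQUENCE = "MNGTA"

# Extensional: the class *is* the enumerated collection of its members.
extensional_class = frozenset({
    Molecule(1, "human", TOY_SEQUENCE),
    Molecule(2, "human", TOY_SEQUENCE),
})


# Intensional: the class is given by a condition any member must satisfy.
def in_intensional_class(molecule: Molecule) -> bool:
    return molecule.organism == "human" and molecule.sequence == TOY_SEQUENCE


# A molecule that nobody enumerated in advance.
new_molecule = Molecule(3, "human", TOY_SEQUENCE)
```

Here new_molecule satisfies the intensional condition but is absent
from the enumerated set, which is why an extensional record would
never be finished.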

There is an implicit intensional definition in OPSD_HUMAN: a protein
encoded by nuclear or mitochondrial DNA of a human cell that has a
linear sequence of amino acids commencing with an instance of
Methionine, followed by an instance of N, G, T, ..., E,T,S,Q,V,A,P,
and ending with an instance of Alanine.

Of course, a UniProt record tells us more than this, but for now we  
are talking of definitions.

It would be possible to make various objections here: what about
post-translational modifications? Sequence variants? These are easily
accommodated. I'm going to duck other objections pertaining to
transgenic genes and so on since there is nothing here that doesn't
crop up all over the place. You'll notice I am also glossing over the
nature of the relation that holds between the residues in the amino
acid sequence - I believe Michel Dumontier's work on formalizing
chemical structures is relevant here.

All this could be made explicit in OWL. I'm neutral as to how useful  
it would be to do this for all of UniProt, and as to the  
implementation details. But this would certainly seem to satisfy  
Alan's requirement for formalizing some aspect of UniProt entries.  
And it can be done in OWL-DL with no need for SWRL. I initially
thought this was what Alan was requesting.

But in the snippet above, Alan says:

"I'm not advocating that we build definitions around protein  
sequences" ... "If we have class names and definitions even for broad  
classes of proteins, then we can start to build new definitions by  
subclassing them, for instance into specific classes of sequence and  
post-translational variants"

I read "broad classes of proteins" as covering classes more inclusive
than the one denoted by OPSD_HUMAN in my interpretation: for example,
all human opsin proteins, all vertebrate opsins, ...

This is what I class as wildly ambitious. Besides, this seems outside  
the scope of what is typically in a UniProt record.

To summarise: the hypothesis is that any UniProt entry can be  
formally defined using OWL-DL in an automated fashion in a way that  
is reasonably concordant with the intent of UniProt. There may well  
be counter-examples that disprove this.

It doesn't follow from this that UniProt should necessarily serve OWL  
of this form in response to any kind of identifier resolution  
request. There may well be massive advantages to the current record- 
oriented RDF that is returned. I have no strong opinions here, and it  
seems both should be able to coexist.
Received on Tuesday, 17 July 2007 03:25:42 UTC