Re: blog: semantic dissonance in uniprot from John Madden on 2009-03-26 (public-semweb-lifesci@w3.org from March 2009)

From: John Madden <john.madden@duke.edu>
Date: Thu, 26 Mar 2009 12:06:40 -0700
To: Pat Hayes <phayes@ihmc.us>
Cc: Michel_Dumontier <Michel_Dumontier@carleton.ca>, W3C HCLSIG hcls <public-semweb-lifesci@w3.org>
Message-Id: <60682BC2-6A82-4D85-A5A1-CFDE1615A0C4@duke.edu>
Pat,

>
>>
>> So what would you say about an rdf:Property called, say,
>> "http://www.example.com/intuit#similarTo" that could be used simply
>> to post a record that somebody intuited a "similarity" between two
>> things?
>
> Well, what's wrong with seeAlso? That's pretty much what we intended
> it for.

Good point. Or I was also thinking about skos:related.

>
>>
>> It would have little utility for inferencing, unless one were to  
>> write a custom application (i.e. not OWL) to do so. But it might  
>> have utility as a semantic web "bookmark" for relationships that  
>> could be interesting candidates for future formalization.
>
> I like this 'bookmark' idea, yes. But let me run with it a little.
> After reading some of these posts, the issue seems to be a desire to
> have a topic-specific cross-referencing link. There are all these
> various lists and tables and KBs about proteins, and it's important
> that they cross-refer very richly and densely, but some of them are
> just text and some are machine-readable, and it's not at all clear
> that they are based on the same underlying conceptual model of what
> a protein really is, and certainly not the same formal ontology of
> proteins; so asserting sameAs is dangerously strong, whereas seeAlso
> is just, well, see also, not saying that it's the same _protein_ being
> talked about. Hence, I think, the call for this in-between kind of
> link.

Yes, I think you've nailed it. And the main use case I can think of  
for this is supporting web communities that are in the business of  
evolving models. For example, the SWAN-SIOC project in HCLS, where the  
idea is to provide an environment where a community can discuss/argue  
about/refine hypotheses. People seem to want very badly to be able to  
make "murky" and tentative statements, that they would like to be able  
to take back at a later point (which is sort of what a hypothesis is).

You could imagine working on some kind of quasi-logical platform that  
supported things like fuzziness and non-monotonicity and so forth, but  
maybe that's not necessary. A lot of times, people just want to create  
such "links" purely for human consumption.


> So, here's how I'd do this. Introduce a property linking a protein
> to something (which might be anything from a piece of text to a
> protein) called sameProteinAs. It's reflexive and transitive but
> might not be symmetric (though it probably is when the value is  
> itself a protein). It is NOT substitutive. It means, roughly, that  
> its value either is, or has as its main topic, the same protein as  
> the argument. It is a mixture of sameAs restricted to proteins and  
> seeAlso restricted to cases where the topic is a single protein.

The "something" to which you link could even just be a blank node.
Basically, if I understand you correctly, it's just a hypothetical
tertium quid, that you might later abandon or declare to be devoid of  
any useful meaning. Or perhaps better, it's a collection that collects  
things that somebody thought were "similar" to each other. So if it's  
a class or set, it's a set whose intension is defined by some human  
opinion, not a class that makes any claim on being like a natural kind.
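
To make the proposed semantics a little more concrete, here is a minimal sketch in Python. It assumes a plain set of directed sameProteinAs links; the "ex:" identifiers and the facts dictionary are invented for illustration and do not come from UniProt. It computes the reflexive, transitive closure without assuming symmetry, and it never copies statements from one node to another, which is one way to read "NOT substitutive".

```python
from collections import defaultdict


def reflexive_transitive_closure(links):
    """Return every (a, b) pair reachable through the directed links,
    plus (a, a) for each node. Symmetry is deliberately not assumed."""
    nodes = {n for pair in links for n in pair}
    successors = defaultdict(set)
    for a, b in links:
        successors[a].add(b)

    closure = set()
    for start in nodes:
        seen = {start}  # reflexive: every node reaches itself
        stack = [start]
        while stack:
            current = stack.pop()
            for nxt in successors[current]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        closure.update((start, target) for target in seen)
    return closure


# Hypothetical links: a protein, another protein, and a text about it.
same_protein_as = {
    ("ex:proteinA", "ex:proteinB"),
    ("ex:proteinB", "ex:reviewArticle"),
}
closure = reflexive_transitive_closure(same_protein_as)

# Statements made about individual nodes. Because the link is not
# substitutive, nothing here is propagated along the closure.
facts = {"ex:proteinA": {"locatedIn": "nucleus"}}
```

The closure links ex:proteinA to ex:reviewArticle but not the other way round, and ex:proteinB inherits none of the statements made about ex:proteinA.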

> This is an informal semantics, of course, but it might be enough for  
> the case in hand, and it's easy to see how external machinery could
> utilize it, e.g. you could set a scraper loose on text looking for  
> protein names. In rare cases, you might have a very guarded rule to  
> the effect that sameProteinAs implies sameAs, for limited use.

Again, I like this!! I think it's possible that some small group of  
like-minded people might, in limited circumstances, be able to agree  
on what similarity meant to them, and if they could, then they could  
presumably generate some inferences from that, which were valuable to  
them.
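
A "very guarded rule" of that kind might look something like the following Python sketch. Everything in it is an assumption made for illustration: the function name, the "ex:" identifiers, and the idea that the group records its agreement as an explicit set of pairs. A link is promoted to sameAs only when both ends are known proteins and the group has signed off on that particular pair.

```python
def guarded_same_as(links, protein_ids, agreed_by_group):
    """Promote a sameProteinAs link to sameAs only when both ends are
    known proteins AND the group explicitly agreed on that pair."""
    promoted = set()
    for a, b in links:
        both_proteins = a in protein_ids and b in protein_ids
        if both_proteins and frozenset((a, b)) in agreed_by_group:
            promoted.add((a, b))
            promoted.add((b, a))  # sameAs is symmetric
    return promoted


# Hypothetical sameProteinAs links, including one that points at a text.
links = {
    ("ex:p1", "ex:p2"),
    ("ex:p1", "ex:abstract7"),
    ("ex:p2", "ex:p3"),
}
protein_ids = {"ex:p1", "ex:p2", "ex:p3"}
agreed_by_group = {frozenset(("ex:p1", "ex:p2"))}

same_as = guarded_same_as(links, protein_ids, agreed_by_group)
```

Only the ex:p1/ex:p2 pair is promoted; the link to the abstract and the pair nobody agreed on stay as plain sameProteinAs.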

>
> This idea is quite general-purpose, and what would be nice, but  
> might not be OWL-kosher, would be to have a connection between this  
> property and the class of actual Proteins in a chemistry ontology,  
> along the lines of 'this linking property is relevant to things in  
> this class', because that could also be used to connect, say,  
> sameReagentAs to Reagents, and so on. But maybe this is getting too
> cute, as proteins do seem to be a unique case.
>
> Does this make sense?
>

It makes sense to me! I agree that it's not OWL. But it seems to serve  
a need.


> Pat
>
>>
>> John
>>
>>
>>
>> On Mar 26, 2009, at 8:42 AM, Pat Hayes wrote:
>>
>>>
>>> On Mar 26, 2009, at 8:28 AM, Michel_Dumontier wrote:
>>>
>>>> Pursuant to my email, and in light of several other comments, if our
>>>> goal is to now rectify what Uniprot:Protein _actually_ means in our
>>>> domain, and how it can be semantically mapped to other bio-ontologies,
>>>> then I might also suggest that instances of Uniprot:Protein are
>>>> aggregates of proteins (err... :ProteinAggregate anyone?), possibly
>>>> separated by both space and time, having a similar (base sequence +
>>>> mutations / ptms) composition, sharing certain characteristics (e.g.
>>>> functionality, domains) and observed to participate in biological
>>>> processes. Clearly not a type of protein of the single molecule form,
>>>> but again, certainly not a Record.
>>>
>>> Indeed. If I might make a suggestion, rather than talking about  
>>> 'aggregates' (which sounds disturbingly, er, philosophical), why  
>>> not just say that the entity being identified is a _substance_.  
>>> Substances are 'kinds of stuff' that include mixtures (e.g. concrete
>>> is a kind of stuff comprising a mix of sand, crushed rock, cement  
>>> and water in several possible proportions) but also 'pure' stuffs  
>>> such as water. Note the distinction between a substance and a  
>>> piece of the substance (concrete, the building material, vs. this
>>> or that lump of concrete) or a mereological sum (your 'aggregate',  
>>> I think) of such pieces (all the concrete in America). The utility  
>>> of this is that it eliminates the discussions about molecules,
>>> which I think are getting in the way of clarity here. Regarding
>>> sameAs, being the same substance is a very strict kind of sameAs,  
>>> of course, but it really does only refer to substances, which is a  
>>> step in the right direction. Each protein is a substance. It might  
>>> turn out that one protein is a mixture of others, for example:  
>>> this is fine, nothing breaks, as long as nobody says the mixture  
>>> is sameAs one of its components. And now one can have notions such  
>>> as 'purified form of' or 'isotopic version of' between substances,  
>>> which might help to make all these distinctions that you chemists  
>>> need to be concerned with.
>>>
>>> Distinctions like object/substance/piece/mixture were worked out  
>>> by ontologists over 20 years ago, by the way. None of this is  
>>> rocket science.
>>>
>>> Pat
>>>
>>>
>>>>
>>>> -=Michel=-
>>>>
>>>>
>>>>
>>>>>
>>>>> If however, what we've been talking about is that identifiers like
>>>>> 	http://purl.uniprot.org/uniprot/Q16665
>>>>>
>>>>> are actually database records, and not molecular entities, then we can
>>>>> settle this quickly:
>>>>>
>>>>> Uniprot RDF file: http://www.uniprot.org/uniprot/Q16665.rdf
>>>>> (is this what people were referring to as a Record???)
>>>>>
>>>>> Contains:
>>>>>
>>>>> <rdf:Description rdf:about="http://purl.uniprot.org/uniprot/Q16665">
>>>>> <rdf:type rdf:resource="http://purl.uniprot.org/core/Protein" />
>>>>>
>>>>>
>>>>> It's clear that the entity denoted by :Q16665 is rdf:type :Protein and
>>>>> is the subject of statements that are biological in nature such as being
>>>>> located in sub-cellular compartments or being involved in biochemical
>>>>> reactions. It is clearly not a Record. This is generally the case for
>>>>> nearly all entries in biomolecular databases.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> -=Michel=-
>>>>>
>>>>> Anxiously waiting to see if this clears up things or generates
>>>>> controversy ... it's hard to predict!
>>>>>
>>>>>
>>>>>
>>>>>> If nobody ever wants to use the same property to talk about the
>>>>>> database record as was used to talk about the molecule, and nobody
>>>>>> ever makes an assertion that implies that the class of database
>>>>>> records is disjoint from the class of molecules, then I don't see any
>>>>>> harm in using the same URI to ambiguously denote both.  But if one is
>>>>>> trying to design data to be reusable by others in unforeseen ways,
>>>>>> there clearly *is* a risk that someone will want to make such
>>>>>> assertions in conjunction with the data, and if that happens there is
>>>>>> a clear harm.  This risk is easy to avoid by using separate URIs.
>>>>>>
>>>>>> There *are* trade-offs.  Minting two URIs instead of one *does* add
>>>>>> some complexity, though as I pointed out that additional complexity
>>>>>> can be mitigated to the point that it is a *very* low cost.  Still,
>>>>>> different people will weigh these trade-offs differently, and what's
>>>>>> best for one situation may not be best for another, as I indicated in
>>>>>> my original post.
>>>>>>
>>>>>> Furthermore, even if one does use the same URI to ambiguously denote
>>>>>> both a database record and a molecule, that is not the end of the
>>>>>> world either.  It is possible (though more difficult) to later
>>>>>> separate out and relate the different senses of an ambiguous URI, as
>>>>>> I have described:
>>>>>> http://dbooth.org/2007/splitting/
>>>>>> Ambiguity is inescapable, and ambiguity between a thing and a page
>>>>>> that describes that thing is not fundamentally different from other
>>>>>> kinds of ambiguity (except perhaps that we are aware of it in advance
>>>>>> and it can be easily avoided), as explained here:
>>>>>> http://dbooth.org/2007/splitting/#httpRange-14
>>>>>>
>>>>>> Finally, although it is flattering that you have named this
>>>>>> suggestion after me, I cannot take credit.  As I pointed out in my
>>>>>> original post, the suggestion to differentiate between a molecule and
>>>>>> the database record that describes that molecule originates with the
>>>>>> Architecture of the World Wide Web:
>>>>>> http://www.w3.org/TR/webarch/#URI-collision
>>>>>> and best practices for implementing this distinction are described in
>>>>>> Cool URIs for the Semantic Web:
>>>>>> http://www.w3.org/TR/cooluris
>>>>>> David Booth
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>> ------------------------------------------------------------
>>> IHMC                                     (850)434 8903 or (650)494 3973
>>> 40 South Alcaniz St.           (850)202 4416   office
>>> Pensacola                            (850)202 4440   fax
>>> FL 32502                              (850)291 0667   mobile
>>> phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>
Received on Thursday, 26 March 2009 19:21:59 UTC