RE: blog: semantic dissonance in uniprot from Michel_Dumontier on 2009-03-26 (public-semweb-lifesci@w3.org from March 2009)

From: Michel_Dumontier <Michel_Dumontier@carleton.ca>
Date: Thu, 26 Mar 2009 11:33:44 -0400
To: "Kei Cheung" <kei.cheung@yale.edu>
Cc: "W3C HCLSIG hcls" <public-semweb-lifesci@w3.org>, "Matthias Samwald" <samwald@gmx.at>
Message-ID: <AB349814F1ECB143A5D4CD29C7A6456903AADAD4@CCSEXB10.CUNET.CARLETON.CA>
Kei,
  To make a correct semantic mapping, each of these has to be looked at
separately. 

The Wikipedia URI refers to a human readable article. If you look at the
categories it was placed into : Molecular biology | Nutrition | Proteins
| Proteomics, these relations don't reflect those of biological
entities. Even though there are statements about real proteins in the
article, we have no such identifier for them.

The DBpedia URI refers to some mishmash of stuff whose attributes don't
really conform to anything in reality - at best - it is a DBpedia entry
:-) That is a major problem for people interested in querying linked
data with meaningful semantics - if you must link to it, use
rdfs:seeAlso.
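
A minimal Python sketch of why the choice of linking property matters.
The contrast with owl:sameAs is an assumption added for illustration
(the message only recommends rdfs:seeAlso), and the ex: names, the
triples, and the function statements_about are all hypothetical.

```python
SAME_AS = "owl:sameAs"
SEE_ALSO = "rdfs:seeAlso"

# Hypothetical triples: one biological statement, one DBpedia-style
# category statement. Only the linking property differs between the sets.
BASE = [
    ("ex:Protein", "ex:participatesIn", "ex:BiologicalProcess"),
    ("dbpedia:Protein", "ex:category", "Nutrition"),
]
TRIPLES_SEE_ALSO = BASE + [("ex:Protein", SEE_ALSO, "dbpedia:Protein")]
TRIPLES_SAME_AS = BASE + [("ex:Protein", SAME_AS, "dbpedia:Protein")]


def statements_about(subject, triples):
    """Collect (predicate, object) pairs for subject, merging in the
    statements of every node reachable through owl:sameAs links."""
    seen, todo, out = set(), [subject], set()
    while todo:
        node = todo.pop()
        if node in seen:
            continue
        seen.add(node)
        for s, p, o in triples:
            if p == SAME_AS and node in (s, o):
                # Identity link: visit the node on the other side too.
                todo.append(o if s == node else s)
            elif s == node and p != SAME_AS:
                out.add((p, o))
    return out


if __name__ == "__main__":
    print(sorted(statements_about("ex:Protein", TRIPLES_SEE_ALSO)))
    print(sorted(statements_about("ex:Protein", TRIPLES_SAME_AS)))
```

With rdfs:seeAlso the category statement stays attached to the DBpedia
entry; with an identity link it would be merged into the description of
the protein class itself.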

The OBO URIs also suffer from the fact that there are numerous "Protein"
classes each having their own (human readable) description, and with
weird semantic descriptions. 
 the Sequence Type ontology says a Protein (SO:0000104) is a Biological
Region (SO:0001411).
 The PSI-MI ontology says a Protein (MI:00326) is an Interactor type
(MI:00313)
 The FMA ontology says a Protein (FMA:67257) is a Standard FMA Class
(FMA:85800)
 The Gene Ontology says a Protein (GO:0003675) is a Thing (the entity
was removed, but then later added so the identifier wouldn't be reused)
 The CCO ontology says a Protein (CCO:U0000005) is a Gene Product
(CCO:U0000003), although proteins can be synthesized, and a Biological
Entity (CCO:U0000000)
 The PRO Ontology says nothing about a Protein (PRO:000000001), but if
you look at subclasses like that of RING-box protein 2 isoform 1
phosphorylated form (PRO:000000396), we _don't know_ that it is a type
of Protein, but it is asserted that it was derived from its
unphosphorylated form (PRO:000000178), even though this is not
necessarily true (as many biotransformations could result in that
product).
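
To make the disagreement concrete, here is a minimal Python sketch that
records the parent(s) asserted for each "Protein" class exactly as
listed above and checks whether any two of them share a parent. The
dictionary layout and the function name shared_parents are assumptions
for illustration, not an export from any of these ontologies.

```python
from collections import defaultdict

# Asserted parent(s) of each ontology's "Protein" class, transcribed from
# the list above. The layout is illustrative only.
PROTEIN_PARENTS = {
    "SO:0000104": ["SO:0001411"],        # Biological Region
    "MI:00326": ["MI:00313"],            # Interactor type
    "FMA:67257": ["FMA:85800"],          # Standard FMA Class
    "GO:0003675": ["Thing"],             # top-level Thing
    "CCO:U0000005": ["CCO:U0000003",     # Gene Product
                     "CCO:U0000000"],    # Biological Entity
    "PRO:000000001": [],                 # nothing asserted
}


def shared_parents(parents_by_class):
    """Return {parent: [classes]} for parents asserted by 2+ classes."""
    index = defaultdict(list)
    for cls, parents in parents_by_class.items():
        for parent in parents:
            index[parent].append(cls)
    return {p: c for p, c in index.items() if len(c) > 1}


if __name__ == "__main__":
    print(shared_parents(PROTEIN_PARENTS))  # prints {}
```

The empty result reflects the list above: no two of these "Protein"
classes are asserted to sit under the same parent.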

So... this brief analysis is probably enough to strike fear and horror
in the hearts of those that love and cherish the logic-based foundations
of ontologies that are meant to describe the world that we live in, and
to cause concern for those who are arbitrarily throwing triples into
large hot cauldrons and hoping something magical emerges, and for those
who are pinning their hopes on a single unified ontology for all
things... good luck :-)

-=Michel=-
 
BTW I'm available as a knowledge engineering consultant!


> -----Original Message-----
> From: Kei Cheung [mailto:kei.cheung@yale.edu]
> Sent: Thursday, March 26, 2009 10:07 AM
> To: Michel_Dumontier
> Cc: W3C HCLSIG hcls; Matthias Samwald
> Subject: Re: blog: semantic dissonance in uniprot
> 
> In addition to Uniprot, in light of Matthias' earlier email, what about
> http://en.wikipedia.org/wiki/Protein, http://dbpedia.org/page/Protein,
> and the protein related ontologies listed in OBO
> (http://www.obofoundry.org/)?
> 
> -Kei
> 
> Michel_Dumontier wrote:
> > Pursuant to my email, and in light of several other comments, if our
> > goal is to now rectify what Uniprot:Protein _actually_ means in our
> > domain, and how it can be semantically mapped to other bio-ontologies,
> > then I might also suggest that instances of Uniprot:Protein are
> > aggregates of proteins (err... :ProteinAggregate anyone?), possibly
> > separated by both space and time, having a similar (base sequence +
> > mutations / ptms) composition, sharing certain characteristics (e.g.
> > functionality, domains) and observed to participate in biological
> > processes. Clearly not a type of protein of the single molecule form,
> > but again, certainly not a Record.
> >
> > -=Michel=-
> >
> >
> >
> >
> >>  If however, what we've been talking about is that identifiers like
> >>  	http://purl.uniprot.org/uniprot/Q16665
> >>
> >> are actually database records, and not molecular entities, then we can
> >> settle this quickly:
> >>
> >> Uniprot RDF file: http://www.uniprot.org/uniprot/Q16665.rdf
> >> (is this what people were referring to as a Record???)
> >>
> >> Contains:
> >>
> >> <rdf:Description rdf:about="http://purl.uniprot.org/uniprot/Q16665">
> >>  <rdf:type rdf:resource="http://purl.uniprot.org/core/Protein" />
> >>
> >>
> >> It's clear that the entity denoted by :Q16665 is rdf:type :Protein and
> >> is the subject of statements that are biological in nature such as
> >> being located in sub-cellular compartments or being involved in
> >> biochemical reactions. It is clearly not a Record. This is generally
> >> the case for nearly all entries in biomolecular databases.
> >>
> >> Cheers,
> >>
> >> -=Michel=-
> >>
> >> Anxiously waiting to see if this clears up things or generates
> >> controversy .. it's hard to predict!
> >>
> >>
> >>
> >>
> >>> If nobody ever wants to use the same property to talk about the
> >>> database record as was used to talk about the molecule, and nobody
> >>> ever makes an assertion that implies that the class of database
> >>> records is disjoint from the class of molecules, then I don't see any
> >>> harm in using the same URI to ambiguously denote both.  But if one is
> >>> trying to design data to be reusable by others in unforeseen ways,
> >>> there clearly *is* a risk that someone will want to make such
> >>> assertions in conjunction with the data, and if that happens there is
> >>> a clear harm.  This risk is easy to avoid by using separate URIs.
> >>>
> >>> There *are* trade-offs.  Minting two URIs instead of one *does* add
> >>> some complexity, though as I pointed out that additional complexity
> >>> can be mitigated to the point that it is a *very* low cost.  Still,
> >>> different people will weigh these trade-offs differently, and what's
> >>> best for one situation may not be best for another, as I indicated in
> >>> my original post.
> >>>
> >>> Furthermore, even if one does use the same URI to ambiguously denote
> >>> both a database record and a molecule, that is not the end of the
> >>> world either.  It is possible (though more difficult) to later
> >>> separate out and relate the different senses of an ambiguous URI, as
> >>> I have described:
> >>> http://dbooth.org/2007/splitting/
> >>> Ambiguity is inescapable, and ambiguity between a thing and a page
> >>> that describes that thing is not fundamentally different from other
> >>> kinds of ambiguity (except perhaps that we are aware of it in advance
> >>> and it can be easily avoided), as explained here:
> >>> http://dbooth.org/2007/splitting/#httpRange-14
> >>>
> >>> Finally, although it is flattering that you have named this
> >>> suggestion after me, I cannot take credit.  As I pointed out in my
> >>> original post, the suggestion to differentiate between a molecule and
> >>> the database record that describes that molecule originates with the
> >>> Architecture of the World Wide Web:
> >>> http://www.w3.org/TR/webarch/#URI-collision
> >>> and best practices for implementing this distinction are described in
> >>> Cool URIs for the Semantic Web:
> >>> http://www.w3.org/TR/cooluris
> >>>
> >>> David Booth
> >>>
> >>>
> >>>
> >
> >
> >
>
Received on Thursday, 26 March 2009 15:34:57 UTC