Re: blog: semantic dissonance in uniprot from eric neumann on 2009-03-21 (public-semweb-lifesci@w3.org from March 2009)

From: eric neumann <ekneumann@gmail.com>
Date: Sat, 21 Mar 2009 00:01:14 -0400
To: marshall@science.uva.nl
Cc: W3C HCLSIG hcls <public-semweb-lifesci@w3.org>
Message-ID: <92e86c7d0903202101q2786ec1dt2e9bc6980ffe8038@mail.gmail.com>
Scott,
Funny, I was just about to send a message on a very similar issue; maybe
it's what you're referring to, but let me know either way...

After talking with many folks in industry over the last several months, it
is becoming quite clear that when dealing with a molecular reference, such
as uniprot or entrez-gene, we should also be treating it as a form of "proxy
of the thing" with something akin to transitivity. Why? Because they are the
best reference we have to a protein entity (exemplar). No wonder real-world
scientists refer to these records as "the gene" or "the protein". I for one
see keeping things from becoming unnecessarily complicated as key to
successfully advancing the semantic web in LS.

Here are some reasons to consider regarding this typing issue:

   1. There is no such thing as a referenceable instance of a specific
   instantiated molecule ("that specific molecule"); all gene, protein, and
   chemical records are about the category or group of exemplar molecules:
   SAME molecular structure, NOT SAME atoms (so we already aren't really
   referring to things in the real world ;-) ); all molecular databases are
   based on this asserted fact.
   2. Most users of molecular information aren't ignorant about the
   difference between a protein and a record of a protein; it's just that they
   don't want to deal with all the extra CS mechanics (that prevent getting
   their job done). And so an instance of a protein record in a database (or a
   reference to it from another database) is the closest thing to saying:
   "here's the protein".
   3. Different records exist for the same protein, which indeed has been a
   historic point of complication; but this is really a social issue, not a
   semantic one, and the key data authorities have already for years
   coordinated on this point by supplying cross-references to each
   other. Occasionally, when we realize a gene was incorrectly identified, the
   record is merged or deprecated, and one group usually fixes things before
   the other. It would appear that it's beneficial not to coerce the different
   authorities pre-emptively to point to any other third party über-gene URI;
   each should correct when it has sufficient evidence, and share that change
   so that references from each quarter can be corrected. This is also sound
   from a progression-of-science perspective; the different agencies through
   their interactions will eventually find the "better truth".
   4. If one creates a new node or URI for "the gene ABL-Human", and links
   all other data records to it, it is by any definition 'also' a digital
   record (even without a URI); hence if one follows this logic to its formal
   conclusion, we have a system of references about records, that are about
   records, that are about records... and never quite get to the true instance
   of a gene. Voila! we've re-created Russell's Paradox using gene records!
   5. The body that decides and creates "a higher form of protein record"
   that others must reference is going to be suspect to all other authorities;
   if it is done by committee, I fear it will add a lot
   more unnecessary confusion; does it get annotated? By whom? How is this
   regulated by the community's experts and authorities? Do we allow open
   season for all annotators, but keep everything sequestered in local SW
   zones? I think this opens an interesting but entangled can of worms...
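
To make point 3 a bit more concrete, here is a minimal sketch in Python of
two authorities that keep their own records, cross-reference each other, and
deprecate a record at different times. Everything in it is an assumption made
for illustration: the authority prefixes, the record identifiers, and the
"xrefs" / "replaced_by" field names do not come from any real database.

```python
# Hypothetical data: authority prefixes, record ids, and field names are
# invented for illustration and do not mirror any real database schema.
records = {
    "authA:gene1": {"xrefs": {"authB:g-100"}, "replaced_by": None},
    # authA has merged gene2 into gene1 after a misidentification.
    "authA:gene2": {"xrefs": set(), "replaced_by": "authA:gene1"},
    "authB:g-100": {"xrefs": {"authA:gene1"}, "replaced_by": None},
    # authB has not caught up yet and still points at the deprecated record.
    "authB:g-200": {"xrefs": {"authA:gene2"}, "replaced_by": None},
}


def resolve(record_id, store):
    """Follow replaced_by links until a non-deprecated record is reached."""
    seen = set()
    while store[record_id]["replaced_by"] is not None:
        if record_id in seen:
            raise ValueError("cycle in replaced_by chain at " + record_id)
        seen.add(record_id)
        record_id = store[record_id]["replaced_by"]
    return record_id


def current_xrefs(record_id, store):
    """Cross-references of a record, each mapped to its current record."""
    current = resolve(record_id, store)
    return {resolve(x, store) for x in store[current]["xrefs"]}
```

With this toy data, current_xrefs("authB:g-200", records) lands on
"authA:gene1" even though authB has not yet edited its own record, and no
third-party URI was needed to get there.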

I believe it's therefore best not to define protein record types separate
from proteins, at least for general consumption by informaticists. Some day
this may indeed be easy and useful, but I don't see it being the right thing
to invest in right now...

So what should we do for now? When should we think about proteins and when
about protein records? Well, doesn't that really depend on whether you are a
data source curator like SIB or a consumer of molecular information? Using RDF
typing, both can be asserted at the same time, as long as we don't build in
any contradictions. EMBL, SIB and NCBI can treat all such records as special
"curated record classes", but expose them outwardly as "Gene" or "Protein",
or "micro RNA".
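
As a rough sketch of what I mean by asserting both types at once, the Python
below models triples as plain tuples rather than using any RDF library. The
"ex:" class names and the record identifier are placeholders I made up for
illustration, not terms from a real vocabulary, and the disjointness check is
only one simple way a contradiction could be "built in".

```python
# Hypothetical identifiers throughout: the "ex:" names are placeholders,
# not terms from any published vocabulary.
RDF_TYPE = "rdf:type"

record = "ex:record/ABL-Human"  # placeholder URI for one protein record

# A tiny in-memory graph: a set of (subject, predicate, object) tuples.
triples = {
    (record, RDF_TYPE, "ex:CuratedProteinRecord"),  # the curator's view
    (record, RDF_TYPE, "ex:Protein"),               # the consumer's view
}

# Pairs of classes declared disjoint. Left empty here, so holding both
# types is consistent; adding a pair is how a contradiction gets built in.
disjoint_pairs = set()


def types_of(subject, graph):
    """Return every class asserted for subject via rdf:type."""
    return {o for s, p, o in graph if s == subject and p == RDF_TYPE}


def has_contradiction(subject, graph, disjoint):
    """True if subject carries both classes of some disjoint pair."""
    asserted = types_of(subject, graph)
    return any(pair <= asserted for pair in disjoint)
```

A curator's tool can ask for the "ex:CuratedProteinRecord" view and a
consumer's tool for "ex:Protein" from the same node; has_contradiction only
reports a problem once someone declares the two classes disjoint.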

For most of us who use such online information, this is something that
really is not so complicated; however, when writing new tools to handle new
semantic complexities, one almost invariably experiences unpredicted side
effects... it's the software that could become confused. I recommend we keep
it simpler for now, and don't add semantic features that end-users cannot
immediately benefit from and that make things more complicated to use.

cheers,
Eric


On Fri, Mar 20, 2009 at 1:35 PM, M. Scott Marshall
<marshall@science.uva.nl> wrote:

> FYI:
> http://i9606.blogspot.com/2009/02/semantic-dissonance-in-uniprot.html
>
> I thought that the above blog entry would interest some of you (it
> apparently already has interested a few of you that have added comments :)
> ). The blog is from Benjamin Good (from Mark Wilkinson's Lab) and was
> referenced during a napkin discussion I had with Marco Roos and Ben about
> how one could best refer to a protein in text-mined triples. One of the best
> options seemed to be to use a PURL that referred to a record associated with
> the protein. Sound familiar? Those of you who have been with us for more
> than a year will think so. See http://sharednames.org for an attempt to
> approach the issue.
>
> -Scott
>
>
>
>
Received on Saturday, 21 March 2009 04:01:49 UTC