RE: blog: semantic dissonance in uniprot from Michel_Dumontier on 2009-03-21 (public-semweb-lifesci@w3.org from March 2009)

From: Michel_Dumontier <Michel_Dumontier@carleton.ca>
Date: Sat, 21 Mar 2009 13:49:52 -0400
To: "W3C HCLSIG hcls" <public-semweb-lifesci@w3.org>
Message-ID: <AB349814F1ECB143A5D4CD29C7A6456903A3ABFE@CCSEXB10.CUNET.CARLETON.CA>
Eric and friends,

 

I'm very sympathetic to the simplifying assumption of not distinguishing between a record and the molecular entity it represents, but there are some important considerations. First, we need to be cautious when transforming recorded facts (as they appear in these database records) into class restrictions on biomolecules in logic-based (e.g. OWL) ontologies. Initially, we might say that a class of biomolecules shares a particular molecular structure (or biopolymer sequence), but assertions of role, function, post-translational modifications (PTMs), and involvement in biological processes (among others) are contextual or temporally qualified, and as such it may not be appropriate to generalize them to all instances. For example, some protein records list all of the _known_ PTMs, which is hardly a basis for concluding that all instances will have those PTMs at those positions at all (or any!) times. This is clearly a major knowledge representation challenge, and we should try different approaches to improve our representation. Class-based representations are necessary as there is a need to refer to specific real-world instances, whether they be collections of molecules in a test tube, electron micrographs that show individual macromolecular complexes or atomic force microscopes that manipulate them. In the meantime, we'll probably continue to model database records as instances of their corresponding entity.
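As a rough illustration of the caution above, here is a minimal Python sketch. It is not taken from any of the databases discussed, and the record, site names and observations are all hypothetical. It contrasts the PTM sites listed in a record with the sites seen on individual observed molecules, and only aims to show that "known for the class" and "true of every instance" are different claims.

```python
# Hypothetical data: a record listing every *known* PTM site for a protein,
# and a few individual molecules observed at particular times.
record_known_ptms = {"phospho-Y100", "phospho-S200", "acetyl-K300"}

observed_molecules = {
    "molecule_a": {"phospho-Y100"},
    "molecule_b": {"phospho-S200", "acetyl-K300"},
    "molecule_c": set(),  # unmodified at the time of observation
}


def holds_for_all(site, molecules):
    """True only if every observed molecule carries the given PTM site."""
    return all(site in sites for sites in molecules.values())


def holds_for_some(site, molecules):
    """True if at least one observed molecule carries the given PTM site."""
    return any(site in sites for sites in molecules.values())


# Sites that could safely be stated as a universal restriction on the class.
universal_sites = {
    s for s in record_known_ptms if holds_for_all(s, observed_molecules)
}
# Sites supported only as "some instance has this at some time".
existential_sites = {
    s for s in record_known_ptms if holds_for_some(s, observed_molecules)
}

print(sorted(universal_sites))
print(sorted(existential_sites))
```

In this toy data no site qualifies as universal, even though every site in the record is supported by at least one molecule. That is the gap between a recorded fact and a class-level restriction.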

 

There is no doubt that it is challenging to devise a consistent naming scheme, and nearly every member of the steering group has worked out some way to do this (e.g. [1][2]). If the sharednames group wants to recommend a consensus approach to the _syntax_ of any given name, with appropriate rationale, then it's possible that more people will use it as a guiding principle. However, attempts to _control_ the naming process will undoubtedly meet an unreceptive audience. Will a registry of names prevent people from making similar or identical (literal) names? No. Establishing a self-registry of namespaces like bio2rdf [3] or lsrn.org is a more worthy goal. I, like several others, am interested to see how the committee will "make sure that its URIs ... resolve to information that is useful". I expect that establishing utility will be challenging, particularly in the context of a term contained in an expressive ontology.
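To make the distinction between a namespace self-registry and a registry of names concrete, here is a small Python sketch. It is only an assumption about what such a registry might look like, not a description of how bio2rdf or lsrn.org actually work; the class name, namespaces and base URIs are all hypothetical.

```python
class NamespaceRegistry:
    """A toy self-registry: anyone may register a namespace, but only once."""

    def __init__(self):
        self._base_uris = {}

    def register(self, namespace, base_uri):
        """Record a namespace; refuse to silently overwrite an existing one."""
        if namespace in self._base_uris:
            raise ValueError(f"namespace already registered: {namespace}")
        self._base_uris[namespace] = base_uri

    def uri_for(self, namespace, identifier):
        """Build a URI from a registered namespace and a local identifier."""
        return self._base_uris[namespace] + identifier


registry = NamespaceRegistry()
# Hypothetical base URIs, for illustration only.
registry.register("uniprot", "http://example.org/uniprot:")
registry.register("geneid", "http://example.org/geneid:")

print(registry.uri_for("uniprot", "P12345"))
```

A registry like this keeps namespaces unique and the URI syntax predictable, but it does nothing to stop two parties from coining similar or identical names in different namespaces.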

 

I applaud efforts to publish data in an open and linked manner. Somewhat disconcertingly, though, I'm (controversially) sure we'll find ourselves in the awkward position of having too much meaningless linked data, which we'll have to filter into the useful, the less useful, the identical, the useless or, worse, the misleading or erroneous. It's not hard to imagine this happening. Applying the correct semantics to create meaningful relations is of fundamental importance for answering questions about our collective knowledge. Linking concepts or data with clearly defined semantic links (e.g. SKOS, RO, OWL) is indeed useful, and its utility goes beyond Linked Data. Eric's appeal, that we should be careful to (meaningfully) link to third party über-URIs, resonates for the same reason that you may want to say something about an entity that other people won't necessarily agree with. The truth is that we all have different perceptions of reality, and our knowledge about the world is in constant flux. We should be able to express our knowledge to our degree of satisfaction. In the competitive, distributed environment that is the web, people will choose the terms and ontologies that best agree with their perception and with their requirements. As a nascent scientific community, so early in the game of designing accurate, expressive and meaningful ontologies, we should encourage new ideas and ensure competition among them.

 

-=Michel=-

 

http://dumontierlab.com

 

 

[1] http://bio2rdf.wiki.sourceforge.net/Banff+Manifesto

[2] http://sw.neurocommons.org/2007/uri-explanation.html 

[3] 

 

 

 

From: public-semweb-lifesci-request@w3.org [mailto:public-semweb-lifesci-request@w3.org] On Behalf Of eric neumann
Sent: Saturday, March 21, 2009 12:01 AM
To: marshall@science.uva.nl
Cc: W3C HCLSIG hcls
Subject: Re: blog: semantic dissonance in uniprot

 

Scott,

 

Funny, I was just about to send a message on a very similar issue; maybe it's what you're referring to, but let me know either way...

 

After talking with many folks in industry over the last several months, it is becoming quite clear that when dealing with a molecular reference, such as uniprot or entrez-gene, we should also be treating it as a form of "proxy of the thing" with something akin to transitivity. Why? Because they are the best reference we have to a protein entity (exemplar). No wonder real-world scientists refer to these records as "the gene" or "the protein". I for one see keeping things from becoming unnecessarily complicated as key to successfully advancing the semantic web in LS.

 

Here are some points we should consider regarding this typing issue:

1.	There is no such thing as a referenceable instance of a specific instantiated molecule ("that specific molecule"); all gene, protein, and chemical records are about the category or group of exemplar molecules: SAME molecular structure, NOT SAME atoms (so we already aren't really talking about things in the real world ;-) ); all molecular databases are based on this asserted fact.
2.	Most users of molecular information aren't ignorant about the difference between a protein and a record of a protein; it's just that they don't want to deal with all the extra CS mechanics (that prevent getting their job done). And so an instance of a protein record in a database (or a reference to it from another database) is the closest thing to saying: "here's the protein".
3.	Different records exist for the same protein, which indeed has been a historic point of complication; but this is really a social issue, not a semantic one, and the key data authorities have already for years coordinated on this point by supplying cross-references to each other. Occasionally, when we realize a gene was incorrectly identified, the record is merged or deprecated, and one group usually fixes things before the other. It would appear that it's beneficial not to coerce the different authorities pre-emptively to point to any other third party über-gene URI; each should correct when it has sufficient evidence, and share that change so that references from each quarter can be corrected. This is also sound from a progression-of-science perspective; the different agencies through their interactions will eventually find the "better truth".
4.	If one creates a new node or URI for "the gene ABL-Human", and links all other data records to it, it is by any definition 'also' a digital record (even without a URI); hence if one follows this logic to its formal conclusion, we have a system of references about records, that are about records, that are about records... and never quite get to the true instance of a gene. Voila! we've re-created Russell's Paradox using gene records!
5.	The body that decides and creates "a higher form of protein record" that others must reference is going to be regarded as suspect by all other authorities; if it is done by committee, I fear it will add a lot more unnecessary confusion; does it get annotated? By whom? How is this regulated by the community's experts and authorities? Do we allow open season for all annotators, but keep everything sequestered in local SW zones? I think this opens an interesting but entangled can of worms...

I believe it's therefore best not to define protein record types separate from proteins, at least for general consumption by informaticists. Some day this may indeed be easy and useful, but I don't see it being the right thing to invest in right now...

 

So what should we do for now? When should we think about proteins and when about protein records? Well, doesn't that really depend on whether you are a data source curator like SIB or a consumer of molecular information? Using RDF typing, both can be asserted at the same time, as long as we don't build in any contradictions. EMBL, SIB and NCBI can treat all such records as special "curated record classes", but expose them outwardly as "Gene" or "Protein", or "micro RNA".
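The dual typing described above can be sketched in a few lines of Python, with triples held as plain tuples rather than in any particular RDF library. Only the rdf:type URI is standard; the entry URI and the two class URIs are hypothetical placeholders, and the disjointness check is just one assumed way of spotting a "built-in contradiction".

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical URIs, for illustration only.
ENTRY = "http://example.org/entry/ABL_HUMAN"
CURATED_RECORD = "http://example.org/vocab#CuratedRecord"
PROTEIN = "http://example.org/vocab#Protein"

# The same resource carries both types at once.
triples = {
    (ENTRY, RDF_TYPE, CURATED_RECORD),  # the curator's view
    (ENTRY, RDF_TYPE, PROTEIN),         # the consumer's view
}

# Pairs of classes declared disjoint; empty here, so there is no contradiction.
disjoint_pairs = set()


def types_of(subject, graph):
    """All classes asserted for a subject via rdf:type."""
    return {o for (s, p, o) in graph if s == subject and p == RDF_TYPE}


def instances_of(rdf_class, graph):
    """All subjects asserted to be members of a class."""
    return {s for (s, p, o) in graph if p == RDF_TYPE and o == rdf_class}


def contradictions(graph, disjoint):
    """Subjects typed with two classes that were declared disjoint."""
    subjects = {s for (s, p, o) in graph if p == RDF_TYPE}
    return {
        s for s in subjects
        for (a, b) in disjoint
        if {a, b} <= types_of(s, graph)
    }


print(sorted(types_of(ENTRY, triples)))
print(sorted(instances_of(PROTEIN, triples)))
print(contradictions(triples, disjoint_pairs))
```

A consumer asking for instances of the protein class gets the entry without ever touching the curated-record class. The two views only clash if someone declares the classes disjoint.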

 

For most of us who use such online information, this is something that really is not so complicated-- however, when writing new tools to handle new semantic complexities, one almost invariably experiences unpredicted side effects... it's the software that could become confused. I recommend we keep it simpler for now, and not add semantic features that end-users cannot immediately benefit from and that make things more complicated to use.

 

cheers,

Eric

 

On Fri, Mar 20, 2009 at 1:35 PM, M. Scott Marshall <marshall@science.uva.nl> wrote:

FYI:
http://i9606.blogspot.com/2009/02/semantic-dissonance-in-uniprot.html

I thought that the above blog entry would interest some of you (it apparently already has interested a few of you that have added comments :) ). The blog is from Benjamin Good (from Mark Wilkinson's Lab) and was referenced during a napkin discussion I had with Marco Roos and Ben about how one could best refer to a protein in text-mined triples. One of the best options seemed to be to use a PURL that referred to a record associated with the protein. Sound familiar? Those of you who have been with us for more than a year will think so. See http://sharednames.org for an attempt to approach the issue.

-Scott
Received on Saturday, 21 March 2009 17:51:14 UTC