- From: eric neumann <ekneumann@gmail.com>
- Date: Sat, 21 Mar 2009 00:01:14 -0400
- To: marshall@science.uva.nl
- Cc: W3C HCLSIG hcls <public-semweb-lifesci@w3.org>
- Message-ID: <92e86c7d0903202101q2786ec1dt2e9bc6980ffe8038@mail.gmail.com>
Scott,

Funny, I was just about to send a message on a very similar issue; maybe it's what you're referring to, but let me know either way...

After talking with many folks in industry over the last several months, it is becoming quite clear that when dealing with a molecular reference, such as uniprot or entrez-gene, we should also be treating it as a form of "proxy of the thing" with something akin to transitivity. Why? Because these records are the best reference we have to a protein entity (exemplar). No wonder real-world scientists refer to these records as "the gene" or "the protein". I for one see keeping things from becoming unnecessarily complicated as key to successfully advancing the semantic web in LS.

Here are some points we should consider regarding this typing issue:

1. There is no such thing as a referenceable instance of a specific instantiated molecule ("that specific molecule"); all gene, protein, and chemical records are about the category or group of exemplar molecules: SAME molecular structure, NOT SAME atoms (so we already aren't really referring to things in the real world ;-) ); all molecular databases are based on this asserted fact.

2. Most users of molecular information aren't ignorant of the difference between a protein and a record of a protein; it's just that they don't want to deal with all the extra CS mechanics (that prevent getting their job done). And so an instance of a protein record in a database (or a reference to it from another database) is the closest thing to saying: "here's the protein".

3. Different records exist for the same protein, which indeed has been a historic point of complication; but this is really a social issue, not a semantic one, and the key data authorities have already for years coordinated on this point by supplying cross-references to each other. Occasionally, when we realize a gene was incorrectly identified, the record is merged or deprecated, and one group usually fixes things before the other.
It would appear that it's beneficial not to coerce the different authorities pre-emptively to point to any other third party über-gene URI; each should correct when it has sufficient evidence, and share that change so that references from each quarter can be corrected. This is also sound from a progression-of-science perspective; the different agencies through their interactions will eventually find the "better truth".

4. If one creates a new node or URI for "the gene ABL-Human", and links all other data records to it, it is by any definition 'also' a digital record (even without a URI); hence if one follows this logic to its formal conclusion, we have a system of references about records, that are about records, that are about records... and never quite get to the true instance of a gene. Voila! We've re-created Russell's Paradox using gene records!

5. The body that decides and creates "a higher form of protein record" that others must reference is going to be suspect to all other authorities; if it is done by committee, I fear it will add a lot more unnecessary confusion. Does it get annotated? By whom? How is this regulated by the community's experts and authorities? Do we allow open season for all annotators, but keep everything sequestered in local SW zones? I think this opens an interesting but entangled can of worms...

I believe it's therefore best not to define protein record types separate from proteins, at least for general consumption by informaticists. Some day this may indeed be easy and useful, but I don't see it being the right thing to invest in right now...

So what should we do for now? When should we think about proteins and when about protein records? Well, doesn't that really depend on whether you are a data source curator like SIB or a consumer of molecular information? Using RDF typing, both can be asserted at the same time, as long as we don't build in any contradictions.
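As a rough illustration of asserting both types at once, here is a minimal sketch in plain Python. It makes some loud assumptions: triples are modelled as simple tuples rather than with a real RDF library, every "ex:" name is hypothetical and does not come from UniProt, Entrez Gene, or any published vocabulary, and a "contradiction" is taken to mean only that one subject carries two classes that someone has explicitly declared disjoint.

```python
# Triples as plain (subject, predicate, object) tuples.
# All "ex:" identifiers below are hypothetical, for illustration only.
RDF_TYPE = "rdf:type"

triples = {
    ("ex:ABL_HUMAN", RDF_TYPE, "ex:CuratedProteinRecord"),  # curator's view
    ("ex:ABL_HUMAN", RDF_TYPE, "ex:Protein"),               # consumer's view
}

# Pairs of classes declared mutually exclusive (an assumption of this sketch).
# Empty here: nothing states that a curated record cannot also be typed Protein.
disjoint = set()


def types_of(subject, graph):
    """Return every class asserted for `subject` via rdf:type."""
    return {o for s, p, o in graph if s == subject and p == RDF_TYPE}


def contradictions(graph, disjoint_pairs):
    """Return (subject, class_a, class_b) for each subject typed with two
    classes that appear together in `disjoint_pairs` (a set of 2-element
    frozensets)."""
    found = []
    subjects = {s for s, p, _ in graph if p == RDF_TYPE}
    for subject in subjects:
        types = types_of(subject, graph)
        for pair in disjoint_pairs:
            a, b = sorted(pair)
            if a in types and b in types:
                found.append((subject, a, b))
    found.sort()  # deterministic order for reporting
    return found


if __name__ == "__main__":
    print(types_of("ex:ABL_HUMAN", triples))
    print(contradictions(triples, disjoint))
```

With the empty disjointness set, both types coexist and no contradiction is reported. A conflict appears only if someone adds an axiom such as `frozenset({"ex:CuratedProteinRecord", "ex:Protein"})` to the set, which is exactly the kind of built-in contradiction to avoid.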
EMBL, SIB and NCBI can treat all such records as special "curated record classes", but expose them outwardly as "Gene", "Protein", or "micro RNA". For most of us who use such online information, this is really not so complicated; however, when writing new tools to handle new semantic complexities, one almost invariably experiences unpredicted side effects... it's the software that could become confused.

I recommend we keep it simpler for now, and don't add semantic features that end-users cannot immediately benefit from and that make things more complicated to use.

cheers,
Eric

On Fri, Mar 20, 2009 at 1:35 PM, M. Scott Marshall <marshall@science.uva.nl> wrote:

> FYI:
> http://i9606.blogspot.com/2009/02/semantic-dissonance-in-uniprot.html
>
> I thought that the above blog entry would interest some of you (it
> apparently already has interested a few of you that have added comments :)
> ). The blog is from Benjamin Good (from Mark Wilkinson's Lab) and was
> referenced during a napkin discussion I had with Marco Roos and Ben about
> how one could best refer to a protein in text-mined triples. One of the best
> options seemed to be to use a PURL that referred to a record associated with
> the protein. Sound familiar? Those of you who have been with us for more
> than a year will think so. See http://sharednames.org for an attempt to
> approach the issue.
>
> -Scott
Received on Saturday, 21 March 2009 04:01:49 UTC