Re: Ambiguous names. was: Re: URL +1, LSID -1 from Eric Jain on 2007-07-20 (public-semweb-lifesci@w3.org from July 2007)

From: Eric Jain <Eric.Jain@isb-sib.ch>
Date: Fri, 20 Jul 2007 17:55:48 +0200
To: Alan Ruttenberg <alanruttenberg@gmail.com>
CC: Phillip Lord <phillip.lord@newcastle.ac.uk>, Matthias Samwald <samwald@gmx.at>, public-semweb-lifesci@w3.org
Message-ID: <46A0DB04.3020909@isb-sib.ch>

Alan Ruttenberg wrote:
> "Remember that one of the reasons this came up was the claim that the 
> Uniprot URI should be used to identify a set of real things."

OK, I think that describes my current point of view.

> I get confused when I read statements that sound like "x means the same 
> thing in all databases, except it might mean something different in a 
> database that isn't Uniprot". I'm sure this isn't what you mean. What do 
> you mean?

"x means the same thing in all databases" -> not! What UniProt would 
consider to be a "protein" likely differs a bit from what EMBL treats as a 
"protein", which in turn differs from what John Doe considers a "protein".

Since everyone seems to have their own idea of what's the best way to make 
"sets of real things", there doesn't seem to be much of a point in 
distinguishing between the sets and the "records" that describe the sets?

Of course there are often going to be strong correspondences, which is why 
mapping tools are really important, but to think that you could create the 
one true system (TM) that has the "proper" concepts that everyone should 
map to because their databases contain mere records seems like a fallacy!

> I will read "protein" as "protein class", so as not to confuse the set 
> with the individual member of the set, OK?

OK, "protein class". The individual member would be a real "protein 
molecule" that exists somewhere for real, perhaps in a test tube :-)

> When someone makes a statement, such as the ones about the BAG-1 
> isoforms I cite in another message to Phil, I don't think that we should 
> say this is an artificial set of real things.  While it may be the case 
> that there is a certain amount of ambiguity in exactly which set of 
> proteins "BAG-1 p33" identifies, we know some things that I think would 
> be profitable to be conveyed in OWL.

If someone mentions some name like BAG-1, it's not always clear what is 
meant, and in fact this may depend on the field of research of the author. 
Someone with more experience in text mining could probably comment on this.

The "namespace" for "BAG-1" here is the article (being conservative). 
Ideally you'd want to map this to something that is more widely used/known, 
such as HGNC [http://purl.uniprot.org/hgnc/HGNC:937] (specific for human 
stuff), or perhaps even UniProt [http://purl.uniprot.org/uniprot/Q99933].
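
A minimal sketch of this kind of mapping, in Python: a name is only looked 
up together with the namespace it was used in. The two URIs are the ones 
given above; the namespace key "article:example" and the names MAPPINGS and 
resolve are hypothetical, made up for illustration.

```python
# Hypothetical sketch: a name identifies something only within a namespace.
HGNC = "http://purl.uniprot.org/hgnc/"
UNIPROT = "http://purl.uniprot.org/uniprot/"

# (namespace, local name) -> more widely known identifiers.
# "article:example" is a placeholder, not a real citation.
MAPPINGS = {
    ("article:example", "BAG-1"): [HGNC + "HGNC:937", UNIPROT + "Q99933"],
}


def resolve(namespace, name):
    """Return the URIs a name maps to in the given namespace, or [] if unknown."""
    return list(MAPPINGS.get((namespace, name), []))


if __name__ == "__main__":
    print(resolve("article:example", "BAG-1"))
    # The same string in a namespace we know nothing about is left unmapped.
    print(resolve("article:other", "BAG-1"))
```

The same string in a different namespace is deliberately not resolved, 
since nothing says it means the same thing there.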

> For example:
> 
> a) There is no protein that is both a member of the set "BAG-1 p33" 
> identifies and also a member of the set "BAG-1 p29" identifies.
> 
> b) If it turns out at a later date that the properties (e.g. being able 
> to inhibit apoptosis) ascribed to proteins in the set identified by 
> "BAG-1 p33" only were true when the protein was phosphorylated, and some 
> different, conflicting properties (e.g. not being able to inhibit 
> apoptosis) became known of the unphosphorylated ones, then we would have 
> to say that our original statements about "BAG-1 p33" needed to be 
> modified to be statements about the set of proteins identified as e.g. 
> "phospho BAG-1 p33". I.e. we would name a new set of things: "phospho 
> BAG-1 p33", know it was a subset of the set of things identified as 
> "BAG-1 p33", that it was also disjoint from the set of things identified 
> by "BAG-1 p29". We would be able to answer the question: If we cause 
> "BAG-1 p33" proteins to be overexpressed, but knock out the kinase that 
> phosphorylates such proteins, do we expect (or do we have any evidence to 
> support believing) apoptosis to be inhibited?
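
The subset and disjointness statements in (b) could be sketched roughly as 
follows. This is plain Python rather than OWL, and only an illustration: the 
names SUBSET_OF, DISJOINT, ancestors and are_disjoint are hypothetical, and 
the sketch assumes that "BAG-1 p33" and "BAG-1 p29" are declared disjoint, 
so that the disjointness of the phospho subset follows from it.

```python
# Hypothetical sketch of the set relations described in (b).
# Each named set may be declared a subset of one other named set.
SUBSET_OF = {"phospho BAG-1 p33": "BAG-1 p33"}

# Pairs of named sets declared to share no members (an assumption here).
DISJOINT = {frozenset({"BAG-1 p33", "BAG-1 p29"})}


def ancestors(name):
    """Yield the set itself, then every set it is declared a subset of."""
    while name is not None:
        yield name
        name = SUBSET_OF.get(name)


def are_disjoint(a, b):
    """True if a and b, or any sets containing them, are declared disjoint."""
    return any(
        frozenset({x, y}) in DISJOINT
        for x in ancestors(a)
        for y in ancestors(b)
    )


if __name__ == "__main__":
    print(are_disjoint("phospho BAG-1 p33", "BAG-1 p29"))
    print(are_disjoint("phospho BAG-1 p33", "BAG-1 p33"))
```

Naming the new subset leaves the earlier statements about "BAG-1 p33" in 
place; only the statements that turn out to hold for the phosphorylated form 
would need to be moved to the narrower set.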

Received on Friday, 20 July 2007 15:56:13 UTC