- From: Greg Tyrelle <greg@tyrelle.net>
- Date: Wed, 28 Sep 2005 14:35:57 -0400
- To: public-semweb-lifesci@w3.org
On Wed, 28 Sep 2005, Melissa Cline wrote:

> One beauty of SW is that the "over-arching layer" doesn't really need to
> know any details about the sub-domain ontologies. This is the classic

<snip>

> infrequently, with the results kept for a long time. SW helps avoid this
> through a framework flexible enough to contain combined ontologies (RDF),
> and mechanisms such as LSIDs to serve as unique global identifiers. So it
> becomes easier for us to keep the data distributed, and in small,
> lightweight ontologies that we can combine when we need.

I'm sorry, but I don't agree. Not to take you to task personally, but the description you've given is largely fantasy. To make it more accurate you would have to say "One beauty of the SW *vision* is...", because your description certainly doesn't square with the current reality.

It is true that you can *combine* two RDF graphs and end up with another graph. For example, it is rather trivial to collect pathway data from, say, Reactome in BioPAX format and combine it with UniProt data [1]. However, *merging* shared resources between the two graphs is non-trivial. The data model UniProt uses is record-based, and each protein has a unique LSID identifier, whereas the Reactome data uses bnodes for protein identifiers. The protein resources in BioPAX have cross-reference properties, but these point to literal UniProt identifiers, not LSID UniProt identifiers. That means I have no way of knowing that the resource identified by 'urn:lsid:UniProt.org:UniProt:Q96LC9' in the UniProt data is the same as the resource identified by 'UniProt_Q96LC9_BMF_protein' in the Reactome data (besides the fact that bnodes are not globally unique). See the first sketch in the P.S. below for a concrete illustration.

Sure, I could email the Reactome developers and ask them to use LSID identifiers, but that is not really a scalable solution. In my opinion the advantage of the semantic web is that it intends to be a scalable, general solution to data integration. So I'm not against LSIDs per se; they may help in some limited or constrained environments, but not in the wild. At the end of the day this merging problem will come up again and again for biological applications of the semantic web (i.e. the identifier mapping problem all over again).

But what about ontologies? In biology they seem to be having their own problems at the moment [2]. Regardless, ontologies won't help much in the Reactome/UniProt use case. All a shared ontology really gives me is a mapping rule stating that the protein classes in the two ontologies are equivalent. I can, however, go further and write mapping rules that will ultimately merge the data, for example by treating the unificationXref data (DB, ID) as inverse functional properties (see the second sketch in the P.S.). But in that case, why bother with the semantic web at all? Data warehousing, or any one of a number of technologies that use ad-hoc data mapping rules, would do fine. The answer is, again, that the semantic web (combined with AI techniques) will hopefully be a general, scalable solution to merging data, reasoning over data and learning from data.

So how do we get there? I believe the general problem I have described here is termed 'identity uncertainty'. Some interesting collections of papers dealing with this problem can be found here:

http://www.smi.ucd.ie/Dagstuhl-MLSW/proceedings/
http://blog.ilrt.org/price/archives/cat_papers.html

It remains, though, an open problem for biology and the semantic web.

Thoughts?

_greg

[1] http://www.nodalpoint.org/node/1704
[2] http://www.nature.com/nbt/journal/v23/n9/full/nbt0905-1095.html

--
Greg Tyrelle
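P.S. To make the combine-vs-merge point concrete, here is a minimal sketch in Python using rdflib. The file names are invented for the example, and nothing here is Reactome- or UniProt-specific code. Loading both files into one graph "combines" them just fine, but the LSID node and the Reactome bnode remain two unrelated resources:

    from rdflib import Graph, URIRef

    # Combining is trivial: parse both datasets into the same graph.
    # (File names are made up for the example.)
    g = Graph()
    g.parse("uniprot_Q96LC9.rdf")    # UniProt record, LSID subject
    g.parse("reactome_biopax.owl")   # Reactome pathways, bnode subjects

    lsid = URIRef("urn:lsid:UniProt.org:UniProt:Q96LC9")

    # The combined graph now holds *two* resources for the same protein:
    # the LSID URI from UniProt and an anonymous bnode from Reactome.
    # Nothing in the data itself connects them.
    print(len(list(g.triples((lsid, None, None)))))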
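And here is a sketch of the ad-hoc mapping rule I mentioned: treat each unificationXref (DB, ID) pair as if it were inverse functional, and rewrite any node that cross-references a UniProt accession into the corresponding LSID URI. The BioPAX Level 2 property names (XREF, DB, ID) are quoted from memory, so check them against the BioPAX OWL file before relying on this:

    from rdflib import Graph, Namespace, URIRef, RDF

    BP = Namespace("http://www.biopax.org/release/biopax-level2.owl#")

    def rename(g, old, new):
        # Replace every occurrence of node 'old' with 'new' in graph g.
        for s, p, o in list(g.triples((old, None, None))):
            g.remove((s, p, o))
            g.add((new, p, o))
        for s, p, o in list(g.triples((None, None, old))):
            g.remove((s, p, o))
            g.add((s, p, new))

    g = Graph()
    g.parse("reactome_biopax.owl")   # invented file name again

    # By this rule, any node whose unificationXref says
    # (DB=UniProt, ID=X) *is* urn:lsid:UniProt.org:UniProt:X.
    for prot, _, xref in list(g.triples((None, BP["XREF"], None))):
        if (xref, RDF.type, BP["unificationXref"]) not in g:
            continue
        db = g.value(xref, BP["DB"])
        acc = g.value(xref, BP["ID"])
        if db is not None and str(db).lower() == "uniprot" and acc is not None:
            rename(g, prot, URIRef("urn:lsid:UniProt.org:UniProt:%s" % acc))

Which works, but it is exactly the kind of per-source rule that data warehousing people have been writing for years, which was my point.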
Received on Wednesday, 28 September 2005 18:36:24 UTC