- From: Greg Tyrelle <greg@tyrelle.net>
- Date: Wed, 28 Sep 2005 14:35:57 -0400
- To: public-semweb-lifesci@w3.org
On Wed, 28 Sep 2005, Melissa Cline wrote:

> One beauty of SW is that the "over-arching layer" doesn't really need to
> know any details about the sub-domain ontologies. This is the classic

<snip>

> infrequently, with the results kept for a long time. SW helps avoid this
> through a framework flexible enough to contain combined ontologies (RDF),
> and mechanisms such as LSIDs to serve as unique global identifiers. So it
> becomes easier for us to keep the data distributed, and in small,
> lightweight ontologies that we can combine when we need.

I'm sorry, but I don't agree. Not to take you to task personally, but the description you've given is largely fantasy. To make it more accurate you would have to say "One beauty of the SW *vision* is...", because your description certainly doesn't square with the current reality.

It is true that you can *combine* two RDF graphs and end up with another graph. For example, it is rather trivial to collect pathway data from, say, Reactome in BioPAX format and combine it with UniProt data [1]. However, *merging* shared resources between the two graphs is non-trivial. The data model UniProt uses is record-based, and each protein has a unique LSID identifier, whereas the Reactome data uses bnodes for protein identifiers. The protein resources in BioPAX have cross-reference properties, but these point to literal UniProt identifiers, not LSID UniProt identifiers. That means I have no way of knowing that the resource identified by 'urn:lsid:UniProt.org:UniProt:Q96LC9' in the UniProt data is the same as the resource identified by 'UniProt_Q96LC9_BMF_protein' in the Reactome data (besides the fact that bnodes are not globally unique). See the first sketch in the P.S. below for a concrete illustration.

Sure, I could email the Reactome developers and ask them to use LSID identifiers, but that is not really a scalable solution. In my opinion the advantage of the semantic web is that it intends to be a scalable, general solution to data integration. So I'm not against LSIDs per se; they may help in some limited or constrained environments, but not in the wild. At the end of the day this merging problem will come up again and again for biological applications of the semantic web (i.e. the identifier mapping problem all over again).

But what about ontologies? In biology they seem to be having their own problems at the moment [2]. Regardless, ontologies won't help much in the Reactome/UniProt use case. All a shared ontology really gives me is a mapping rule stating that the protein classes in the two ontologies are equivalent. I can, however, go further and write mapping rules that will ultimately merge the data, for example by treating the unificationXref data (DB, ID) as inverse functional properties (see the second sketch in the P.S.). But in that case, why bother with the semantic web at all? Data warehousing, or any one of a number of technologies that use ad-hoc data mapping rules, would do fine. The answer is, again, that the semantic web (combined with AI techniques) will hopefully be a general, scalable solution to merging data, reasoning over data and learning from data.

So how do we get there? I believe the general problem I have described here is termed 'identity uncertainty'. Some interesting collections of papers dealing with this problem can be found here:

http://www.smi.ucd.ie/Dagstuhl-MLSW/proceedings/
http://blog.ilrt.org/price/archives/cat_papers.html

It remains, though, an open problem for biology and the semantic web.

Thoughts?

_greg

[1] http://www.nodalpoint.org/node/1704
[2] http://www.nature.com/nbt/journal/v23/n9/full/nbt0905-1095.html

--
Greg Tyrelle
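P.S. To make the combine-vs-merge point concrete, here is a minimal sketch in Python using rdflib. The file names are invented for the example, and nothing here is Reactome- or UniProt-specific code. Loading both files into one graph "combines" them just fine, but the LSID node and the Reactome bnode remain two unrelated resources:

    from rdflib import Graph, URIRef

    # Combining is trivial: parse both datasets into the same graph.
    # (File names are made up for the example.)
    g = Graph()
    g.parse("uniprot_Q96LC9.rdf")    # UniProt record, LSID subject
    g.parse("reactome_biopax.owl")   # Reactome pathways, bnode subjects

    lsid = URIRef("urn:lsid:UniProt.org:UniProt:Q96LC9")

    # The combined graph now holds *two* resources for the same protein:
    # the LSID URI from UniProt and an anonymous bnode from Reactome.
    # Nothing in the data itself connects them.
    print(len(list(g.triples((lsid, None, None)))))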
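And here is a sketch of the ad-hoc mapping rule I mentioned: treat each unificationXref (DB, ID) pair as if it were inverse functional, and rewrite any node that cross-references a UniProt accession into the corresponding LSID URI. The BioPAX Level 2 property names (XREF, DB, ID) are quoted from memory, so check them against the BioPAX OWL file before relying on this:

    from rdflib import Graph, Namespace, URIRef, RDF

    BP = Namespace("http://www.biopax.org/release/biopax-level2.owl#")

    def rename(g, old, new):
        # Replace every occurrence of node 'old' with 'new' in graph g.
        for s, p, o in list(g.triples((old, None, None))):
            g.remove((s, p, o))
            g.add((new, p, o))
        for s, p, o in list(g.triples((None, None, old))):
            g.remove((s, p, o))
            g.add((s, p, new))

    g = Graph()
    g.parse("reactome_biopax.owl")   # invented file name again

    # By this rule, any node whose unificationXref says
    # (DB=UniProt, ID=X) *is* urn:lsid:UniProt.org:UniProt:X.
    for prot, _, xref in list(g.triples((None, BP["XREF"], None))):
        if (xref, RDF.type, BP["unificationXref"]) not in g:
            continue
        db = g.value(xref, BP["DB"])
        acc = g.value(xref, BP["ID"])
        if db is not None and str(db).lower() == "uniprot" and acc is not None:
            rename(g, prot, URIRef("urn:lsid:UniProt.org:UniProt:%s" % acc))

Which works, but it is exactly the kind of per-source rule that data warehousing people have been writing for years, which was my point.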
Received on Wednesday, 28 September 2005 18:36:24 UTC