- From: Eric Neumann <ENeumann@BeyondGenomics.com>
- Date: Mon, 3 Nov 2003 22:31:31 -0500
- To: <public-semweb-lifesci@w3.org>
Thought this was worth cross-posting, since an RDF model will have to face the same issues Joanne has raised...

Question: Can RDF + LSID help solve this data merger problem?

Eric

-----Original Message-----
From: Joanne Luciano [mailto:jluciano@predmed.com]
Sent: Monday, November 03, 2003 7:11 PM
To: biopax-discuss@biopax.org; biopax-eg@biopax.org
Subject: [BioPAX-discuss] Data, Data integration, tools,& identifying entities as the same

Hi Community,

I've been working on creating examples in BioPAX. In the process of creating a simple example of how one might annotate an existing piece of a pathway with new information, I ran into a problem: I didn't have a way to represent that two entities in different files (databases) refer to the same thing.

I provide an example below, but want to mention that it also raises other important questions we need to address:

Should we provide tools for data integration? If so, when?
What specific functions would these tools perform?
Does anything in the ontology need to be altered to better facilitate data integration?

Here's an example:

Consider the first step in glycolysis: the conversion of glucose-6-phosphate to fructose-6-phosphate, catalyzed by the enzyme phosphoglucose isomerase. (See: <http://biocyc.org:1555/ECOLI/new-image?type=PATHWAY&object=GLYCOLYSIS&detail-levek=3>)

Now let's assume that the conversion reaction of glucose-6-phosphate to fructose-6-phosphate is in one database (in BioPAX format) and the enzyme that catalyzes that reaction, namely phosphoglucose isomerase, is in another database (also in BioPAX format). How do we merge the two? How do we identify in the data that we are referring to the same glucoses?

I have some OWL code if anyone is interested, but because each is entered as a separate instance, it has a unique identifier in each database. Because the identifiers are unique, there's no way to tell they are referring to the same pathway.
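To make the problem concrete, here is a minimal Python sketch. It is not BioPAX or OWL: the dictionaries, the local identifiers (`dbA:mol-17`, `dbB:cpd-0042`) and the field names are all hypothetical stand-ins for two databases that each assign their own identifiers. It only illustrates that comparing local identifiers finds no overlap, even though both files mention the same molecule.

```python
# Hypothetical records, keyed by each database's own local identifier.
reaction_db = {
    "dbA:mol-17": {"name": "glucose-6-phosphate", "role": "substrate"},
    "dbA:mol-18": {"name": "fructose-6-phosphate", "role": "product"},
}
enzyme_db = {
    "dbB:cpd-0042": {"name": "glucose-6-phosphate", "role": "enzyme substrate"},
}


def shared_ids(a, b):
    """Identifiers that appear in both files."""
    return sorted(set(a) & set(b))


def shared_names(a, b):
    """Names that appear in both files (a fragile fallback, shown for contrast)."""
    names_a = {record["name"] for record in a.values()}
    names_b = {record["name"] for record in b.values()}
    return sorted(names_a & names_b)


print(shared_ids(reaction_db, enzyme_db))    # prints []
print(shared_names(reaction_db, enzyme_db))  # prints ['glucose-6-phosphate']
```

Matching on names happens to work here, but it breaks as soon as one database uses a synonym, which is why some identifier that both sources agree on is needed.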
One possible answer may be to implement a Life Sciences Identifier. However, it's not clear yet whether this is a solution.

The Life Sciences Identifier (LSID) is an I3C Uniform Resource Name (URN) specification in progress. You can read more about the specification at <http://www.omg.org/cgi-bin/doc?lifesci/03-05-01>. Conceptually, LSID is an approach to naming and identifying data resources stored in multiple, distributed data stores in a manner that overcomes the limitations of the naming schemes that are in use today.

Martin Senger at EBI has a PowerPoint presentation that highlights the key concepts: <http://industry.ebi.ac.uk/~senger/talks/Life_Sciences_Identifiers.ppt>. An example application that IBM put together is called LSID Launchpad: <http://www-124.ibm.com/developerworks/opensource/lsid/>

Mike Cary spelled out the various ways we would need to integrate data, and how each poses specific challenges:

1) Two BioPAX files may be completely disjoint (actually, this one is easy - just merge them).

2) Two BioPAX files may contain several identical entries (also easy, just knock out the repeats and then merge them).

3) Two BioPAX files may contain similar entries for the same things, e.g. they both have a small molecule called glucose but the two entries are slightly different (slightly harder - gotta map them and then join them).

4) Two BioPAX files may contain entries for the same things, but these entries may be of different classes (e.g. one has an entry as a protein-protein interaction, the other captures the same concept as a reaction). This is very hard - especially going from PPI to RxN.

5) Two BioPAX files may each contain partial (incomplete) information about a pathway. How do we integrate the parts to form a whole? This might be tricky if we needed to resolve interaction identities across multiple levels of abstraction.
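Cases 1 to 3 in the list above can be sketched in a few lines of Python, under one large assumption: that both files already key their entries by a shared identifier. The LSID-style URNs below use a made-up authority (`example.org`) and namespace, and the record fields are hypothetical, not BioPAX properties. Cases 4 and 5 are not addressed by this sketch.

```python
def merge_records(a, b):
    """Merge two dicts that map a shared identifier to a dict of fields.

    Returns (merged, conflicts). On a conflict the value from `a` is kept
    and the disagreement is reported for a curator to resolve.
    """
    merged = {key: dict(fields) for key, fields in a.items()}
    conflicts = []
    for key, fields in b.items():
        if key not in merged:
            merged[key] = dict(fields)  # case 1: entry only in b, just add it
            continue
        target = merged[key]
        for field, value in fields.items():
            if field not in target:
                target[field] = value   # case 3: complementary information
            elif target[field] != value:
                conflicts.append((key, field, target[field], value))
            # equal values are repeats (case 2): nothing to do
    return merged, conflicts


# Hypothetical LSID-style identifiers (authority and namespace are invented).
GLC = "urn:lsid:example.org:smallmolecule:glucose"
G6P = "urn:lsid:example.org:smallmolecule:glucose-6-phosphate"
F6P = "urn:lsid:example.org:smallmolecule:fructose-6-phosphate"

file_a = {
    GLC: {"name": "glucose", "formula": "C6H12O6"},
    G6P: {"name": "glucose-6-phosphate"},
}
file_b = {
    GLC: {"name": "D-glucose", "formula": "C6H12O6", "charge": 0},
    G6P: {"name": "glucose-6-phosphate"},
    F6P: {"name": "fructose-6-phosphate"},
}

merged, conflicts = merge_records(file_a, file_b)
print(len(merged))  # prints 3
print(conflicts)
```

Here the only conflict reported is the differing `name` for glucose. The sketch deliberately does not pick a winner automatically, since deciding whether "glucose" and "D-glucose" are the same entry is exactly the mapping step described in case 3.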
Mike adds:

This is not an easy problem, and it may be beyond the scope of the current project (the whole reason we want a central pathway DB is because it's so hard to integrate data from various sources - even if they express their data in the same language). I don't know if the LSID is a solution. My gut tells me that we want BioPAX to express interaction and pathway information in a DB-neutral way - otherwise all we are doing is mapping (e.g. "BIND entry 4536 is the same as KEGG entry 5362", "EcoCyc entry 734 is the same as WIT 6637"), which is not what a DEF is meant to do. The solution will require good software.

I (Joanne) don't know if the LSID is a solution either, but I think it is meant to be DB neutral.

Also thanks to Erik Brauner for the discussion which led me to post this to the list.

Joanne

_______________________________________________
BioPAX-discuss mailing list
BioPAX-discuss@biopax.org
http://www.biopax.org/mailman/listinfo/biopax-discuss

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Eric K. Neumann PhD
VP Strategic Informatics, Head of Knowledge Research
Beyond Genomics
40 Bear Hill Road
Waltham, MA
tel: 781-434-0222 fax: 781-895-1119
www.beyondgenomics.com
Received on Monday, 3 November 2003 22:32:58 UTC