FW: [BioPAX-discuss] Data, Data integration, tools,& identifying entities as the same

From: Eric Neumann <ENeumann@BeyondGenomics.com> · Date: Mon, 3 Nov 2003 22:31:31 -0500

Thought this was worth cross-posting, since an RDF model will have to
face the same issues Joanne has raised... 

Question: Can RDF + LSID help solve this data merger problem?

Eric

-----Original Message-----
From: Joanne Luciano [mailto:jluciano@predmed.com] 
Sent: Monday, November 03, 2003 7:11 PM
To: biopax-discuss@biopax.org; biopax-eg@biopax.org
Subject: [BioPAX-discuss] Data, Data integration, tools,& identifying
entities as the same

Hi Community,

I've been working on creating examples in BioPAX. In the process of
creating a simple example of how one might annotate an existing piece of
a pathway with new information, I ran into a problem: I didn't have a
way to represent that two entities in different files (databases)
referred to the same thing.  I provide an example below, but want to
mention that it raises other important questions we also need to
address:

Should we provide tools for data integration?  If so, when? What
specific functions would these tools perform? Does anything in the
ontology need to be altered to better facilitate data integration?

Here's an example:

Consider the first step in glycolysis:
conversion of glucose-6-phosphate to fructose-6-phosphate, catalyzed by
the enzyme phosphoglucose isomerase.
(See:
<http://biocyc.org:1555/ECOLI/new-image?type=PATHWAY&object=GLYCOLYSIS&detail-levek=3>)

Now let's assume that the conversion reaction of glucose-6-phosphate to
fructose-6-phosphate is in one database (in BioPAX format) and the
enzyme that catalyzes that reaction, namely, phosphoglucose isomerase,
is in another database (in BioPAX format).  How do we merge the two? How
do we identify in the data that we are referring to the same glucoses?

I have some OWL code if anyone is interested, but because each is
entered as a separate instance, it has a unique identifier in each
database. Because the identifiers are unique, there's no way to tell
they are referring to the same pathway.
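
To make the problem concrete, here is a minimal Python sketch. The
record layout, the local identifiers, and the `xref` field are all
hypothetical, invented for illustration; they are not BioPAX syntax.
The sketch only shows that comparing database-local identifiers fails,
while comparing some shared external reference could succeed if both
sources happened to carry one.

```python
# Two hypothetical exports, each keyed by an identifier that is only
# meaningful inside its own database.
db_a = {
    "a:mol-17": {"name": "glucose-6-phosphate", "xref": "EXAMPLE:0001"},
}
db_b = {
    "b:cpd-902": {"name": "D-glucose 6-phosphate", "xref": "EXAMPLE:0001"},
}


def same_by_local_id(id_a, id_b):
    """Compare the database-local identifiers directly."""
    return id_a == id_b


def same_by_xref(rec_a, rec_b):
    """Compare a shared external reference, if both records carry one."""
    xref_a = rec_a.get("xref")
    return xref_a is not None and xref_a == rec_b.get("xref")


local_match = same_by_local_id("a:mol-17", "b:cpd-902")
xref_match = same_by_xref(db_a["a:mol-17"], db_b["b:cpd-902"])
print(local_match, xref_match)
```

The local identifiers never match, so any merge has to rely on
something both databases agree on.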

One possible answer may be to implement a Life Sciences Identifier.
However, it's not clear yet whether this is a solution.  The Life
Sciences Identifier (LSID) is an I3C Uniform Resource Name (URN)
specification in progress. You can read more about the specification at
<http://www.omg.org/cgi-bin/doc?lifesci/03-05-01>. Conceptually, LSID is
an approach to naming and identifying data resources stored in multiple,
distributed data stores in a manner that overcomes the limitations of
the naming schemes that are in use today. Martin Senger at EBI has a
PowerPoint presentation that highlights the key concepts:
<http://industry.ebi.ac.uk/~senger/talks/Life_Sciences_Identifiers.ppt>
An example application that IBM put together is called LSID Launchpad:
<http://www-124.ibm.com/developerworks/opensource/lsid/>
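
For readers who have not seen one, an LSID is a colon-separated URN of
the form urn:lsid:<authority>:<namespace>:<object>, with an optional
trailing revision. The Python sketch below splits such a string into
its parts. The `parse_lsid` helper and the example identifier are
assumptions made for illustration; this is not code from the
specification or from the IBM toolkit, and it does no validation
beyond checking the basic shape.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text: str) -> LSID:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into parts."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"empty LSID component in {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(authority, namespace, object_id, revision)


# Hypothetical identifier; example.org is not a real LSID authority.
parsed = parse_lsid("urn:lsid:example.org:compound:g6p:1")
print(parsed)
```

The point of the structure is that the authority and namespace travel
with the identifier, so it stays meaningful outside the database that
issued it.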

Mike Cary spelled out the various ways we would need to integrate data,
and how each poses specific challenges:
1) Two BioPAX files may be completely disjoint (actually, this one is
   easy - just merge them).
2) Two BioPAX files may contain several identical entries (also easy,
   just knock out the repeats and then merge them).
3) Two BioPAX files may contain similar entries for the same things,
   e.g. they both have a small molecule called glucose but the two
   entries are slightly different (slightly harder - gotta map them and
   then join them).
4) Two BioPAX files may contain entries for the same things, but these
   entries may be of different classes (e.g. one has an entry as a
   protein-protein interaction, the other captures the same concept as
   a reaction).  This is very hard - especially going from PPI to RxN.
5) Two BioPAX files may each contain partial (incomplete) information
   about a pathway, how do we integrate the parts to form a whole?
   This might be tricky if we needed to resolve interaction identities
   across multiple levels of abstraction.
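
Cases 1 to 3 above can be sketched in a few lines of Python. This is a
rough illustration under strong assumptions: each file is modelled as
a plain dictionary of entries, identical entries share the same key,
and for case 3 someone has already supplied a `mapping` from keys in
the second file to keys in the first. The function name, the field
names, and the rule "keep the first file's value on conflict" are all
invented for this example. Cases 4 and 5 are not addressed.

```python
def merge_entries(file_a, file_b, mapping=None):
    """Merge two {key: attributes} dictionaries.

    mapping is an optional {key_in_b: key_in_a} table naming entries
    that have been judged to describe the same thing (case 3).
    """
    mapping = mapping or {}
    merged = {key: dict(attrs) for key, attrs in file_a.items()}
    for key_b, attrs_b in file_b.items():
        key = mapping.get(key_b, key_b)
        if key not in merged:
            # Case 1: nothing overlaps, so just add the entry.
            merged[key] = dict(attrs_b)
        elif merged[key] == attrs_b:
            # Case 2: an identical repeat, so drop it.
            continue
        else:
            # Case 3: similar entries; keep A's values and add only
            # the fields that A lacks.
            for field, value in attrs_b.items():
                merged[key].setdefault(field, value)
    return merged


file_a = {
    "glucose": {"name": "glucose", "formula": "C6H12O6"},
    "atp": {"name": "ATP"},
}
file_b = {
    "atp": {"name": "ATP"},
    "glc": {"name": "D-glucose", "synonym": "dextrose"},
    "pgi": {"name": "phosphoglucose isomerase"},
}
merged = merge_entries(file_a, file_b, mapping={"glc": "glucose"})
print(merged)
```

The code is short because the hard part, deciding that "glc" and
"glucose" are the same entry, is handed in through `mapping`.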

Mike adds:  This is not an easy problem, and it may be beyond the scope
of the current project (the whole reason we want a central pathway DB is
because it's so hard to integrate data from various sources - even if
they express their data in the same language).  I don't know if the LSID
is a solution.  My gut tells me that we want BioPAX to express
interaction and pathway information in a DB-neutral way - otherwise all
we are doing is mapping (e.g. "BIND entry 4536 is the same as KEGG entry
5362", "EcoCyc entry 734 is the same as WIT 6637"), which is not what a
DEF is meant to do. The solution will require good software.
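
For comparison, the pairwise mapping approach Mike describes amounts
to collecting "same as" statements and grouping the identifiers they
connect. The Python sketch below does that with a small union-find.
The first two pairs reuse the illustrative entry numbers from the
paragraph above; the third pair is a made-up link added only to show
how groups join up. The function name and the "DB:number" string form
are assumptions for this example.

```python
def group_equivalents(pairs):
    """Group identifiers connected by pairwise 'same as' statements."""
    parent = {}

    def find(item):
        # Follow parent links to the group's representative,
        # shortening the path along the way.
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), set()).add(item)
    return sorted(sorted(group) for group in groups.values())


statements = [
    ("BIND:4536", "KEGG:5362"),
    ("EcoCyc:734", "WIT:6637"),
]
separate = group_equivalents(statements)

# Hypothetical extra statement linking the two groups.
joined = group_equivalents(statements + [("KEGG:5362", "EcoCyc:734")])
print(separate)
print(joined)
```

Every new source adds more pairs to maintain, which is the cost of
mapping rather than naming things in a database-neutral way.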

I (Joanne) don't know if the LSID is a solution either, but I think it
is meant to be DB neutral.

Also thanks to Erik Brauner for the discussion which led me to post this
to the list.

Joanne


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Eric K. Neumann PhD
    VP Strategic Informatics,
    Head of Knowledge Research

   Beyond Genomics

    40 Bear Hill Road
    Waltham, MA
     tel: 781-434-0222
     fax: 781-895-1119
     www.beyondgenomics.com