Re: RDB2RDF Usecase - Biomedical UseCase

That's good enough.  I have condensed your writeup a bit.  Please take a 
look.

Complex biological queries generally require the integration of 
information from several sources. To understand the genetic basis of 
nicotine dependence, we needed to integrate gene and pathway information 
and answer three complex biological queries using the integrated 
knowledge base. The gene information source NCBI Entrez Gene, which 
has gene-related records for ~2 million genes, needed to be integrated 
with pathway information sources such as KEGG (Kyoto Encyclopedia of 
Genes and Genomes). Comparing results across model organisms requires 
homology information, provided by NCBI HomoloGene, which contains 
homology data for several completely sequenced eukaryotic organisms.

We used an ontology-driven approach to integrate the two gene resources 
(Entrez Gene and HomoloGene) and three pathway resources (KEGG, Reactome 
and BioCyc). We created the Entrez Knowledge Model (EKoM), an information 
model in OWL for the gene resources, and integrated it with the extant 
BioPAX ontology designed for pathway resources. The integrated schema 
was populated with data from the pathway resources, which are publicly 
available in BioPAX-compatible format, and from the gene resources, for 
which a population procedure was created.

SPARQL was used to formulate queries to investigate the genetic basis of 
nicotine dependence over the integrated knowledge base:
1. Which genes participate in a large number of pathways?
2. Which genes are "hub genes" from the perspective of gene interaction?
3. Which genes are expressed in the brain, in the context of 
neurobiology of nicotine dependence and various neurotransmitters in the 
central nervous system?

We found that the queries could easily identify hub genes, i.e., those 
genes whose gene products participate in many pathways or interact with 
many other gene products.
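
To make the "hub gene" notion concrete, here is a minimal sketch in Python 
of the counting behind queries 1 and 2. It is not the SPARQL used in the 
paper. The triples, the predicate names (participatesIn, interactsWith) 
and the min_count threshold are all made up for illustration and are not 
taken from EKoM or BioPAX.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) triples. The identifiers and
# predicate names are illustrative assumptions, not EKoM or BioPAX terms.
TRIPLES = [
    ("gene:A", "participatesIn", "pathway:P1"),
    ("gene:A", "participatesIn", "pathway:P2"),
    ("gene:A", "participatesIn", "pathway:P3"),
    ("gene:B", "participatesIn", "pathway:P1"),
    ("gene:C", "participatesIn", "pathway:P2"),
    ("gene:C", "participatesIn", "pathway:P3"),
    ("gene:A", "interactsWith", "gene:B"),
]


def hub_genes(triples, predicate="participatesIn", min_count=2):
    """Rank subjects by how many distinct objects they reach via `predicate`.

    Subjects with fewer than `min_count` distinct objects are dropped.
    The result is sorted by descending count, then by subject name.
    """
    targets = defaultdict(set)
    for subject, pred, obj in triples:
        if pred == predicate:
            targets[subject].add(obj)
    counted = [(s, len(objs)) for s, objs in targets.items() if len(objs) >= min_count]
    return sorted(counted, key=lambda pair: (-pair[1], pair[0]))
```

Calling hub_genes(TRIPLES) ranks genes by the number of distinct pathways 
they participate in. Passing predicate="interactsWith" gives the 
interaction view of the same idea.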

Reference: http://dx.doi.org/10.1016/j.jbi.2008.02.006
All the best, Ashok


Satya Sahoo wrote:
> Hi Ashok,
> The NCBI Entrez documentation page, 
> http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.section.605, 
> states that it "...integrates data from a large number of sources, 
> formats, and databases...". It mentions that the “back-end” databases 
> "...might be Sybase or Microsoft SQL Server relational databases of a 
> variety of schemas or text files of various formats".
>  
> Since we used the records in NCBI Entrez Gene, which are centered 
> around a gene, I am not sure we can verify which part of the record 
> (or indeed the full record) was sourced only from an RDB. 
>  
> Cheers,
> Satya
> ----- Original Message -----
> From: ashok malhotra <ashok.malhotra@oracle.com>
> Date: Saturday, December 20, 2008 11:30 am
> Subject: Re: RDB2RDF Usecase - Biomedical UseCase
> To: Satya Sahoo <sahoo.2@wright.edu>
> Cc: "Ezzat, Ahmed" <Ahmed.Ezzat@hp.com>, "public-xg-rdb2rdf@w3.org" 
> <public-xg-rdb2rdf@w3.org>
>
> > Hi Satya:
> > This is a good usecase but please confirm that the underlying
> > databases are relational databases.
> > I know that some biomedical work uses special-purpose databases,
> > that's why I'm asking.
> > All the best, Ashok
> >
> >
> > Satya Sahoo wrote:
> > > Hi Ahmed,
> > > You have pointed to a very critical objective for the RDB2RDF
> > > process - data integration and consequently the ability to pose
> > > queries across different types of data sources.
> > >
> > > In addition to your example, the following is a write-up of our
> > > work I had presented to the XG meeting in April 2008 (also cited
> > > by Soren Auer in their Triplify work as an example of
> > > integration) that can be considered as a data integration use
> > > case in the biomedical domain to the Recommendation:
> > >
> > > Title: An ontology-driven integration of gene and biological
> > > pathway information: Application to the domain of nicotine
> > > dependence
> > > --------------------
> > >
> > > Background:
> > > Complex biological queries generally require the integration of
> > > information from several sources. For example, gene information
> > > sources, such as the NCBI Entrez Gene, which has gene-related
> > > records of ~2 million genes need to be integrated with pathway
> > > information sources, such as KEGG (Kyoto Encyclopedia of Genes
> > > and Genomes). Moreover, comparing results across model organisms
> > > requires homology information (provided for example by NCBI
> > > HomoloGene, containing homology data for several completely
> > > sequenced eukaryotic organisms).
> > >
> > > In the context of understanding the genetic basis of nicotine
> > > dependence, we integrate gene and pathway information and show
> > > how three complex biological queries can be answered by the
> > > integrated knowledge base.
> > >
> > > Method:
> > > We use an ontology-driven approach to integrate two gene
> > > resources (Entrez Gene and HomoloGene) and three pathway
> > > resources (KEGG, Reactome and BioCyc), for five organisms,
> > > including humans. We created the Entrez Knowledge Model (EKoM),
> > > an information model in OWL for the gene resources, and
> > > integrated it with the extant BioPAX ontology designed for
> > > pathway resources. The integrated schema is populated with data
> > > from the pathway resources, publicly available in
> > > BioPAX-compatible format, and gene resources for which a
> > > population procedure was created.
> > >
> > > The SPARQL query language is used to formulate queries in the
> > > context of understanding the genetic basis of nicotine dependence
> > > over the integrated knowledge base:
> > > 1. Which genes participate in a large number of pathways?
> > > 2. Identify "hub genes" from the perspective of gene interaction?
> > > 3. Which genes are expressed in the brain, in the context of
> > > neurobiology of nicotine dependence and various neurotransmitters
> > > in the central nervous system?
> > >
> > > Implementation:
> > > The total number of RDF triples generated in the knowledge base
> > > is about 1.5 million, with the 334,438 triples from Entrez Gene;
> > > 695,301 triples from Reactome; 175,160 triples from BioCyc and
> > > 352,793 triples from KEGG. The Oracle 10g database management
> > > system was used to store and query the triples.
> > >
> > > Results
> > > The queries could easily identify hub genes, i.e., those genes
> > > whose gene products participate in many pathways or interact
> > > with many other gene products.
> > >
> > > Reference: http://dx.doi.org/10.1016/j.jbi.2008.02.006
> > >
> > > Cheers,
> > > Satya
> > >
> > > http://knoesis.wright.edu/researchers/satya
> > >
> > > ----- Original Message -----
> > > From: "Ezzat, Ahmed" <Ahmed.Ezzat@hp.com>
> > > Date: Friday, December 19, 2008 5:16 pm
> > > Subject: Re: RDB2RDF Usecase
> > > To: "public-xg-rdb2rdf@w3.org" <public-xg-rdb2rdf@w3.org>
> > >
> > > >
> > > > Hello,
> > > >
> > > > One observation I have is we need to be clearer on Rdb2Rdf for
> > > > solving the silo pain.  Rdb2Rdf is a necessary but not
> > > > sufficient technology to integrate silos: you need to reconcile
> > > > the results from each data source before the data is useful
> > > > enough to apply, for example, SPARQL, and that reconciliation
> > > > is outside the Rdb2Rdf framework.
> > > >
> > > > Regarding user scenario, I see a lot of value in the Enterprise
> > > > Information Management (EIM) area where you integrate the data
> > > > warehouse with content in the enterprise (i.e., not using the
> > > > current technology of NLP + converting to XML then shredding
> > > > elements into the data warehouse database columns) to be able
> > > > to return more actionable information.  For example, a query to
> > > > a data warehouse today can be: "tell me all companies that
> > > > bought $1M of equipment last month" <- easy one.  Now with
> > > > integration of structured and unstructured data in the
> > > > enterprise you can ask "tell me all companies that bought $1M
> > > > of equipment and had complaints?"  The point here is that
> > > > customer complaints are typically in email content and the list
> > > > of companies who bought is in the data warehouse.  By being
> > > > able to integrate the results of search and SQL at a high level
> > > > as RDF sub-graphs, etc., you can answer the 2nd question
> > > > transparently w/o manual work.
> > > >
> > > > In summary, I suggest positioning Rdb2Rdf as a core technology
> > > > that would help in solving higher level problems like some of
> > > > the examples in this email thread.
> > > > Regards,
> > > >
> > > > Ahmed
> > > >
> > > >
> > > >
> > > > Ahmed K. Ezzat, Ph.D.
> > > > HP Fellow, Business Intelligence Software Division
> > > > Hewlett-Packard Corporation
> > > > 11000 Wolf Road, Bldg 42 Upper, MS 4502, Cupertino, CA 95014-0691
> > > > Office: Email: Ahmed.Ezzat@hp.com  Off: 408-447-6380
> > > > Fax: 1408796-5427  Cell: 408-504-2603
> > > > Personal: Email: AhmedEzzat@aol.com  Tel: 408-253-5062
> > > > Fax: 408-253-6271
> > > >
> > > >
> > > >
> > > 

Received on Sunday, 21 December 2008 00:36:25 UTC