Re: RDB2RDF Usecase - Biomedical UseCase

Hi Ashok,
The NCBI Entrez documentation page, http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.section.605, states that it "...integrates data from a large number of sources, formats, and databases...". It mentions that the “back-end” databases "...might be Sybase or Microsoft SQL Server relational databases of a variety of schemas or text files of various formats". 
 
Since we used records from NCBI Entrez Gene, which are centered on a gene, I am not sure we can verify which part of a record (or indeed whether the full record) was sourced only from a relational database.
 
Cheers,
Satya
----- Original Message -----
From: ashok malhotra <ashok.malhotra@oracle.com>
Date: Saturday, December 20, 2008 11:30 am
Subject: Re: RDB2RDF Usecase - Biomedical UseCase
To: Satya Sahoo <sahoo.2@wright.edu>
Cc: "Ezzat, Ahmed" <Ahmed.Ezzat@hp.com>, "public-xg-rdb2rdf@w3.org" <public-xg-rdb2rdf@w3.org>

> Hi Satya:
> This is a good usecase but please confirm that the underlying databases
> are relational databases.
> I know that some biomedical work uses special-purpose databases, that's
> why I'm asking.
> All the best, Ashok
> 
> 
> Satya Sahoo wrote:
> > Hi Ahmed,
> > You have pointed to a very critical objective for the RDB2RDF process:
> > data integration and, consequently, the ability to pose queries across
> > different types of data sources.
> >
> > In addition to your example, the following is a write-up of our work,
> > which I presented at the XG meeting in April 2008 (also cited by Soren
> > Auer in their Triplify work as an example of integration). It can be
> > considered as a data integration use case in the biomedical domain for
> > the Recommendation:
> >
> > Title: An ontology-driven integration of gene and biological pathway
> > information: Application to the domain of nicotine dependence
> > --------------------
> >
> > Background:
> > Complex biological queries generally require the integration of
> > information from several sources. For example, gene information
> > sources, such as the NCBI Entrez Gene, which has gene-related records
> > of ~2 million genes, need to be integrated with pathway information
> > sources, such as KEGG (Kyoto Encyclopedia of Genes and Genomes).
> > Moreover, comparing results across model organisms requires homology
> > information (provided for example by NCBI HomoloGene, containing
> > homology data for several completely sequenced eukaryotic organisms).
> >
> > In the context of understanding the genetic basis of nicotine
> > dependence, we integrate gene and pathway information and show how
> > three complex biological queries can be answered by the integrated
> > knowledge base.
> >
> > Method:
> > We use an ontology-driven approach to integrate two gene resources
> > (Entrez Gene and HomoloGene) and three pathway resources (KEGG,
> > Reactome and BioCyc), for five organisms, including humans. We created
> > the Entrez Knowledge Model (EKoM), an information model in OWL for the
> > gene resources, and integrated it with the extant BioPAX ontology
> > designed for pathway resources. The integrated schema is populated
> > with data from the pathway resources, publicly available in
> > BioPAX-compatible format, and gene resources for which a population
> > procedure was created.
> >
> > The SPARQL query language is used to formulate queries in the context
> > of understanding the genetic basis of nicotine dependence over the
> > integrated knowledge base:
> > 1. Which genes participate in a large number of pathways?
> > 2. Which genes are "hub genes" from the perspective of gene interaction?
> > 3. Which genes are expressed in the brain, in the context of the
> > neurobiology of nicotine dependence and various neurotransmitters in
> > the central nervous system?
> >
> > Implementation:
> > The total number of RDF triples generated in the knowledge base is
> > about 1.5 million, with 334,438 triples from Entrez Gene; 695,301
> > triples from Reactome; 175,160 triples from BioCyc; and 352,793 triples
> > from KEGG. The Oracle 10g database management system was used to
> > store and query the triples.
> >
> > Results:
> > The queries could easily identify hub genes, i.e., those genes whose
> > gene products participate in many pathways or interact with many other
> > gene products.
> >
> > Reference: http://dx.doi.org/10.1016/j.jbi.2008.02.006
> >
> > Cheers,
> > Satya
> >
> > http://knoesis.wright.edu/researchers/satya
> >
> > ----- Original Message -----
> > From: "Ezzat, Ahmed" <Ahmed.Ezzat@hp.com>
> > Date: Friday, December 19, 2008 5:16 pm
> > Subject: Re: RDB2RDF Usecase
> > To: "public-xg-rdb2rdf@w3.org" <public-xg-rdb2rdf@w3.org>
> >
> > >
> > > Hello,
> > >
> > > One observation I have is that we need to be clearer about the role
> > > of Rdb2Rdf in solving the silo pain. Rdb2Rdf is a necessary but not
> > > sufficient technology for integrating silos: you need to reconcile
> > > the results from each data source before the data is useful enough
> > > to apply SPARQL, for example, and that reconciliation is outside
> > > the Rdb2Rdf framework.
> > >
> > > Regarding user scenarios, I see a lot of value in the Enterprise
> > > Information Management (EIM) area, where you integrate the data
> > > warehouse with content in the enterprise (i.e., not using the current
> > > technology of NLP + converting to XML and then shredding elements
> > > into the data warehouse database columns) to be able to return more
> > > actionable information. For example, a query to a data warehouse
> > > today can be: “tell me all companies that bought $1M of equipment
> > > last month” (an easy one). Now, with integration of structured and
> > > unstructured data in the enterprise, you can ask: “tell me all
> > > companies that bought $1M of equipment and had complaints”. The
> > > point here is that customer complaints are typically in email
> > > content, while the list of companies that bought is in the data
> > > warehouse. By being able to integrate the results of search and SQL
> > > at a high level as RDF sub-graphs, etc., you can answer the 2nd
> > > question transparently without manual work.
> > >
> > > In summary, I suggest positioning Rdb2Rdf as a core technology that
> > > would help in solving higher-level problems like some of the examples
> > > in this email thread.
> > > Regards,
> > >
> > > Ahmed
> > >
> > >
> > >
> > > Ahmed K. Ezzat, Ph.D.
> > > HP Fellow, Business Intelligence Software Division
> > > Hewlett-Packard Corporation
> > > 11000 Wolf Road, Bldg 42 Upper, MS 4502, Cupertino, CA 95014-0691
> > > Office: Email: Ahmed.Ezzat@hp.com  Off: 408-447-6380  Fax: 1408796-5427  Cell: 408-504-2603
> > > Personal: Email: AhmedEzzat@aol.com  Tel: 408-253-5062  Fax: 408-253-6271
> > >
> > >
> > >
> >

Received on Sunday, 21 December 2008 00:13:26 UTC