Re: RDB2RDF Usecase - Biomedical UseCase from ashok malhotra on 2008-12-20 (public-xg-rdb2rdf@w3.org from December 2008)

From: ashok malhotra <ashok.malhotra@oracle.com>
Date: Sat, 20 Dec 2008 08:29:38 -0800
To: Satya Sahoo <sahoo.2@wright.edu>
CC: "Ezzat, Ahmed" <Ahmed.Ezzat@hp.com>, "public-xg-rdb2rdf@w3.org" <public-xg-rdb2rdf@w3.org>
Message-ID: <494D1D72.4040408@oracle.com>
Hi Satya:
This is a good use case, but please confirm that the underlying databases 
are relational databases.
I ask because I know that some biomedical work uses special-purpose 
databases.
All the best, Ashok


Satya Sahoo wrote:
> Hi Ahmed,
> You have pointed to a very critical objective for the RDB2RDF process 
> - data integration and consequently the ability to pose queries across 
> different types of data sources.
>
> In addition to your example, the following is a write-up of work I 
> presented at the XG meeting in April 2008 (also cited by Soren Auer 
> in the Triplify work as an example of integration). It can be 
> considered a data integration use case in the biomedical domain for 
> the Recommendation:
>
> Title: An ontology-driven integration of gene and biological pathway 
> information: Application to the domain of nicotine dependence
> --------------------
>
> Background:
> Complex biological queries generally require the integration of 
> information from several sources. For example, gene information 
> sources, such as NCBI Entrez Gene, which holds records for ~2 million 
> genes, need to be integrated with pathway information sources, such 
> as KEGG (Kyoto Encyclopedia of Genes and Genomes). 
> Moreover, comparing results across model organisms requires homology 
> information (provided for example by NCBI HomoloGene, containing 
> homology data for several completely sequenced eukaryotic organisms).
>
> In the context of understanding the genetic basis of nicotine 
> dependence, we integrate gene and pathway information and show how 
> three complex biological queries can be answered by the integrated 
> knowledge base.
>
> Method:
> We use an ontology-driven approach to integrate two gene resources 
> (Entrez Gene and HomoloGene) and three pathway resources (KEGG, 
> Reactome and BioCyc), for five organisms, including humans. We created 
> the Entrez Knowledge Model (EKoM), an information model in OWL for the 
> gene resources, and integrated it with the extant BioPAX ontology 
> designed for pathway resources. The integrated schema is populated 
> with data from the pathway resources, publicly available in 
> BioPAX-compatible format, and gene resources for which a population 
> procedure was created.
>
> The SPARQL query language is used to formulate queries in the context 
> of understanding the genetic basis of nicotine dependence over the 
> integrated knowledge base:
> 1. Which genes participate in a large number of pathways?
> 2. Which genes are "hub genes" from the perspective of gene interaction?
> 3. Which genes are expressed in the brain, in the context of 
> neurobiology of nicotine dependence and various neurotransmitters in 
> the central nervous system?
>
> Implementation:
> The knowledge base contains about 1.5 million RDF triples: 334,438 
> from Entrez Gene, 695,301 from Reactome, 175,160 from BioCyc, and 
> 352,793 from KEGG. The Oracle 10g database management system was used 
> to store and query the triples.
>
> Results:
> The queries could easily identify hub genes, i.e., those genes whose 
> gene products participate in many pathways or interact with many other 
> gene products.
>
> Reference: http://dx.doi.org/10.1016/j.jbi.2008.02.006
>
> Cheers,
> Satya
>
> http://knoesis.wright.edu/researchers/satya
>
> ----- Original Message -----
> From: "Ezzat, Ahmed" <Ahmed.Ezzat@hp.com>
> Date: Friday, December 19, 2008 5:16 pm
> Subject: Re: RDB2RDF Usecase
> To: "public-xg-rdb2rdf@w3.org" <public-xg-rdb2rdf@w3.org>
>
> >
> > Hello,
> >
> > One observation I have is that we need to be clearer about how Rdb2Rdf 
> helps with the silo pain. Rdb2Rdf is a necessary but not sufficient 
> technology for integrating silos: you need to reconcile the results 
> from each data source before the data is useful enough to query with 
> SPARQL, for example, and that reconciliation is outside the Rdb2Rdf 
> framework.
> >
> > Regarding user scenarios, I see a lot of value in the Enterprise 
> Information Management (EIM) area, where you integrate the data 
> warehouse with content in the enterprise (i.e., not using the current 
> technology of NLP + converting to XML, then shredding elements into 
> data warehouse database columns) to be able to return more actionable 
> information. For example, a query to a data warehouse today can be 
> "tell me all companies that bought $1M of equipment last month" 
> (an easy one). Now, with integration of structured and unstructured 
> data in the enterprise, you can ask "tell me all companies that 
> bought $1M of equipment and had complaints?" The point here is that 
> customer complaints are typically in email content, while the list of 
> companies that bought is in the data warehouse. By being able to 
> integrate the results of search and SQL at a high level as RDF 
> sub-graphs, etc., you can answer the 2nd question transparently 
> without manual work.
> >
> > In summary, I suggest positioning Rdb2Rdf as a core technology that 
> would help in solving higher-level problems like some of the examples 
> in this email thread.
> > Regards,
> >
> > Ahmed
> >
> >
> >
> > Ahmed K. Ezzat, Ph.D.
> > HP Fellow, Business Intelligence Software Division
> > Hewlett-Packard Corporation
> > 11000 Wolf Road, Bldg 42 Upper, MS 4502, Cupertino, CA 95014-0691
> > Office: Email: Ahmed.Ezzat@hp.com  Off: 408-447-6380  Fax: 1408796-5427  Cell: 408-504-2603
> > Personal: Email: AhmedEzzat@aol.com  Tel: 408-253-5062  Fax: 408-253-6271
> >
> >
> >
>
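
The first query in Satya's write-up ("Which genes participate in a large number of pathways?") can be illustrated with a minimal sketch over an in-memory list of triples. This is only an illustration of the counting logic, not the SPARQL or Oracle setup used in the cited work: the predicate name ex:participatesIn, the gene and pathway identifiers, and the min_pathways threshold are all hypothetical, since the thread does not give the actual EKoM/BioPAX vocabulary.

```python
from collections import defaultdict

# Hypothetical predicate; the real EKoM/BioPAX property names are not
# given in this thread.
PARTICIPATES_IN = "ex:participatesIn"


def pathways_per_gene(triples):
    """Map each gene to the set of distinct pathways it participates in."""
    result = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == PARTICIPATES_IN:
            result[subject].add(obj)
    return result


def hub_genes(triples, min_pathways=2):
    """Return (gene, pathway_count) pairs with at least min_pathways
    distinct pathways, highest count first, ties broken by gene name."""
    counts = {g: len(ps) for g, ps in pathways_per_gene(triples).items()}
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(gene, n) for gene, n in ranked if n >= min_pathways]


# Made-up sample data, for illustration only.
TRIPLES = [
    ("ex:geneA", PARTICIPATES_IN, "ex:pathway1"),
    ("ex:geneA", PARTICIPATES_IN, "ex:pathway2"),
    ("ex:geneA", PARTICIPATES_IN, "ex:pathway3"),
    ("ex:geneB", PARTICIPATES_IN, "ex:pathway1"),
    ("ex:geneC", PARTICIPATES_IN, "ex:pathway2"),
    ("ex:geneC", PARTICIPATES_IN, "ex:pathway3"),
    ("ex:geneC", PARTICIPATES_IN, "ex:pathway3"),  # duplicate, counted once
    ("ex:geneA", "ex:label", "placeholder"),       # ignored: other predicate
]

if __name__ == "__main__":
    for gene, n in hub_genes(TRIPLES):
        print(gene, n)
```

Counting distinct pathways per gene (rather than raw triples) keeps duplicate statements from different sources from inflating the ranking, which matters once several pathway resources are merged into one knowledge base.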
Received on Saturday, 20 December 2008 16:30:50 UTC