- From: Satya Sahoo <sahoo.2@wright.edu>
- Date: Sat, 20 Dec 2008 19:12:42 -0500
- To: ashok.malhotra@oracle.com
- Cc: "Ezzat, Ahmed" <Ahmed.Ezzat@hp.com>, "public-xg-rdb2rdf@w3.org" <public-xg-rdb2rdf@w3.org>
- Message-id: <6920ea95ac7e.494d43aa@wright.edu>
Hi Ashok,

The NCBI Entrez documentation page,
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.section.605,
states that it "...integrates data from a large number of sources,
formats, and databases...". It mentions that the “back-end” databases
"...might be Sybase or Microsoft SQL Server relational databases of a
variety of schemas or text files of various formats".

Since we used the records in NCBI Entrez Gene, which are centered around
a gene, I am not sure we can verify which part of a record (or indeed
the full record) was sourced only from an RDB.

Cheers,
Satya

----- Original Message -----
From: ashok malhotra <ashok.malhotra@oracle.com>
Date: Saturday, December 20, 2008 11:30 am
Subject: Re: RDB2RDF Usecase - Biomedical UseCase
To: Satya Sahoo <sahoo.2@wright.edu>
Cc: "Ezzat, Ahmed" <Ahmed.Ezzat@hp.com>, "public-xg-rdb2rdf@w3.org" <public-xg-rdb2rdf@w3.org>

> Hi Satya:
> This is a good usecase but please confirm that the underlying
> databases are relational databases.
> I know that some biomedical work uses special-purpose databases,
> that's why I'm asking.
> All the best, Ashok
>
> Satya Sahoo wrote:
> > Hi Ahmed,
> > You have pointed to a very critical objective for the RDB2RDF
> > process - data integration and consequently the ability to pose
> > queries across different types of data sources.
> >
> > In addition to your example, the following is a write-up of our
> > work I had presented to the XG meeting in April 2008 (also cited by
> > Soren Auer in their Triplify work as an example of integration)
> > that can be considered as a data integration use case in the
> > biomedical domain for the Recommendation:
> >
> > Title: An ontology-driven integration of gene and biological
> > pathway information: Application to the domain of nicotine
> > dependence
> > --------------------
> >
> > Background:
> > Complex biological queries generally require the integration of
> > information from several sources.
> > For example, gene information sources, such as the NCBI Entrez
> > Gene, which has gene-related records of ~2 million genes, need to
> > be integrated with pathway information sources, such as KEGG (Kyoto
> > Encyclopedia of Genes and Genomes). Moreover, comparing results
> > across model organisms requires homology information (provided for
> > example by NCBI HomoloGene, containing homology data for several
> > completely sequenced eukaryotic organisms).
> >
> > In the context of understanding the genetic basis of nicotine
> > dependence, we integrate gene and pathway information and show how
> > three complex biological queries can be answered by the integrated
> > knowledge base.
> >
> > Method:
> > We use an ontology-driven approach to integrate two gene resources
> > (Entrez Gene and HomoloGene) and three pathway resources (KEGG,
> > Reactome and BioCyc), for five organisms, including humans. We
> > created the Entrez Knowledge Model (EKoM), an information model in
> > OWL for the gene resources, and integrated it with the extant
> > BioPAX ontology designed for pathway resources. The integrated
> > schema is populated with data from the pathway resources, publicly
> > available in BioPAX-compatible format, and gene resources for which
> > a population procedure was created.
> >
> > The SPARQL query language is used to formulate queries in the
> > context of understanding the genetic basis of nicotine dependence
> > over the integrated knowledge base:
> > 1. Which genes participate in a large number of pathways?
> > 2. Which genes are "hub genes" from the perspective of gene
> >    interaction?
> > 3. Which genes are expressed in the brain, in the context of
> >    neurobiology of nicotine dependence and various
> >    neurotransmitters in the central nervous system?
> >
> > Implementation:
> > The total number of RDF triples generated in the knowledge base is
> > about 1.5 million, with 334,438 triples from Entrez Gene; 695,301
> > triples from Reactome; 175,160 triples from BioCyc; and 352,793
> > triples from KEGG. The Oracle 10g database management system was
> > used to store and query the triples.
> >
> > Results:
> > The queries could easily identify hub genes, i.e., those genes
> > whose gene products participate in many pathways or interact with
> > many other gene products.
> >
> > Reference: http://dx.doi.org/10.1016/j.jbi.2008.02.006
> >
> > Cheers,
> > Satya
> >
> > http://knoesis.wright.edu/researchers/satya
> >
> > ----- Original Message -----
> > From: "Ezzat, Ahmed" <Ahmed.Ezzat@hp.com>
> > Date: Friday, December 19, 2008 5:16 pm
> > Subject: Re: RDB2RDF Usecase
> > To: "public-xg-rdb2rdf@w3.org" <public-xg-rdb2rdf@w3.org>
> >
> > > Hello,
> > >
> > > One observation I have is that we need to be clearer on Rdb2Rdf
> > > for solving the silo pain. Rdb2Rdf is a necessary but not
> > > sufficient technology to integrate silos, as you need to
> > > reconcile the results from each data source together before the
> > > data is useful enough to apply SPARQL, for example, and that
> > > reconciliation is outside the Rdb2Rdf framework.
> > >
> > > Regarding user scenario, I see a lot of value in the Enterprise
> > > Information Management (EIM) area, where you integrate the data
> > > warehouse with content in the enterprise (i.e., not using current
> > > technology of NLP + converting to XML then shredding elements in
> > > the data warehouse database columns) to be able to return more
> > > actionable information. For example, a query to a data warehouse
> > > today can be: “tell me all companies that bought $1M of equipment
> > > last month” (an easy one).
> > > Now with integration of structured and unstructured data in the
> > > enterprise you can ask: “tell me all companies that bought $1M of
> > > equipment and had complaints?” The point here is that customer
> > > complaints are typically in email content, and the list of
> > > companies that bought is in the data warehouse. By being able to
> > > integrate the results of search and SQL at a high level as RDF
> > > sub-graphs, etc., you can answer the 2nd question transparently
> > > w/o manual work.
> > >
> > > In summary, I suggest positioning Rdb2Rdf as a core technology
> > > that would help in solving higher-level problems like some of the
> > > examples in this email thread.
> > >
> > > Regards,
> > >
> > > Ahmed
> > >
> > > Ahmed K. Ezzat, Ph.D.
> > > HP Fellow, Business Intelligence Software Division
> > > Hewlett-Packard Corporation
> > > 11000 Wolf Road, Bldg 42 Upper, MS 4502, Cupertino, CA 95014-0691
> > > Office: Email: Ahmed.Ezzat@hp.com Off: 408-447-6380 Fax: 1408796-5427 Cell: 408-504-2603
> > > Personal: Email: AhmedEzzat@aol.com Tel: 408-253-5062 Fax: 408-253-6271
Received on Sunday, 21 December 2008 00:13:26 UTC