RE: RDB2RDF Usecase - Biomedical UseCase

Hello Satya,

This sounds right.  There are similar examples in a few verticals, including Telco, Energy, Retail, HLS & Healthcare, and Manufacturing; the key thing here is integrating structured and unstructured data across multiple data sources (yes, you will need to use ontologies)...

A second angle to this is what the BI space calls "Operational BI."  Its main distinguishing feature is "near-realtime" access to data in order to generate actionable analytics.  Most vendors today rely on some enhancement to their ETL (typically based on CDC, change data capture) and claim near-realtime.  And of course you need to capture changes from the different data sources into the warehouse.  This approach, even though it is "used by all," is really a broken model, and as a result most still use a hybrid of some incremental update with the traditional overnight batch window.  I believe that if you look at the problem as EII (i.e., not EAI, and not traditional data integration where you gather the data from the different sources into a warehouse using ETL the way it is used today) and instead use a semantic web integration layer above the data sources (RDF store + SPARQL), then you have an opportunity to achieve the TRUE Operational BI goal.  You do not need to continuously feed the warehouse with all changes from the other data sources in the enterprise (analysts estimate that the average enterprise has between 20 and 30 data sources).  When you translate the user query into federated subqueries (some SQL, the rest search queries), you always access data from the data sources in realtime, and there is no heavy scalability burden on the RDF store.  This approach has the potential to change the rules of the game...
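To make the federated idea concrete, here is a minimal sketch, not a real EII product. It assumes a hypothetical `purchases` table standing in for the warehouse and an in-memory list of emails standing in for the content source; the `company` field on each email is a simplification (a real system would need entity extraction and reconciliation). Each user query is split into one SQL subquery and one search subquery, both run against the live sources at query time, and the results are joined afterwards.

```python
import sqlite3

def sql_subquery(conn, min_amount):
    """Structured source: companies whose purchases total at least min_amount."""
    rows = conn.execute(
        "SELECT company FROM purchases GROUP BY company HAVING SUM(amount) >= ?",
        (min_amount,),
    )
    return {company for (company,) in rows}

def search_subquery(mailbox, keyword):
    """Unstructured source: companies whose emails mention the keyword."""
    return {m["company"] for m in mailbox if keyword in m["body"].lower()}

def federated_query(conn, mailbox, min_amount, keyword):
    """Run both subqueries against the live sources and intersect the results."""
    return sorted(sql_subquery(conn, min_amount) & search_subquery(mailbox, keyword))

# Hypothetical sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (company TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("Acme", 1_200_000), ("Globex", 300_000), ("Initech", 1_000_000)],
)
mailbox = [
    {"company": "Acme", "body": "Complaint: the shipment arrived damaged."},
    {"company": "Globex", "body": "Complaint about invoicing."},
    {"company": "Initech", "body": "Thanks for the quick delivery."},
]

result = federated_query(conn, mailbox, 1_000_000, "complaint")
print(result)

# Deleting an email changes the next answer immediately; nothing is reloaded.
mailbox.pop(0)
result_after_delete = federated_query(conn, mailbox, 1_000_000, "complaint")
print(result_after_delete)
```

Because the sources are read at query time, there is no copy of the mailbox to keep in sync with the warehouse.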

Think of email server content as one data source.  With the traditional ETL approach, every time you delete an email or receive a new one you need to update the data warehouse; otherwise you are not making decisions based on the most current data!!  The approach above is definitely much more promising and scalable....

Another observation: when we talk about structured and unstructured data, most analysts will tell you that the ratio of unstructured (content) to structured (SQL) data by volume is around 85:15 or 80:20.  However, the 15-20% of enterprise data that is structured is much more valuable, cleaner, and typically better behaved.  This makes the mapping from SQL to RDF critical...
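
As a rough illustration of what such a mapping can look like, the sketch below turns each row of a relational table into RDF-style triples using a naive convention: table becomes class, primary key becomes subject URI, column becomes property. The `http://example.org/` namespace, the `customer` table, and the function name are all assumptions made for the example; this is not a mapping defined anywhere in this thread.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical namespace, for illustration only

def rows_to_triples(conn, table, key_column):
    """Map every row of `table` to (subject, predicate, object) tuples.

    `table` is interpolated into the SQL text, so it must be a trusted name.
    """
    cur = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cur.description]
    triples = []
    for row in cur:
        record = dict(zip(columns, row))
        subject = f"{BASE}{table}/{record[key_column]}"
        triples.append((subject, "rdf:type", f"{BASE}{table}"))
        for column, value in record.items():
            # NULL columns produce no triple; the key is already in the subject.
            if column != key_column and value is not None:
                triples.append((subject, f"{BASE}{table}#{column}", value))
    return triples

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'Acme', NULL)")

triples = rows_to_triples(conn, "customer", "id")
for triple in triples:
    print(triple)
```
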
Regards,

Ahmed


From: Satya Sahoo [mailto:sahoo.2@wright.edu]
Sent: Friday, December 19, 2008 16:32
To: Ezzat, Ahmed
Cc: public-xg-rdb2rdf@w3.org
Subject: Re: RDB2RDF Usecase - Biomedical UseCase

Hi Ahmed,
You have pointed to a very critical objective for the RDB2RDF process - data integration and consequently the ability to pose queries across different types of data sources.

In addition to your example, the following is a write-up of our work, which I presented at the XG meeting in April 2008 (also cited by Soren Auer in their Triplify work as an example of integration). It can be considered as a data integration use case in the biomedical domain for the Recommendation:

Title: An ontology-driven integration of gene and biological pathway information: Application to the domain of nicotine dependence
--------------------

Background:
Complex biological queries generally require the integration of information from several sources. For example, gene information sources such as NCBI Entrez Gene, which holds gene-related records for ~2 million genes, need to be integrated with pathway information sources such as KEGG (Kyoto Encyclopedia of Genes and Genomes). Moreover, comparing results across model organisms requires homology information (provided, for example, by NCBI HomoloGene, which contains homology data for several completely sequenced eukaryotic organisms).

In the context of understanding the genetic basis of nicotine dependence, we integrate gene and pathway information and show how three complex biological queries can be answered by the integrated knowledge base.

Method:
We use an ontology-driven approach to integrate two gene resources (Entrez Gene and HomoloGene) and three pathway resources (KEGG, Reactome and BioCyc), for five organisms, including humans. We created the Entrez Knowledge Model (EKoM), an information model in OWL for the gene resources, and integrated it with the extant BioPAX ontology designed for pathway resources. The integrated schema is populated with data from the pathway resources, publicly available in BioPAX-compatible format, and gene resources for which a population procedure was created.

The SPARQL query language is used to formulate queries in the context of understanding the genetic basis of nicotine dependence over the integrated knowledge base:
1. Which genes participate in a large number of pathways?
2. Which genes are "hub genes" from the perspective of gene interaction?
3. Which genes are expressed in the brain, in the context of neurobiology of nicotine dependence and various neurotransmitters in the central nervous system?
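
As a hedged illustration of query 1, the sketch below counts the distinct pathways per gene over a handful of made-up triples. The `ex:participatesIn` predicate and the `gene:`/`pathway:` identifiers are placeholders, not terms from EKoM or BioPAX, and the SPARQL text in the comment uses SPARQL 1.1 aggregate syntax, which is not necessarily how the original queries were written.

```python
from collections import defaultdict

# Roughly equivalent SPARQL 1.1 (hypothetical ex: vocabulary):
#   PREFIX ex: <http://example.org/>
#   SELECT ?gene (COUNT(DISTINCT ?pathway) AS ?n)
#   WHERE { ?gene ex:participatesIn ?pathway }
#   GROUP BY ?gene
#   HAVING (COUNT(DISTINCT ?pathway) >= 2)
#   ORDER BY DESC(?n)

# Made-up triples, for illustration only.
triples = [
    ("gene:A", "ex:participatesIn", "pathway:P1"),
    ("gene:A", "ex:participatesIn", "pathway:P2"),
    ("gene:A", "ex:participatesIn", "pathway:P3"),
    ("gene:B", "ex:participatesIn", "pathway:P1"),
    ("gene:C", "ex:participatesIn", "pathway:P2"),
    ("gene:C", "ex:participatesIn", "pathway:P3"),
]

def pathways_per_gene(triples):
    """Count the distinct pathways each gene participates in."""
    pathways = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "ex:participatesIn":
            pathways[subject].add(obj)
    return {gene: len(found) for gene, found in pathways.items()}

def hub_genes(triples, min_pathways):
    """Genes in at least min_pathways pathways, most connected first."""
    counts = pathways_per_gene(triples)
    hubs = [gene for gene, n in counts.items() if n >= min_pathways]
    return sorted(hubs, key=lambda gene: (-counts[gene], gene))

hubs = hub_genes(triples, 2)
print(hubs)
```

The threshold for "a large number of pathways" is a parameter here because the write-up does not state one.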

Implementation:
The total number of RDF triples generated in the knowledge base is about 1.5 million, with 334,438 triples from Entrez Gene, 695,301 triples from Reactome, 175,160 triples from BioCyc, and 352,793 triples from KEGG. The Oracle 10g database management system was used to store and query the triples.

Results:
The queries could easily identify hub genes, i.e., those genes whose gene products participate in many pathways or interact with many other gene products.

Reference: http://dx.doi.org/10.1016/j.jbi.2008.02.006

Cheers,
Satya

http://knoesis.wright.edu/researchers/satya

----- Original Message -----
From: "Ezzat, Ahmed" <Ahmed.Ezzat@hp.com>
Date: Friday, December 19, 2008 5:16 pm
Subject: Re: RDB2RDF Usecase
To: "public-xg-rdb2rdf@w3.org" <public-xg-rdb2rdf@w3.org>
>
> Hello,
>
> One observation I have is that we need to be clearer about the role of Rdb2Rdf in solving the silo pain.  Rdb2Rdf is a necessary but not sufficient technology for integrating silos: you need to reconcile the results from each data source before the data is useful enough to apply SPARQL, for example, and that reconciliation is outside the Rdb2Rdf framework.
>
> Regarding user scenarios, I see a lot of value in the Enterprise Information Management (EIM) area, where you integrate the data warehouse with content in the enterprise (i.e., not using the current technology of NLP + converting to XML and then shredding elements into data warehouse database columns) to be able to return more actionable information.  For example, a query to a data warehouse today can be: "tell me all companies that bought $1M of equipment last month" <-- easy one.  Now, with integration of structured and unstructured data in the enterprise, you can ask: "tell me all companies that bought $1M of equipment and had complaints?"  The point here is that customer complaints are typically in email content, while the list of companies that bought is in the data warehouse.  By being able to integrate the results of search and SQL at a high level as RDF sub-graphs, etc., you can answer the 2nd question transparently w/o manual work.
>
> In summary, I suggest positioning Rdb2Rdf as a core technology that would help in solving higher-level problems like some of the examples in this email thread.
> Regards,
>
> Ahmed
>
>
>
> Ahmed K. Ezzat, Ph.D.
> HP Fellow, Business Intelligence Software Division
> Hewlett-Packard Corporation
> 11000 Wolf Road, Bldg 42 Upper, MS 4502, Cupertino, CA 95014-0691
> Office:      Email: Ahmed.Ezzat@hp.com Off: 408-447-6380  Fax: 1408796-5427  Cell: 408-504-2603
> Personal: Email: AhmedEzzat@aol.com Tel: 408-253-5062  Fax:  408-253-6271
>
>
>

Received on Saturday, 20 December 2008 06:06:08 UTC