Re: Revised version of Proposed XG Recommendation from Li L Ma on 2009-01-06 (public-xg-rdb2rdf@w3.org from January 2009)

From: Li L Ma <malli@cn.ibm.com>
Date: Tue, 6 Jan 2009 11:37:36 +0800
To: ashok.malhotra@oracle.com
Cc: public-xg-rdb2rdf <public-xg-rdb2rdf@w3.org>, public-xg-rdb2rdf-request@w3.org
Message-ID: <OF81B72714.C168FB88-ON48257536.000CC68B-48257536.0013C725@cn.ibm.com>
Hi Ashok and all,

Happy New Year!

I have some comments on the following statement in the recommendation
report:
The final language specification should include guidance with regard to 
mapping Relational data to a subset of OWL such as OQL/QL or OWL/RL.

<ML>This is an interesting problem. I have some experience using the
SNOMED CT ontology and the HL7 RIM ontology to access and query clinical
data stored in relational databases. We used the D2RQ language to create a
mapping between the RIM ontology and the clinical data. The clinical data
is already manually annotated with terms defined in the SNOMED CT ontology
(observation values are taken from SNOMED CT), so we did not need to build
another mapping between SNOMED CT and the clinical data. The expressivity
of SNOMED CT and RIM used in our project is OWL EL and RDFS, respectively.
The relational data model of the clinical data is derived from HL7 RIM and
matches the RIM ontology naturally, so it is not hard to build a
RIM2Clinical_data mapping. If the clinical data were NOT already annotated
with SNOMED CT terms by hand, I think it would be very hard and
time-consuming to map the data to SNOMED CT, which defines the standard
domain vocabulary/terms, using a mapping language such as D2RQ. The
difficulty is that value correspondences must be specified one by one
(for example, stating that an observation value stored in a relational
cell is equivalent to a particular term in SNOMED CT). So far, I have not
seen other mapping problems caused by the expressivity of domain
ontologies similar to SNOMED CT.

For the RIM ontology, we did find limitations in D2RQ. Clinical
observations are often stored in a relational table of the form
Observation(INT entity, INT property, INT value), which is almost the same
as a triple table. As you know, a table is very often mapped to a class
and a column to a property. But to publish clinical observations, we had
to extend the D2RQ mapping language with additional constructs. The
extension is also useful in similar scenarios, such as product information
management, which uses a table of three columns (product, attribute,
value) to store various product data.
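
As an illustration of why such a table does not fit the usual
table-to-class, column-to-property pattern, here is a minimal Python
sketch that turns rows of an Observation-style table into triples. It is
not the D2RQ extension mentioned above. The base URI, the PROPERTY_NAMES
lookup, the function name, and the sample rows are all assumptions made
up for the example.

```python
# Hypothetical base URI and lookup table; neither comes from the project
# described in the message.
BASE = "http://example.org/clinical/"

# Maps the integer codes found in the "property" column to property names.
PROPERTY_NAMES = {1: "bodyTemperature", 2: "heartRate"}


def observation_rows_to_triples(rows):
    """Turn (entity, property, value) rows into (subject, predicate, object).

    Unlike a column-to-property mapping, the predicate is chosen per row
    from the value stored in the "property" column.
    """
    triples = []
    for entity, prop, value in rows:
        subject = f"{BASE}entity/{entity}"
        # Fall back to the raw code when no name is known for it.
        predicate = f"{BASE}property/{PROPERTY_NAMES.get(prop, prop)}"
        triples.append((subject, predicate, value))
    return triples


# Made-up sample rows, for illustration only.
rows = [(101, 1, 37), (101, 2, 72), (102, 1, 38)]
for triple in observation_rows_to_triples(rows):
    print(triple)
```

The point of the sketch is that the predicate depends on row data, so a
mapping language needs some construct that derives properties from cell
values rather than from column names.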

BTW, there may be a typo in your description. "OQL/QL" should be "OWL QL". 

</ML>

Best Regards,

Li MA, Ph.D
Manager, Semantic Technologies
IBM China Research Lab
TEL:   86-10-58748078 
T/L:   11905 ext. 8078
FAX:   86-10-58748731
E-Mail:   MaLLi@cn.ibm.com
Homepage: http://www.research.ibm.com/people/m/mali



ashok malhotra <ashok.malhotra@oracle.com>
Sent by: public-xg-rdb2rdf-request@w3.org
2009-01-03 01:16
Please respond to: ashok.malhotra@oracle.com
To: public-xg-rdb2rdf <public-xg-rdb2rdf@w3.org>
cc:
Subject: Revised version of Proposed XG Recommendation

I did some work on this over the holidays which included adding the two 
usecases.
I'll be happy to add more usecases if folks want to write them.

Please take a careful look.  We are getting close to the deadline so your
close scrutiny is important.
-- 
All the best, Ashok

W3C RDB2RDF Incubator Group Report
01 January 2009
This version:
http://www.w3.org/XG_Recommendation/2009/RDB2RDF_XG-20090101 
Latest version:
http://www.w3.org/XG_Recommendation/RDB2RDF_XG 
Author:
Ashok Malhotra (editor), Oracle
Copyright © 2008 W3C. All rights reserved. This document is available 
under the W3C Document License. See the W3C Intellectual Rights Notice and 
Legal Disclaimers for additional information. 

Abstract
This is the final recommendation from the RDB2RDF XG. The XG recommends 
that the W3C initiate a WG to standardize a language for mapping 
Relational Database schemas into RDF and OWL. 
Status of this Document
This section describes the status of this document at the time of its 
publication. Other documents may supersede this document. A list of 
current W3C publications can be found in the W3C technical reports index 
at http://www.w3.org/TR/. 
This is the final recommendation from the RDB2RDF XG.
Table of Contents
1 Recommendation
    1.1 Usecases 
        1.1.1 Integrating Databases to Research Nicotine Dependency
        1.1.2 Triplify: Exposing Relational Data on the Web
    1.2 Liaisons
    1.3 Starting Points
2 References

1 Recommendation
The RDB2RDF XG recommends that the W3C initiate a WG to standardize a 
language for mapping Relational Database schemas into RDF and OWL. Such a 
standard will enable the vast amounts of data stored in Relational 
databases to be published easily and conveniently on the Web. It will also 
facilitate integrating data from separate Relational databases and adding 
semantics to Relational data.
This recommendation is based on a survey of the State Of the Art
conducted by the XG [StateOfArt] as well as the usecases discussed below.
The mapping language should be complete with respect to the relational
algebra. It should have a human-readable syntax as well as XML and RDF
representations of the syntax for purposes of discovery and machine
generation.
There is a strong suggestion that the mapping language be expressed in 
rules as defined by the W3C [RIF] WG. The syntax does not have to follow 
the [RIF] syntax but should be isomorphic to it. The output of the mapping 
should be defined in terms of an RDFS/OWL schema.
It should be possible to subset the language for simple applications such 
as Web 2.0. This feature of the language will be validated by creating a 
library of mappings for widely used apps such as Drupal, Wordpress, phpBB.
[Michael Hausenblas will help with creating test cases].
The mapping language will allow customization with regard to names and 
data transformation. In addition, the language must be able to expose 
vendor specific SQL features such as full-text and spatial support and 
vendor-defined datatypes.
The final language specification should include guidance with regard to 
mapping Relational data to a subset of OWL such as OQL/QL or OWL/RL.
The language must allow for a mechanism to create identifiers for database
entities. The generation of identifiers should be designed to support the
implementation of the linked data principles [LinkedData]. Where possible,
the language will encourage the reuse of public identifiers for long-lived
entities such as persons, corporations, geo-locations, etc. See below.
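
A minimal Python sketch of one way identifiers could be created for
database entities, preferring an existing public identifier when one is
known. This is an illustration only and not part of any proposed language.
The base URI, the function names, and the public_ids lookup are
assumptions made up for the example.

```python
from urllib.parse import quote

# Hypothetical base URI; a real deployment would use its own namespace.
BASE = "http://example.org/db/"


def mint_uri(table, key_values):
    """Build an identifier from a table name and its primary-key values."""
    # Percent-encode every part so keys containing "/" or spaces stay safe.
    key_part = "/".join(quote(str(v), safe="") for v in key_values)
    return f"{BASE}{quote(table, safe='')}/{key_part}"


def entity_uri(table, key_values, public_ids=None):
    """Reuse a known public identifier if there is one, else mint a local one."""
    public_ids = public_ids or {}
    known = public_ids.get((table, tuple(key_values)))
    return known or mint_uri(table, key_values)
```

For example, mint_uri("person", [42]) yields
http://example.org/db/person/42, a stable HTTP URI that can be
dereferenced in line with the linked data principles.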
1.1 Usecases 
To bootstrap exploitation of the Web as a globally accessible linked 
database, we need a few essentials:
Web accessible data increases in granularity and cross linkage.
Web applications and solutions produce structured interlinked data as 
extensions of existing functionality.
Web users are shielded from the underlying complexity of injecting 
structured linked data into the Web.
1.1.1 Integrating Databases to Research Nicotine Dependency
Complex biological queries generally require the integration of 
information from several sources. To understand the genetic basis of 
nicotine dependence, we needed to integrate gene and pathway information 
and answer three complex biological queries using the integrated knowledge 
base. The gene information source NCBI Entrez Gene, which holds
gene-related records for ~2 million genes, needed to be integrated with
pathway information sources such as KEGG (Kyoto Encyclopedia of Genes and
Genomes). Comparing results across model organisms requires homology
information provided by NCBI HomoloGene, which contains homology data for
several completely sequenced eukaryotic organisms.
We used an ontology-driven approach to integrate the two gene resources
(Entrez Gene and HomoloGene) and three pathway resources (KEGG, Reactome
and BioCyc). We created the Entrez Knowledge Model (EKoM), an information model
in OWL for the gene resources, and integrated it with the extant BioPAX 
ontology designed for pathway resources. The integrated schema was 
populated with data from the pathway resources, publicly available in 
BioPAX-compatible format, and gene resources for which a population 
procedure was created. 
SPARQL was used to formulate queries to investigate the genetic basis of 
nicotine dependence over the integrated knowledge base: 
Which genes participate in a large number of pathways?
Which genes are "hub genes" from the perspective of gene interaction?
Which genes are expressed in the brain, in the context of neurobiology of 
nicotine dependence and various neurotransmitters in the central nervous 
system?
We found that the queries could easily identify hub genes, i.e., those 
genes whose gene products participate in many pathways or interact with 
many other gene products. See [NicotineDependence] for details.
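
The notion of a hub gene used here (a gene that participates in many
pathways) can be sketched in a few lines of Python. This is only an
illustration of the counting idea, not the SPARQL queries used in the
study. The function name, the threshold parameter, and the sample data
are made up for the example.

```python
from collections import defaultdict


def hub_genes(participations, min_pathways):
    """Return genes that take part in at least min_pathways distinct pathways.

    participations is an iterable of (gene, pathway) pairs.
    """
    pathways_by_gene = defaultdict(set)
    for gene, pathway in participations:
        # A set ignores repeated (gene, pathway) pairs.
        pathways_by_gene[gene].add(pathway)
    return sorted(
        gene
        for gene, pathways in pathways_by_gene.items()
        if len(pathways) >= min_pathways
    )


# Made-up sample data, for illustration only.
sample = [
    ("geneA", "pathway1"),
    ("geneA", "pathway2"),
    ("geneB", "pathway1"),
    ("geneA", "pathway2"),
]
print(hub_genes(sample, min_pathways=2))
```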
1.1.2 Triplify: Exposing Relational Data on the Web
In order to make the Semantic Web useful to ordinary Web users, RDF and 
OWL have to be deployed on the Web on a much larger scale. Web 
applications such as Content Management Systems, online shops or community 
applications (e.g. Wikis, Blogs, Fora) already store their data in 
relational databases [Triplify]. Providing a standardized way to map the
relational data structures behind these Web applications into RDF, 
RDF-Schema and OWL will facilitate broad penetration and enrich the Web 
with RDF data and ontologies and facilitate novel semantic browsing and 
search applications.
By supporting the long tail of Web applications, and thus counteracting
the centralization of Web 2.0 applications, the planned RDB2RDF
standardization will help give control over data back to end users and
thus promote a democratization of the Web.
To support this usecase scenario, the mapping language should be easily 
implementable for lightweight Web applications and have a shallow learning 
curve to foster early adoption by Web developers.
1.2 Liaisons
The WG must track the evolution of SPARQL and liaise with the DAWG WG as 
well as the OWL WG. The proposed WG will also keep track of work on 
assigning unique identifiers to well-known entities such as the ENS system 
associated with the OKKAM project [OKKAM] and the Common Naming Project 
started by Neuro Commons [Common Naming Project].
1.3 Starting Points
The WG will take as its starting point the mapping languages developed by 
the [D2RQ] and [Virtuoso] efforts.
2 References
Common Naming Project
Neuro Commons Common Naming Project , Science Commons, Sept 17, 2008. (See 
http://neurocommons.org/page/Common_Naming_Project.)
D2RQ
The D2RQ Platform v0.5.1, User Manual and Language Specification , Chris 
Bizer, Richard Cyganiak, Jorg Garbers, Oliver Maresch (See 
http://www4.wiwiss.fu-berlin.de/bizer/D2RQ/spec/.)
RIF
W3C Rule Interchange Format Working Group (See 
http://www.w3.org/2005/rules/wiki/RIF_Working_Group.)
LinkedData
Design Issues for Linked Data, Tim Berners-Lee (See 
http://www.w3.org/DesignIssues/LinkedData.html.)
StateOfArt
Mapping Relational Data to RDF and OWL: A Literature Survey, Satya Sahoo, 
Wolfgang Halb (See http://esw.w3.org/topic/Rdb2RdfXG/.)
OKKAM
An Entity Name System (ENS) for the Semantic Web, Paolo Bouquet, Heiko 
Stoermer, Barbara Bazzanella, January 2008. (See http://www.okkam.org/.)
Virtuoso
Virtuoso Open-Source Edition (See 
http://virtuoso.openlinksw.com/wiki/main/Main/.)
Triplify
Triplify - Lightweight Linked Data Publication from Relational Databases,
submitted to WWW 2009, Auer, Dietzold, Lehmann, Hellmann, Aumueller (See
http://www.informatik.uni-leipzig.de/~auer/publication/triplify.pdf.)
NicotineDependence
An ontology-driven semantic mashup of gene and biological pathway
information: Application to the domain of nicotine dependence, Satya S.
Sahoo, Olivier Bodenreider, Joni L. Rutter, Karen J. Skinner and Amit P.
Sheth (See http://dx.doi.org/10.1016/j.jbi.2008.02.006.)
Attachments

image/gif attachment: 01-part
image/gif attachment: 02-part
Received on Tuesday, 6 January 2009 03:41:04 UTC