W3C RDB2RDF Incubator Group Report

1 Recommendation

The RDB2RDF XG recommends that the W3C initiate a WG to standardize a language for mapping Relational Database schemas into RDF and OWL. Such a standard will enable the vast amounts of data stored in Relational databases to be published easily and conveniently on the Web. It will also facilitate integrating data from separate Relational databases and adding semantics to Relational data.

This recommendation is based on a survey of the state of the art conducted by the XG [StateOfArt] as well as the usecases discussed below.

The mapping language should be complete when compared to the relational algebra. It should have a human-readable syntax as well as XML and RDF representations of the syntax for purposes of discovery and machine generation.

There is a strong suggestion that the mapping language be expressed in rules as defined by the W3C [RIF] WG. The syntax does not have to follow the [RIF] syntax but should be isomorphic to it. The output of the mapping should be defined in terms of an RDFS/OWL schema.

It should be possible to subset the language for simple applications such as Web 2.0. This feature of the language will be validated by creating a library of mappings for widely used applications such as Drupal, WordPress and phpBB.

[Michael Hausenblas will help with creating test cases].

The mapping language will allow customization with regard to names and data transformation. In addition, the language must be able to expose vendor-specific SQL features such as full-text and spatial support and vendor-defined datatypes.
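As an illustration of what customization of names and data transformation could look like, the following is a minimal Python sketch, not a proposal for the mapping language itself. The function map_row, the VOCAB base URI and the sample row are hypothetical; the sketch assumes each column becomes one predicate unless a custom name is supplied, and that SQL NULL values produce no triple.

```python
from datetime import date

VOCAB = "http://example.org/vocab/"  # hypothetical default vocabulary base


def map_row(subject_uri, row, predicate_names=None, transforms=None):
    """Turn one relational row (a dict of column -> value) into triples.

    predicate_names overrides the default predicate for a column;
    transforms supplies a per-column value conversion function.
    """
    predicate_names = predicate_names or {}
    transforms = transforms or {}
    triples = []
    for column, value in row.items():
        if value is None:  # assumption: SQL NULL emits no triple
            continue
        predicate = predicate_names.get(column, VOCAB + column)
        transform = transforms.get(column, lambda v: v)
        triples.append((subject_uri, predicate, transform(value)))
    return triples


# Hypothetical row: "name" is renamed to foaf:name, "dob" is parsed into a date.
row = {"name": "Ada", "dob": "1815-12-10", "nick": None}
triples = map_row(
    "http://example.org/resource/person/1",
    row,
    predicate_names={"name": "http://xmlns.com/foaf/0.1/name"},
    transforms={"dob": date.fromisoformat},
)
```

Keeping the name and transformation tables separate from the row-walking logic is one way to let simple applications use the defaults while others override individual columns.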

The final language specification should include guidance with regard to mapping Relational data to a subset of OWL such as OWL/QL or OWL/RL.

The language must allow for a mechanism to create identifiers for database entities. The generation of identifiers should be designed to support the implementation of the linked data principles [LinkedData]. Where possible, the language will encourage the reuse of public identifiers for long-lived entities such as persons, corporations, geo-locations, etc. See below.
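One possible reading of the identifier requirement is sketched below in Python. It assumes identifiers are HTTP URIs built from a table name and its primary-key values, and that a lookup table of already-known public identifiers is consulted first. The base URI, the function names and the public_ids table are hypothetical and only serve the illustration.

```python
from urllib.parse import quote

BASE = "http://example.org/resource/"  # hypothetical base URI


def mint_uri(table, key_values, base=BASE):
    """Build a stable URI from a table name and its primary-key values."""
    key_part = "/".join(quote(str(v), safe="") for v in key_values)
    return f"{base}{quote(table, safe='')}/{key_part}"


def entity_uri(table, key_values, public_ids=None):
    """Prefer a known public identifier; otherwise mint a local URI."""
    public_ids = public_ids or {}
    known = public_ids.get((table, tuple(key_values)))
    return known or mint_uri(table, key_values)
```

For example, mint_uri("person", [42]) yields http://example.org/resource/person/42, while entity_uri returns the public identifier instead whenever the (table, key) pair appears in public_ids. Percent-encoding every key value keeps composite keys unambiguous even when a value contains a slash.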

1.1 Usecases

To bootstrap exploitation of the Web as a globally accessible linked database, we need a few essentials:

Web accessible data increases in granularity and cross linkage.
Web applications and solutions produce structured interlinked data as extensions of existing functionality.
Web users are shielded from the underlying complexity of injecting structured linked data into the Web.

1.1.1 Integrating Databases to Research Nicotine Dependency

Complex biological queries generally require the integration of information from several sources. To understand the genetic basis of nicotine dependence, we needed to integrate gene and pathway information and answer three complex biological queries using the integrated knowledge base. The gene information source NCBI Entrez Gene, which has gene-related records for ~2 million genes, needed to be integrated with pathway information sources such as KEGG (Kyoto Encyclopedia of Genes and Genomes). Comparing results across model organisms requires the homology information provided by NCBI HomoloGene, which contains homology data for several completely sequenced eukaryotic organisms.

We used an ontology-driven approach to integrate the two gene resources (Entrez Gene and HomoloGene) and three pathway resources (KEGG, Reactome and BioCyc). We created the Entrez Knowledge Model (EKoM), an information model in OWL for the gene resources, and integrated it with the extant BioPAX ontology designed for pathway resources. The integrated schema was populated with data from the pathway resources, publicly available in BioPAX-compatible format, and from the gene resources, for which a population procedure was created.

SPARQL was used to formulate queries to investigate the genetic basis of nicotine dependence over the integrated knowledge base:

Which genes participate in a large number of pathways?
Which genes are "hub genes" from the perspective of gene interaction?
Which genes are expressed in the brain, in the context of neurobiology of nicotine dependence and various neurotransmitters in the central nervous system?

We found that the queries could easily identify hub genes, i.e., those genes whose gene products participate in many pathways or interact with many other gene products. See [NicotineDependence] for details.
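The first query above (genes that participate in many pathways) can be sketched over an in-memory list of triples. This is a plain Python illustration rather than the SPARQL actually used in the study; the predicate name ex:participatesIn, the gene and pathway identifiers, and the threshold are all hypothetical placeholders and carry no biological meaning.

```python
from collections import defaultdict

PARTICIPATES_IN = "ex:participatesIn"  # hypothetical predicate name


def pathways_per_gene(triples):
    """Count the distinct pathways each gene participates in."""
    pathways = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == PARTICIPATES_IN:
            pathways[subject].add(obj)
    return {gene: len(paths) for gene, paths in pathways.items()}


def hub_genes(triples, min_pathways):
    """Return genes taking part in at least min_pathways pathways, sorted by name."""
    counts = pathways_per_gene(triples)
    return sorted(gene for gene, n in counts.items() if n >= min_pathways)


# Toy data with placeholder identifiers.
toy_triples = [
    ("ex:geneA", PARTICIPATES_IN, "ex:pathway1"),
    ("ex:geneA", PARTICIPATES_IN, "ex:pathway2"),
    ("ex:geneA", PARTICIPATES_IN, "ex:pathway2"),  # duplicate, counted once
    ("ex:geneB", PARTICIPATES_IN, "ex:pathway1"),
    ("ex:geneB", "ex:expressedIn", "ex:brain"),
]
```

With this toy data, geneA is counted in two pathways and geneB in one, so a threshold of two selects only geneA.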

1.1.2 Triplify: Exposing Relational Data on the Web

In order to make the Semantic Web useful to ordinary Web users, RDF and OWL have to be deployed on the Web on a much larger scale. Web applications such as Content Management Systems, online shops or community applications (e.g. Wikis, Blogs, Fora) already store their data in relational databases [Triplify]. Providing a standardized way to map the relational data structures behind these Web applications into RDF, RDF-Schema and OWL will encourage broad adoption, enrich the Web with RDF data and ontologies, and enable novel semantic browsing and search applications.

By supporting the long tail of Web applications, and thus counteracting the centralization of Web 2.0 applications, the planned RDB2RDF standardization will help to give control over data back to end-users and thus promote a democratization of the Web.

To support this usecase scenario, the mapping language should be easily implementable for lightweight Web applications and have a shallow learning curve to foster early adoption by Web developers.
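To give a sense of how small a lightweight implementation could be, here is a hedged Python sketch that dumps one table of a SQLite database as N-Triples lines. It does not reproduce the configuration format or behaviour of any existing tool. The base URI, the URI patterns and the function names are assumptions made for the example, and every value is emitted as a plain string literal for simplicity.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical base URI


def escape_literal(text):
    """Escape backslash, quote and line breaks for an N-Triples string literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def table_to_ntriples(conn, table, key_column):
    """Emit one N-Triples line per non-NULL, non-key cell of the table.

    The table name is interpolated into the SQL text, so it must come
    from trusted configuration, never from user input.
    """
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [description[0] for description in cursor.description]
    lines = []
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"<{BASE}{table}/{record[key_column]}>"
        for column, value in record.items():
            if column == key_column or value is None:
                continue
            predicate = f"<{BASE}vocab/{table}#{column}>"
            lines.append(f'{subject} {predicate} "{escape_literal(str(value))}" .')
    return lines
```

A Web application could call table_to_ntriples on an open connection for, say, its table of blog posts and serve the resulting lines directly. A fuller mapping would also need typed literals, links between tables via foreign keys, and access control.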

1.2 Liaisons

The WG must track the evolution of SPARQL and liaise with the DAWG as well as the OWL WG. The proposed WG will also keep track of work on assigning unique identifiers to well-known entities, such as the ENS system associated with the OKKAM project [OKKAM] and the Common Naming Project started by Neuro Commons [Common Naming Project].

1.3 Starting Points

The WG will take as its starting point the mapping languages developed by the [D2RQ] and [Virtuoso] efforts.

2 References

Common Naming Project: Neuro Commons Common Naming Project, Science Commons, Sept 17, 2008. (See http://neurocommons.org/page/Common_Naming_Project.)
D2RQ: The D2RQ Platform v0.5.1, User Manual and Language Specification, Chris Bizer, Richard Cyganiak, Jorg Garbers, Oliver Maresch (See http://www4.wiwiss.fu-berlin.de/bizer/D2RQ/spec/.)
RIF: W3C Rule Interchange Format Working Group (See http://www.w3.org/2005/rules/wiki/RIF_Working_Group.)
LinkedData: Design Issues for Linked Data, Tim Berners-Lee (See http://www.w3.org/DesignIssues/LinkedData.html.)
StateOfArt: Mapping Relational Data to RDF and OWL: A Literature Survey, Satya Sahoo, Wolfgang Halb (See http://esw.w3.org/topic/Rdb2RdfXG/.)
OKKAM: An Entity Name System (ENS) for the Semantic Web, Paolo Bouquet, Heiko Stoermer, Barbara Bazzanella, January 2008. (See http://www.okkam.org/.)
Virtuoso: Virtuoso Open-Source Edition (See http://virtuoso.openlinksw.com/wiki/main/Main/.)
Triplify: Triplify - Lightweight Linked Data Publication from Relational Databases, Auer, Dietzold, Lehmann, Hellmann, Aumueller, submitted to WWW 2009. (See http://www.informatik.uni-leipzig.de/~auer/publication/triplify.pdf.)
NicotineDependence: An ontology-driven semantic mashup of gene and biological pathway information: Application to the domain of nicotine dependence, Satya S. Sahoo, Olivier Bodenreider, Joni L. Rutter, Karen J. Skinner and Amit P. Sheth (See http://dx.doi.org/10.1016/j.jbi.2008.02.006.)

01 January 2009