Re: Deliverables from the RDB2RDF XG

Hi Ivan:
You said ...

- Would you or one of your colleagues be ready to come (again:-) to one
of our next SW Coordination Group meeting to give a bit of a report and
discuss the possible followups? 

Where is the meeting?

All the best, Ashok

Ivan Herman wrote:
> Hi Ashok,
> First of all, thanks! I actually have two questions, neither of which is
> closely related to your original question (that Mauro already answered,
> I believe:-)
> - Would you or one of your colleagues be ready to come (again:-) to one
> of our next SW Coordination Group meeting to give a bit of a report and
> discuss the possible followups? The best date would be the 20th of
> February, Friday, at 16:00 Amsterdam time (I guess 10:00 Boston time)?
> - Did the group think of also preparing a rough draft charter for the
> group you propose? It would make things easier to discuss both
> internally and externally. There can be many empty slots in the charter
> but it would give an idea to move forward. It would also give a feeling
> on who would/could staff such a group.
> Thanks again!
> Cheers
> Ivan
> ashok malhotra wrote:
>> Ivan, Mauro:
>> As you know, the RDB2RDF XG is coming to a close.  We are planning two
>> deliverables and I thought I would run them by you for early comments.
>> 1. We have prepared a final report.  This is attached.  I am trying to
>> get permission to put it on the W3C site.
>> 2. We have prepared a State Of the Art Survey.  This is in the form of
>> extensions to the ESW  Wiki
>> or as a PDF file
>> <>.
>> both have the same content.  Is this format acceptable for an XG
>> deliverable?
>> ------------------------------------------------------------------------
>> W3C Incubator Report
>>   W3C RDB2RDF Incubator Group Report
>>     16 January 2009
>> This version:
>> Latest version:
>> XG_Report/RDB2RDF_XG
>>     <> 
>> Previous version:
>>     This is the first public version. 
>> Author:
>>     Ashok Malhotra (editor), Oracle
>> Copyright © 2008 W3C <>. All rights reserved. This
>> document is available under the W3C Document License
>> <>.
>> See the W3C Intellectual Rights Notice and Legal Disclaimers
>> <>
>> for additional information.
>> ------------------------------------------------------------------------
>>     Abstract
>> This is the final report from the RDB2RDF XG. The XG recommends that the
>> W3C initiate a WG to standardize a language for mapping Relational
>> Database schemas into RDF and OWL.
>>     Status of this Document
>> This section describes the status of this document at the time of its
>> publication. Other documents may supersede this document. A list of
>> current W3C publications can be found in the W3C technical reports index.
>> This is the final recommendation from the RDB2RDF XG.
>>     Table of Contents
>> 1 Recommendation <#recommendation>
>>     1.1 Usecases <#usecases>
>>         1.1.1 Integrating Databases to Research Nicotine Dependency
>> <#biomedical>
>>         1.1.2 Triplify: Exposing Relational Data on the Web <#Triplify>
>>         1.1.3 Integration of Enterprise Information Systems <#enterprise>
>>         1.1.4 Ordnance Survey Use Case <#ordnance>
>>     1.2 Liaisons <#liaisons>
>>     1.3 Starting Points <#IDA2UIP>
>> 2 References <#References>
>> ------------------------------------------------------------------------
>>     1 Recommendation
>> The RDB2RDF XG recommends that the W3C initiate a Working Group (WG) to
>> standardize a language for mapping Relational Database schemas into RDF
>> and OWL. Such a standard will enable the vast amounts of data stored in
>> Relational databases to be published easily and conveniently on the Web.
>> It will also facilitate integrating data from separate Relational
>> databases and adding semantics to Relational data.
>> This recommendation is based on a survey of the State Of the Art
>> conducted by the XG [StateOfArt] <#StateOfArt> as well as the usecases
>> discussed below.
>> The mapping language defined by the WG would facilitate the development
>> of several types of products. It could be used to translate Relational
>> data into RDF which could be stored in a triple store. This is sometimes
>> called Extract-Transform-Load (ETL). Or it could be used to generate a
>> virtual mapping that could be queried using SPARQL and the SPARQL
>> translated to SQL queries on the underlying Relational data. Other
>> products could be layered on top of these capabilities to query and
>> deliver data in different ways as well as to integrate the data with
>> other kinds of information on the Semantic Web.
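The Extract-Transform-Load path described in the paragraph above can be illustrated with a minimal sketch. It is not the mapping language the report proposes, only a naive row-to-triple dump, and the namespace `http://example.org/`, the `gene` table, and its sample row are all made up for illustration.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical namespace, not from the report


def rows_to_ntriples(conn, table, key):
    """Dump each row of `table` as N-Triples, one triple per non-NULL column."""
    cur = conn.execute(f'SELECT * FROM "{table}"')
    cols = [d[0] for d in cur.description]
    triples = []
    for row in cur:
        rec = dict(zip(cols, row))
        # The subject URI is built from the table name and the key value.
        subject = f"<{BASE}{table}/{rec[key]}>"
        for col, val in rec.items():
            if col == key or val is None:
                continue
            # Escape backslashes, quotes and newlines for an N-Triples literal.
            literal = (
                str(val)
                .replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
            )
            triples.append(f'{subject} <{BASE}{table}#{col}> "{literal}" .')
    return triples


# Made-up sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (id INTEGER PRIMARY KEY, symbol TEXT, organism TEXT)")
conn.execute("INSERT INTO gene VALUES (1, 'CHRNA4', 'Homo sapiens')")

triples = rows_to_ntriples(conn, "gene", "id")
for t in triples:
    print(t)
```

The resulting lines could be loaded into a triple store. The virtual-mapping alternative would instead keep the data in the relational database and rewrite each SPARQL query into SQL at query time.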
>> The mapping language should be complete with respect to
>> the relational algebra. It should have a human-readable syntax as well
>> as XML and RDF representations of the syntax for purposes of discovery
>> and machine generation.
>> There is a strong suggestion that the mapping language be expressed in
>> rules as defined by the W3C [RIF] <#RIF> WG. The syntax does not have to
>> follow the RIF syntax but there should be a round-trippable mapping between
>> the mapping language and a RIF dialect. The output of the mapping should be
>> defined in terms of an RDFS/OWL schema.
>> It should be possible to subset the language for simple applications
>> such as Web 2.0. This feature of the language will be validated by
>> creating a library of mappings for widely used apps such as Drupal,
>> Wordpress, phpBB.
>> The mapping language will allow customization with regard to names and
>> data transformation. In addition, the language must be able to expose
>> vendor specific SQL features such as full-text and spatial support and
>> vendor-defined datatypes.
>> The final language specification should include guidance with regard to
>> mapping Relational data to a subset of OWL such as OWL/QL or OWL/RL.
>> The language must allow for a mechanism to create identifiers for
>> database entities. The generation of identifiers should be designed to
>> support the implementation of the linked data principles [LinkedData]
>> <#LinkedData>. Where possible, the language will encourage the reuse of
>> public identifiers for long-lived entities such as persons,
>> corporations, geo-locations, etc. See *1.2 Liaisons* <#liaisons>.
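One possible reading of the identifier requirement above is a function that mints a stable, dereferenceable URI from a table name and its primary-key values. The sketch below assumes that scheme; the function name `mint_uri`, the base URI, and the path layout are hypothetical and are not taken from the report.

```python
from urllib.parse import quote


def mint_uri(base, table, key_values):
    """Build a URI of the form <base>/<table>/<key1>/<key2>/...

    Every path segment is percent-encoded so that key values containing
    spaces or slashes cannot change the structure of the URI.
    """
    parts = [quote(str(v), safe="") for v in key_values]
    return f"{base.rstrip('/')}/{quote(table, safe='')}/{'/'.join(parts)}"


# Single-column key.
print(mint_uri("http://example.org/id", "customer", [42]))
# Composite key whose second value contains a slash.
print(mint_uri("http://example.org/id/", "order line", [7, "A/1"]))
```

Because the URI depends only on the table name and the key, the same row always receives the same identifier, which is what makes it linkable from other data sets.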
>> The proposed Working Group will also create a set of test cases that
>> could be used to verify conformance.
>>       1.1 Usecases
>> To bootstrap exploitation of the Web as a globally accessible linked
>> database, we need a few essentials:
>>     * Web accessible data needs to increase in granularity and cross
>>       linkage.
>>     * Web applications and solutions must produce structured interlinked
>>       data as extensions of existing functionality.
>>     * Web users must be shielded from the underlying complexity of
>>       injecting structured linked data into the Web.
>>         1.1.1 Integrating Databases to Research Nicotine Dependency
>> Complex biological queries generally require the integration of
>> information from several sources. To understand the genetic basis of
>> nicotine dependence, gene and pathway information needed to be
>> integrated and three complex biological queries answered using the
>> integrated knowledge base. The gene information source NCBI Entrez Gene,
>> which has gene-related records of ~2 million genes, needed to be
>> integrated with pathway information sources, such as KEGG (Kyoto
>> Encyclopedia of Genes and Genomes). Comparing results across model
>> organisms required homology information provided by NCBI HomoloGene,
>> which contains homology data for several completely sequenced eukaryotic
>> organisms.
>> An ontology-driven approach was used to integrate the two gene resources
>> (Entrez Gene and HomoloGene) and the three pathway resources (KEGG,
>> Reactome and BioCyc). An OWL ontology called the Entrez Knowledge Model
>> (EKoM) was created for the gene resources and integrated with the extant
>> BioPAX ontology designed for pathway resources. The integrated schema
>> was populated with data from the pathway resources, publicly available
>> in BioPAX-compatible format, and gene resources for which a population
>> procedure was created.
>> SPARQL was used to formulate queries to investigate the genetic basis of
>> nicotine dependence over the integrated knowledge base:
>>     * Which genes participate in a large number of pathways?
>>     * Which genes are "hub genes" from the perspective of gene interaction?
>>     * Which genes are expressed in the brain, in the context of
>>       neurobiology of nicotine dependence and various neurotransmitters
>>       in the central nervous system?
>> The result was very successful. The queries could easily identify hub
>> genes, i.e., those genes whose gene products participate in many
>> pathways or interact with many other gene products. See
>> [NicotineDependence] <#> for details.
>>         1.1.2 Triplify: Exposing Relational Data on the Web
>> In order to make the Semantic Web useful to ordinary Web users, RDF and
>> OWL have to be deployed on the Web on a much larger scale. Web
>> applications such as Content Management Systems, online shops or
>> community applications (e.g. Wikis, Blogs, Fora) already store their
>> data in relational databases [Triplify] <#TriplifyPaper>. Providing a
>> standardized way to map the relational data structures behind these Web
>> applications into RDF, RDF-Schema and OWL will facilitate broad
>> penetration and enrich the Web with RDF data and ontologies and
>> facilitate novel semantic browsing and search applications.
>> By supporting the long tail of Web applications and thus counteracting
>> the centralization of the Web 2.0 applications the planned RDB2RDF
>> standardization will help to give control over data back to end-users
>> and thus promote a democratization of the Web.
>> To support this usecase scenario, the mapping language should be easily
>> implementable for lightweight Web applications and have a shallow
>> learning curve to foster early adoption by Web developers.
>>         1.1.3 Integration of Enterprise Information Systems
>> Efficient information and data exchange between application systems
>> within and across enterprises is of paramount importance in the
>> increasingly networked and IT-dominated business atmosphere. Existing
>> Enterprise Information Systems such as CRM, CMS and ERP systems use
>> Relational database backends for persistence. RDF and Linked Data can
>> provide data exchange and integration interfaces for such application
>> systems, which are easy to implement and use, especially in settings
>> where a loose and flexible coupling of the systems is required.
>> Insight can often be gained by integrating data from databases built for
>> different purposes in separate corporate silos. For example, integrating
>> data from a bug database with a customer database may help understand
>> ordering behavior as a function of the bugs encountered.
>> In Supply Chain Management (SCM), for example, it is vital to exchange
>> product catalogs and other goods related information within a network of
>> interconnected businesses involved in the ultimate provision of product
>> and service packages. Such information is stored in relational databases
>> and sometimes already exchanged electronically, but a variety of
>> different technologies are used (e.g. proprietary files, XML files, DB
>> dumps, Web Services etc.). Realizing a completely electronic information
>> flow requires significant initial investments and currently limits the
>> flexibility of businesses (e.g. with regard to changes in business
>> partners). The envisioned RDB2RDF mapping language applied in
>> conjunction with existing RDB based SCM systems will support the use of
>> RDF and unique identifiers for realizing flexible information
>> flows accompanying supply chains.
>> The mapping language to be standardized by the proposed WG will simplify
>> the publishing of enterprise data and information from Relational data
>> backends and, thus, facilitate the interlinking and exchange of
>> information between business information systems. In this scenario
>> on-demand transformation of relational data to RDF, scalability and
>> completeness with regard to the relational algebra are central
>> requirements.
>>         1.1.4 Ordnance Survey Use Case
>> Ordnance Survey, the National mapping agency of the UK, operates a very
>> large geographical information system based on Oracle Spatial. The
>> database contains topographical features, soil type and land use
>> information. All these types of information are independently maintained
>> and use separate terminologies. They describe the same land area but the
>> boundaries of objects utilized for representing land use and soil type
>> and topography do not coincide: For example, a pasture might consist of
>> two distinct types of soil.
>> An example of a need to integrate this information is modeling
>> filtration of pollutants into water bodies from agricultural land. The
>> soil type determines the degree of filtration, the land use determines
>> the type of pollutant. Topography determines whether the field is next
>> to a water body.
>> An ontology exists for describing the types of objects in each database.
>> The benefit from mapping the data to RDF is in simplifying querying and
>> integration of the data. The very high volume of data makes an ETL
>> approach impracticable. Besides, the Oracle Spatial database offers
>> spatial joining, which is generally not available on RDF stores.
>> Thus, it is necessary to take SPARQL queries expressed in terms of the
>> land use, soil type and topography ontologies and convert them into
>> single SQL statements, with all joining and filtering to take place at
>> the relational database. In the process, high level concepts need to be
>> translated into SQL conditions on data that is not readily human readable.
>> Business questions to be answered by the use case are for example:
>>     * What is the total length of river bank bordered by permeable soil
>>       used for grazing along a certain river?
>>     * What types of crops are being cultivated within 100m of water,
>>       with total land use grouped by crop?
>>     * What water bodies are subject to high environmental load from
>>       agriculture, as defined by little current and extensive use of
>>       adjacent land?
>> From the viewpoint of RDB to RDF mapping, this usecase highlights the
>> need to integrate data from different databases, built for different
>> purposes. It also emphasizes the need for extensibility in the mapping
>> language for supporting RDBMS vendor specific features. In the present
>> case, Oracle expresses a spatial join using a special type of derived
>> table not found in standard SQL, thus the customization need is deeper
>> than just supporting calls to native SQL functions.
>> The inference requirement consists primarily of expanding class
>> membership into and's and or's of conditions on the relational data. In
>> some cases, these conditions are spatial, such as bordering on or
>> contained in. The user should be familiar with the ontologies but should
>> not have to know about the classification codes used in the databases.
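The expansion of class membership into and's and or's of relational conditions, as described above, can be sketched as a small recursive compiler. Everything specific here is hypothetical: the class name, the `parcel` table, and the classification codes such as `LU07` and `S12` are invented for illustration, and only plain equality conditions are handled (no spatial predicates).

```python
import sqlite3

# Hypothetical class definition: a tree of ("and" | "or", [children])
# nodes with ("eq", column, value) leaves.
CLASSES = {
    "GrazingOnPermeableSoil": ("and", [
        ("eq", "land_use_code", "LU07"),
        ("or", [("eq", "soil_code", "S12"), ("eq", "soil_code", "S15")]),
    ]),
}


def to_sql(cond, params):
    """Compile a condition tree to a SQL fragment, collecting bind parameters."""
    op = cond[0]
    if op == "eq":
        _, column, value = cond
        params.append(value)
        return f'"{column}" = ?'
    if op not in ("and", "or"):
        raise ValueError(f"unknown operator: {op}")
    joined = f" {op.upper()} ".join(to_sql(child, params) for child in cond[1])
    return f"({joined})"


def select_members(conn, table, class_name):
    """Return the ids of rows in `table` that belong to the named class."""
    params = []
    where = to_sql(CLASSES[class_name], params)
    sql = f'SELECT id FROM "{table}" WHERE {where} ORDER BY id'
    return [row[0] for row in conn.execute(sql, params)]


# Made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parcel (id INTEGER, land_use_code TEXT, soil_code TEXT)")
conn.executemany(
    "INSERT INTO parcel VALUES (?, ?, ?)",
    [(1, "LU07", "S12"), (2, "LU07", "S99"), (3, "LU01", "S15"), (4, "LU07", "S15")],
)

members = select_members(conn, "parcel", "GrazingOnPermeableSoil")
print(members)
```

The user only names the class; the codes stay inside the class definition, and all filtering happens in a single SQL statement at the relational database.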
>>       1.2 Liaisons
>> The WG must track the evolution of SPARQL and liaise with the DAWG WG as
>> well as the OWL WG. The proposed WG will also keep track of work on
>> assigning unique identifiers to well-known entities such as the ENS
>> system associated with the OKKAM project [OKKAM] <#okkam> and the Common
>> Naming Project started by Neuro Commons [Common Naming Project]
>> <#CommonNaming>.
>>       1.3 Starting Points
>> The WG will take as its starting point the mapping languages developed
>> by the [D2RQ] <#D2RQ> and [Virtuoso] <#Virtuoso> efforts.
>>     2 References
>> Common Naming Project
>>     Neuro Commons Common Naming Project
>>     <>, Science
>>     Commons, Sept 17, 2008. (See
>> D2RQ
>>     The D2RQ Platform v0.5.1, User Manual and Language Specification
>>     <>, Chris Bizer,
>>     Richard Cyganiak, Jorg Garbers, Oliver Maresch (See
>> RIF
>>     W3C Rule Interchange Format Working Group
>>     <> (See
>> LinkedData
>>     Design Issues for Linked Data
>>     <>, Tim Berners-Lee
>>     (See
>> StateOfArt
>>     Mapping Relational Data to RDF and OWL: A Literature Survey
>>     <>, Satya Sahoo, Wolfgang Halb
>>     (See
>> OKKAM
>>     An Entity Name System (ENS) for the Semantic Web
>>     <>, Paolo Bouquet, Heiko Stoermer, Barbara
>>     Bazzanella, January 2008. (See
>> Virtuoso
>>     Virtuoso Open-Source Edition
>>     <> (See
>> Triplify
>>     Triplify - Lightweight Linked Data Publication from Relational
>>     Databases, submitted to WWW 2009
>>     <>, Auer,
>>     Dietzold, Lehmann, Hellmann, Aumueller (See
>> NicotineDependence
>>     An ontology-driven semantic mashup of gene and biological pathway
>>     information: Application to the domain of nicotine dependence
>>     <>, Satya S. Sahoo,
>>     Olivier Bodenreider, Joni L. Rutter, Karen J. Skinner and Amit P.
>>     Sheth (See .)

Received on Monday, 26 January 2009 15:59:46 UTC