Re: Revised version of final XR Report from Michael Hausenblas on 2009-01-13 (public-xg-rdb2rdf@w3.org from January 2009)

From: Michael Hausenblas <michael.hausenblas@deri.org>
Date: Tue, 13 Jan 2009 17:10:26 +0000
To: <ashok.malhotra@oracle.com>
CC: public-xg-rdb2rdf <public-xg-rdb2rdf@w3.org>
Message-ID: <C5927B82.FB5%michael.hausenblas@deri.org>
All,

FYI: I've now put our XGR at the correct location [1].

Cheers,
Michael


[1] http://www.w3.org/2005/Incubator/rdb2rdf/XGR/

-- 
Dr. Michael Hausenblas
DERI - Digital Enterprise Research Institute
National University of Ireland, Lower Dangan,
Galway, Ireland, Europe
Tel. +353 91 495730
http://sw-app.org/about.html


> From: ashok malhotra <ashok.malhotra@oracle.com>
> Organization: Oracle
> Reply-To: <ashok.malhotra@oracle.com>
> Date: Mon, 12 Jan 2009 13:20:33 -0800
> To: public-xg-rdb2rdf <public-xg-rdb2rdf@w3.org>
> Subject: Revised version of final XR Report
> Resent-From: <public-xg-rdb2rdf@w3.org>
> Resent-Date: Mon, 12 Jan 2009 21:22:10 +0000
> 
> See attached.
> This has two additional usecases and some other changes as discussed on
> last telcon.
> 
> Thanks to Michael Hausenblas for cleaning up the XML source.
> -- 
> All the best, Ashok
>  <http://www.w3.org/>  <http://www.w3.org/2005/Incubator/XGR/>
> W3C RDB2RDF Incubator Group Report
>  12 January 2009
> This version: http://www.w3.org/XG_Report/2009/RDB2RDF_XG-20090112
> Latest version: http://www.w3.org/XG_Report/RDB2RDF_XG
> Author: Ashok Malhotra (editor), Oracle
> Copyright © 2008 W3C <http://www.w3c.org>. All rights reserved. This
> document is available under the W3C Document License
> <http://www.w3.org/Consortium/Legal/2002/copyright-documents-20021231>. See
> the W3C Intellectual Rights Notice and Legal Disclaimers
> <http://www.w3.org/Consortium/Legal/2002/ipr-notice-20021231#Copyright> for
> additional information.
> 
> Abstract
>  This is the final report from the RDB2RDF XG. The XG recommends that the W3C
> initiate a WG to standardize a language for mapping Relational Database
> schemas into RDF and OWL.
> 
> Status of this Document
>  This section describes the status of this document at the time of its
> publication. Other documents may supersede this document. A list of current
> W3C publications can be found in the  W3C technical reports index
> <http://www.w3.org/TR/>  at http://www.w3.org/TR/.
> 
> This is the final recommendation from the RDB2RDF XG.
> 
> Table of Contents
> 1 Recommendation <#recommendation>
>     1.1 Usecases  <#usecases>
>         1.1.1 Integrating Databases to Research Nicotine Dependency
> <#biomedical> 
>         1.1.2 Triplify: Exposing Relational Data on the Web <#triplify>
>         1.1.3 Integration of Enterprise Information Systems <#enterprise>
>         1.1.4 Ordnance Survey Use Case <#ordnance>
>     1.2 Liaisons <#liaisons>
>     1.3 Starting Points <#IDA5UIP>
> 2 References <#References>
> 
> 1 Recommendation
>  The RDB2RDF XG recommends that the W3C initiate a Working Group (WG) to
> standardize a language for mapping Relational Database schemas into RDF and
> OWL.  Such a standard will enable the vast amounts of data stored in
> Relational databases to be published easily and conveniently on the Web.  It
> will also facilitate integrating data from separate Relational databases and
> adding semantics to Relational data.
> 
> This recommendation is based on a survey of the state of the art conducted
> by the XG [StateOfArt] <#StateOfArt>  as well as the usecases discussed below.
> 
> The mapping language defined by the WG would facilitate the development of
> several types of products.  It could be used to translate Relational data into
> RDF which could be stored in a triple store.  This is sometimes called
> Extract-Transform-Load (ETL). Or it could be used to generate a virtual
> mapping that could be queried using SPARQL and the SPARQL translated to SQL
> queries on the underlying Relational data.  Other products could be layered on
> top of these capabilities to query and deliver data in different ways as well
> as to integrate the data with other kinds of information on the Semantic Web.
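
A minimal sketch of the ETL route described above, assuming an in-memory SQLite table and a naive "one triple per column value" mapping. The base URI, the `customer` table and its rows are invented for illustration; none of this is the mapping language the report proposes.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical base URI, not from the report


def literal(value):
    """Serialize a value as a plain N-Triples literal with basic escaping."""
    text = str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{text}"'


def table_to_ntriples(conn, table, key):
    """Yield one N-Triples line per non-NULL, non-key column value.

    `table` and `key` are interpolated into the SQL, so they must come
    from trusted configuration, never from user input.
    """
    cur = conn.execute(f'SELECT * FROM "{table}" ORDER BY "{key}"')
    columns = [d[0] for d in cur.description]
    for row in cur:
        record = dict(zip(columns, row))
        subject = f"<{BASE}{table}/{record[key]}>"
        for column, value in record.items():
            if column == key or value is None:
                continue
            yield f"{subject} <{BASE}{table}#{column}> {literal(value)} ."


# Invented sample data, only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Alice", "Galway"), (2, "Bob", None)],
)

triples = list(table_to_ntriples(conn, "customer", "id"))
for line in triples:
    print(line)
```

The resulting lines could be loaded into a triple store. The virtual-mapping route would instead keep the data in place and rewrite each SPARQL query into SQL at query time.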
> 
> The mapping language should be complete with regard to the relational
> algebra.  It should have a human-readable syntax as well as XML and
> RDF representations of the syntax for purposes of discovery and machine
> generation.
> 
> There is a strong suggestion that the mapping language be expressed in rules
> as defined by the W3C [RIF] <#RIF>  WG.  The syntax does not have to follow
> the [RIF] <#RIF>  syntax but should be isomorphic to it. The output of the
> mapping should be defined in terms of an RDFS/OWL schema.
> 
>  It should be possible to subset the language for simple applications such as
> Web 2.0 applications. This feature of the language will be validated by
> creating a library of mappings for widely used applications such as Drupal,
> WordPress and phpBB.
> 
>  The mapping language will allow customization with regard to names and data
> transformation.  In addition, the language must be able to expose
> vendor-specific SQL features such as full-text and spatial support and
> vendor-defined datatypes.
> 
>  The final language specification should include guidance with regard to
> mapping Relational data to a subset of OWL such as OWL/QL or OWL/RL.
> 
>  The language must allow for a mechanism to create identifiers for database
> entities. The generation of identifiers should be designed to support the
> implementation of the linked data principles [LinkedData] <#LinkedData> .
> Where possible, the language will encourage the reuse of public identifiers
> for long-lived entities such as persons, corporations, geo-locations, etc.
> See 1.2 Liaisons <#liaisons> .
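
One possible shape for such an identifier mechanism, sketched under assumptions: URIs are minted from the table name and primary-key values, and a lookup table lets a row reuse an existing public identifier where one is known. The base URI, the `PUBLIC_IDS` table and both function names are hypothetical.

```python
from urllib.parse import quote

BASE = "http://example.org/data"  # hypothetical base URI

# Hypothetical table of already-published identifiers to reuse.
PUBLIC_IDS = {
    ("company", ("ACME",)): "http://example.org/public/company/acme",
}


def mint_uri(table, key_values, base=BASE):
    """Build a URI from the table name and the primary-key values."""
    # Percent-encode every path segment so keys containing "/" or
    # spaces cannot change the structure of the URI.
    parts = [quote(str(v), safe="") for v in key_values]
    return "/".join([base.rstrip("/"), quote(table, safe="")] + parts)


def identifier_for(table, key_values):
    """Prefer a known public identifier, otherwise mint a local one."""
    return PUBLIC_IDS.get((table, tuple(key_values))) or mint_uri(table, key_values)


print(identifier_for("person", [42]))
print(identifier_for("order_line", [7, "a/b c"]))
print(identifier_for("company", ["ACME"]))
```

Keeping the key values in the URI path makes each identifier stable across exports, which is what dereferenceable linked data needs.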
> 
>  The proposed Working Group will also create a set of test cases that could be
> used to verify conformance.
> 
> 1.1 Usecases 
> To bootstrap exploitation of the Web as a globally accessible linked database,
> we need a few essentials:
> * Web accessible data needs to increase in granularity and cross linkage.
> * Web applications and solutions must produce structured interlinked data as
> extensions of existing functionality.
> * Web users must be shielded from the underlying complexity of injecting
> structured linked data into the Web.
> 1.1.1 Integrating Databases to Research Nicotine Dependency
>  Complex biological queries generally require the integration of information
> from several sources. To understand the genetic basis of nicotine dependence,
> gene and pathway information needed to be integrated and three complex
> biological queries answered using the integrated knowledge base.  The gene
> information source NCBI Entrez Gene, which has gene-related records for ~2
> million genes, needed to be integrated with pathway information sources such
> as KEGG (Kyoto Encyclopedia of Genes and Genomes). Comparing results across
> model organisms required homology information provided by NCBI HomoloGene,
> which contains homology data for several completely sequenced eukaryotic
> organisms.
> 
>  An ontology-driven approach was used to integrate the two gene resources
> (Entrez Gene and HomoloGene) and the three pathway resources (KEGG, Reactome
> and BioCyc). An OWL ontology called the Entrez Knowledge Model (EKoM) was
> created for the gene resources and integrated with the extant BioPAX ontology
> designed for pathway resources. The integrated schema was populated with data
> from the pathway resources, publicly available in BioPAX-compatible format,
> and gene resources for which a population procedure was created.
> 
>  SPARQL was used to formulate queries to investigate the genetic basis of
> nicotine dependence over the integrated knowledge base:
> * Which genes participate in a large number of pathways?
> * Which genes are "hub genes" from the perspective of gene interaction?
> * Which genes are expressed in the brain, in the context of neurobiology of
> nicotine dependence and various neurotransmitters in the central nervous
> system?
>  The approach was successful: the queries could easily identify hub genes,
> i.e., those genes whose gene products participate in many pathways or interact
> with many other gene products. See [NicotineDependence] <#>  for details.
> 
> 1.1.2 Triplify: Exposing Relational Data on the Web
> In order to make the Semantic Web useful to ordinary Web users, RDF and OWL
> have to be deployed on the Web on a much larger scale. Web applications such
> as Content Management Systems, online shops or community applications (e.g.
> Wikis, Blogs, Fora) already store their data in relational databases
> [triplify] <#triplify> . Providing a standardized way to map the relational
> data structures behind these Web applications into RDF, RDF-Schema and OWL
> will facilitate broad adoption, enrich the Web with RDF data and
> ontologies, and enable novel semantic browsing and search applications.
> 
> By supporting the long tail of Web applications, and thus counteracting the
> centralization of Web 2.0 applications, the planned RDB2RDF standardization
> will help to give control over data back to end-users and thus promote a
> democratization of the Web.
> 
> To support this usecase scenario, the mapping language should be easily
> implementable for lightweight Web applications and have a shallow learning
> curve to foster early adoption by Web developers.
> 
> 1.1.3 Integration of Enterprise Information Systems
>  Efficient information and data exchange between application systems within
> and across enterprises is of paramount importance in the increasingly
> networked and IT-dominated business atmosphere. Existing Enterprise
> Information Systems such as CRM, CMS and ERP systems use Relational database
> backends for persistence. RDF and Linked Data can provide data exchange and
> integration interfaces for such application systems, which are easy to
> implement and use, especially in settings where a loose and flexible coupling
> of the systems is required.
> 
> Insight can often be gained by integrating data from databases built for
> different purposes in separate corporate silos.  For example, integrating data
> from a bug database with a customer database may help understand ordering
> behavior as a function of the bugs encountered.
> 
>  In Supply Chain Management (SCM), for example, it is vital to exchange
> product catalogs and other goods related information within a network of
> interconnected businesses involved in the ultimate provision of product and
> service packages. Such information is stored in relational databases and
> sometimes already exchanged electronically, but a variety of different
> technologies are used (e.g. proprietary files, XML files, DB dumps, Web
> Services etc.). Realizing a completely electronic information flow requires
> significant initial investments and currently limits the flexibility of
> businesses (e.g. with regard to changes in business partners). The envisioned
> RDB2RDF mapping language applied in conjunction with existing RDB-based SCM
> systems will support the use of RDF and unique identifiers for realizing
> flexible information flows accompanying supply chains.
> 
>  The mapping language to be standardized by the proposed WG will simplify the
> publishing of enterprise data and information from Relational data backends
> and, thus, facilitate the interlinking and exchange of information between
> business information systems. In this scenario, on-demand transformation of
> relational data to RDF, scalability and completeness with regard to the
> relational algebra are central requirements.
> 
> 1.1.4 Ordnance Survey Use Case
>  Ordnance Survey, the national mapping agency of the UK, operates a very large
> geographical information system based on Oracle Spatial. The database contains
> topographical features, soil type and land use information.  All these types
> of information are independently maintained and use separate terminologies.
> They describe the same land area, but the boundaries of the objects used to
> represent land use, soil type and topography do not coincide.  For
> example, a pasture might consist of two distinct types of soil.
> 
> An example of a need to integrate this information is modeling filtration of
> pollutants into water bodies from agricultural land. The soil type determines
> the degree of filtration, the land use determines the type of pollutant.
> Topography determines whether the field is next to a water body.
> 
> An ontology exists for describing the types of objects in each database. The
> benefit from mapping the data to RDF is in simplifying querying and
> integration of the data.  The very high volume of data makes an ETL approach
> impracticable.  Besides, the Oracle Spatial database offers spatial joins,
> which are generally not available in RDF stores.
> 
>  Thus, it is necessary to take SPARQL queries expressed in terms of the land
> use, soil type and topography ontologies and convert them into single SQL
> statements, with all joining and filtering to take place at the relational
> database.  In the process, high level concepts need to be translated into SQL
> conditions on data that is not readily human readable.
> 
>  Business questions to be answered by the use case include, for example:
> * What is the total length of river bank bordered by permeable soil used for
> grazing along a certain river?
> * What types of crops are being cultivated within 100m of water, with total
> land use grouped by crop?
> * What water bodies are subject to high environmental load from agriculture,
> as defined by little current and extensive use of adjacent land?
>  From the viewpoint of RDB to RDF mapping, this usecase highlights the need to
> integrate data from different databases, built for different purposes.  It
> also emphasizes the need for extensibility in the mapping language to support
> RDBMS vendor-specific features.  In the present case, Oracle expresses a
> spatial join using a special type of derived table not found in standard SQL,
> so the customization need is deeper than just supporting calls to native SQL
> functions.
> 
>  The inference requirement consists primarily of expanding class membership
> into ANDs and ORs of conditions on the relational data.  In some cases,
> these conditions are spatial, such as bordering on or contained in.  The user
> should be familiar with the ontologies but should not have to know about the
> classification codes used in the databases.
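
The expansion of class membership into conditions on relational data could look roughly like the sketch below. It assumes a hypothetical `parcel` table with invented classification codes and uses SQLite in place of Oracle Spatial, so spatial conditions such as "bordering on" are not covered.

```python
import sqlite3

# Hypothetical mapping from ontology classes to SQL conditions on
# classification codes; the codes themselves are invented.
CLASS_CONDITIONS = {
    "PermeableSoil": ("soil_code IN (?, ?)", ["S1", "S2"]),
    "GrazingLand": ("use_code = ?", ["G"]),
}


def sql_for_classes(classes):
    """Build one SELECT whose WHERE clause ANDs the conditions of all classes."""
    clauses, params = [], []
    for name in classes:
        clause, values = CLASS_CONDITIONS[name]
        clauses.append(f"({clause})")
        params.extend(values)
    sql = "SELECT id FROM parcel WHERE " + " AND ".join(clauses) + " ORDER BY id"
    return sql, params


# Invented sample data, only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parcel (id INTEGER PRIMARY KEY, soil_code TEXT, use_code TEXT)")
conn.executemany(
    "INSERT INTO parcel VALUES (?, ?, ?)",
    [(1, "S1", "G"), (2, "S3", "G"), (3, "S2", "A"), (4, "S2", "G")],
)

sql, params = sql_for_classes(["PermeableSoil", "GrazingLand"])
matches = [row[0] for row in conn.execute(sql, params)]
print(sql)
print(matches)
```

The user names only the ontology classes, the codes stay inside the mapping, and all filtering happens in a single SQL statement at the database.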
> 
> 1.2 Liaisons
>  The WG must track the evolution of SPARQL and liaise with the DAWG as well
> as the OWL WG.  The proposed WG will also keep track of work on assigning
> unique identifiers to well-known entities, such as the ENS associated
> with the OKKAM project [OKKAM] <#okkam>  and the Common Naming Project started
> by Neuro Commons [Common Naming Project] <#CommonNaming> .
> 
> 1.3 Starting Points
>  The WG will take as its starting point the mapping languages developed by the
> [D2RQ] <#D2RQ>  and [Virtuoso] <#Virtuoso>  efforts.
> 
> 2 References
> [Common Naming Project] Neuro Commons Common Naming Project
> <http://neurocommons.org/page/Common_Naming_Project> , Science Commons,
> Sept 17, 2008.
> 
> [D2RQ] The D2RQ Platform v0.5.1, User Manual and Language Specification
> <http://www4.wiwiss.fu-berlin.de/bizer/D2RQ/spec/> , Chris Bizer, Richard
> Cyganiak, Jorg Garbers, Oliver Maresch.
> 
> [RIF] W3C Rule Interchange Format Working Group
> <http://www.w3.org/2005/rules/wiki/RIF_Working_Group> .
> 
> [LinkedData] Design Issues for Linked Data
> <http://www.w3.org/DesignIssues/LinkedData.html> , Tim Berners-Lee.
> 
> [StateOfArt] Mapping Relational Data to RDF and OWL: A Literature Survey
> <http://esw.w3.org/topic/Rdb2RdfXG/> , Satya Sahoo, Wolfgang Halb.
> 
> [OKKAM] An Entity Name System (ENS) for the Semantic Web
> <http://www.okkam.org/> , Paolo Bouquet, Heiko Stoermer, Barbara
> Bazzanella, January 2008.
> 
> [Virtuoso] Virtuoso Open-Source Edition
> <http://virtuoso.openlinksw.com/wiki/main/Main/> .
> 
> [Triplify] Triplify - Lightweight Linked Data Publication from Relational
> Databases
> <http://www.informatik.uni-leipzig.de/~auer/publication/triplify.pdf> ,
> Auer, Dietzold, Lehmann, Hellmann, Aumueller, submitted to WWW 2009.
> 
> [NicotineDependence] An ontology-driven semantic mashup of gene and
> biological pathway information: Application to the domain of nicotine
> dependence <http://dx.doi.org/10.1016/j.jbi.2008.02.006> , Satya S. Sahoo,
> Olivier Bodenreider, Joni L. Rutter, Karen J. Skinner and Amit P. Sheth.
Received on Tuesday, 13 January 2009 17:11:09 UTC