Comments on the SQL-based proposal from Richard Cyganiak on 2010-08-03 (public-rdb2rdf-wg@w3.org from August 2010)

From: Richard Cyganiak <richard@cyganiak.de>
Date: Tue, 3 Aug 2010 22:48:00 +0100
To: RDB2RDF WG <public-rdb2rdf-wg@w3.org>
Cc: Souripriya Das <souripriya.das@oracle.com>
Message-Id: <DE7FDB7B-226F-4884-B905-851F027DACC8@cyganiak.de>
All,

Here is a collection of comments on the mapping language proposal that  
Souri presented in today's call. It's great that we now have a  
concrete proposal on the table that we can iteratively refine. I refer  
to the XML schema and example instances here:

http://www.w3.org/2001/sw/rdb2rdf/wiki/An_XML_Schema_for_SQL-based_RDB2RDF_Mapping
http://www.w3.org/2001/sw/rdb2rdf/wiki/Revised_Example_of_SQL-Query_based_Approach

I have also previously studied Souri's presentation from the F2F,  
which presented a previous version of the same idea in an abstract  
syntax:

http://www.w3.org/2001/sw/rdb2rdf/wiki/File:RDB2RDF_Map_viaSQL_F2F_at_SemTech2010.pdf

I grouped my comments into three categories:

1. Non-functional improvements, that is, issues where I believe that  
the same effect could be achieved in a better way;

2. Proposals for adding expressivity, to make sure that the language  
covers all of RDF and is suitable for some real-world mapping tasks  
that currently are impossible or very hard to achieve;

3. Proposals related to authoring convenience, that is, cases where  
maintainability and usability can be improved by adding some syntactic  
sugar or other optional features.

As a further comment, I don't believe that XML is the right choice for  
the surface syntax of this language, and I think that the mapping  
itself should be expressed as an RDF graph. But that is a separate  
discussion, and I will for now assume that an RDF representation of  
the mapping itself could be derived easily at a later time.

Cheers,
Richard



1. Non-functional improvements
==============================

1.1. Don't enclose generated URIs in pointy brackets
----------------------------------------------------
The proposal demands the generation of URIs enclosed in pointy  
brackets, for example:

SELECT '<xyz.com/emp/' || empno || '>' AS empURI

This is unnecessary, the following works just as well:

SELECT 'xyz.com/emp/' || empno AS empURI


1.2. Name of InstURIPropertyMap
-------------------------------
The element defines instance URIs, but has nothing to do with  
properties, it should be renamed to reflect that.


1.3. Issues with GraphURIPropertyMap
------------------------------------
It appears that GraphURIPropertyName is unnecessary; just providing a  
way of generating a graph name URI is sufficient for placing triples  
into an appropriate graph. What is the use of this property?

To selectively place different properties of a ClassMap in different  
graphs, the proposal allows referencing of the PropertyMaps from the  
GraphURIPropertyMap by specifying the RDF property name used in the  
PropertyMap. This only works if we assume that RDF properties are  
unique within a ClassMap. That's an unnecessary restriction. A cleaner  
design would be to invert the direction of the reference, pointing  
from the PropertyMap to the GraphURIPropertyMap that specifies where  
the generated triple should reside.


1.4. Issues with KeyMaps
------------------------
It appears that the proposal requires specification of all key  
constraints in the mapping language. Often this will duplicate  
information that can be determined automatically from the database  
schema. This should be avoided if at all possible.

It appears that the proposal works by expressing constraints (PK,  
unique, FK) on RDF properties. It would be better to express them as  
constraints on the columns in the result sets generated by the views/ 
queries. This is because the constraint-carrying column may actually  
not be exposed directly in the generated RDF. For example, the PK  
column of a table is often not mapped to a property, but becomes part  
of the subject URI. Thus, in the emp table, the PK constraint should  
not be expressed on the emp:empno property, but on the empno (or  
empURI) column in the view.


2. Adding expressivity
======================

2.1. Allow blank nodes as subjects
----------------------------------
The proposal only allows URIs in the subject positions of triples. RDF  
also allows blank nodes. This could be addressed by allowing an  
InstBNodeMap with one or more DB columns in place of the  
InstURIPropertyMap.


2.2. Allow language tags for literals
-------------------------------------
The proposal does not allow the specification of language tags on  
literals. This could be addressed by adding an attribute to the  
PropertyMap. It is worth exploring wether computed language tags,  
whose values come from a view column rather than from a constant  
attribute value, should be supported as well.


2.3. Allow custom datatypes for literals
----------------------------------------
The proposal doesn't yet fully describe how datatypes would be  
handled, but the intention seems to be that the datatype is determined  
by the expression that defines the view column. There language should  
support overriding the datatype with an arbitrary datatype URI. It  
would be the mapping author's responsibility to ensure that the view  
column contains values that are legal lexical forms for that datatype.  
The datatype could be specified in an attribute of the PropertyMap. It  
is worth exploring wether computed datatypes should be allowed.


2.4. Computed property URIs
---------------------------
It is worth exploring wether computed property URIs should be allowed.  
In addition to the ProperyName attribute, which provides a constant  
property URI, there could be a PropertyColumnName that refers to a  
view column which contains computed property URIs. A use case is a  
table with an entity-attribute-value schema, which is reasonably  
common in complex database schemas.


2.5. Lookup tables
------------------
The proposal doesn't support the use of lookup tables. Often, a  
database column contains a short alphanumeric code whose meaning is  
described in some documentation outside of the database. If this is to  
be mapped into “nice” RDF, with proper meaningful URIs instead cryptic  
code, then a lookup table has to be used. If that table is not part of  
the database, and we cannot provide it as part of the database, then  
the lookup table has to be provided as part of the mapping. A sketch:

<LookupTable id="expense-types">
   <Entry code="04" value="http://example.org/expense-types/airfare" />
   <Entry code="05" value="http://example.org/expense-types/bus-train-ferry 
" />
   <Entry code="24" value="http://example.org/expense-types/ 
subsistence" />
   <Entry code="53" value="http://example.org/expense-types/ 
hospitality" />
</LookupTable>

<ClassMap ...>
    <SQLdefString>SELECT ... empno ...</SQLdefString>
</ClassMap>

<PropertyMap DBColumnName="exptype" LookupTable="expense-types"  
PropertyName="ex:expenseType" />

In theory, this can already been done by writing a lot of nested IFF  
statements in the view definition, but it appears to be a reasonably  
common case that warrants special syntax.


3. Usability and authoring considerations
=========================================

3.1. Support for URI templates
------------------------------
While it is possible to construct pretty much arbitrary URIs in SQL by  
means of concatenation and string functions, SQL is not always the  
best approach. There should be a way of constructing URIs from view  
columns outside of the SQL query. Reasons include: i) The same URI- 
generating expression is often needed multiple times throughout a  
mapping, which leads to redundancy and imposes maintenance costs. ii)  
URI generation often involves string escaping, case changes etc, which  
quickly leads to unwieldy SQL expressions. iii) Standard SQL functions  
may not cover all needs for URL character escaping. A proposal would  
be to use a syntax inspired by the URI Templates draft [1] to declare  
URI patterns at the beginning of the file. The patterns can then be  
referenced from throughout the file. This would be optional; SQL  
expressions could still be used for URI construction.

[1] http://www.ietf.org/id/draft-gregorio-uritemplate-04.txt

Example:

<URITemplate id="employee" value="http://example.com/emp/{empno}" />

<ClassMap ...>
    <SQLdefString>SELECT ... empno ...</SQLdefString>
    <InstURIPropertyMap template="employee" />
    ...
</ClassMap>


3.2. Multiple maps for a single class
-------------------------------------
I believe that the proposal assumes that the value of @ClassName on  
the ClassMap element is unique throughout In many schemas there might  
be multiple tables whose records represent instances of the same  
domain vocabulary class. For example, a mapping that generates contact  
information from a university database might map records in the STAFF  
table and in the STUDENT table to the same class foaf:Person. This  
*can* be achieved with a single ClassMap, by forming a SQL UNION of  
the two tables after appropriate transformations. But this can be  
awkward because the two tables could have quite different columns,  
e.g., staff have OFFICE_NO and CONSULTATION_HOURS, while students have  
MAJOR and FACEBOOK_ID. For authoring convenience it should be allowed  
to specify two separate ClassMaps, rather than shoehorning everything  
into a single view. (At that point, ClassMap should probably be  
renamed to ViewMap or QueryMap or something similar, because it then  
represents the mapping from a single SQL view/query, and no longer the  
mapping to a single class.)

I note that both 3.1 and 3.2 are helpful to support scenarios where an  
ontologist prepares the mapping, and a DBA fills in the SQL queries.


3.3. Referring to a table instead of a query/view
-------------------------------------------------
To make the simple case simple, there should be a <tableName> element  
that could be used in place of <SQLdefString>. The following markup

    <tableName>EMP</tableName>

would be a shorthand for the following:

    <SQLdefString>SELECT * FROM EMP</SQLdefString>


3.4. Allow constant values in place of column names
---------------------------------------------------
There are reasonably frequent cases where some constant value is to be  
added to all resources generated by a ClassMap. For example, in the  
case of the emp table, a triple of the shape

     <http://xyz.com/emp/1234> emp:employeeOf <http://xyz.com/the_company 
 > .

might be added to each employee resource. Note that only the subject  
of the triple depends on the query result; predicate and object are  
the same for all triples. This could be achieved by having

     SELECT ... '<http://xyz.com/the_company>' AS company

in the query, along with a PropertyMap

     <PropertyMap PropertyName="emp:employeeOf" RDFTerm="IRI"
                  DBColumnName="company"/>

but this seems awkward and inefficient; a better alternative relying  
on syntactic sugar would be

     <PropertyMap PropertyName="emp:employeeOf" RDFTerm="IRI"
                  ConstantValue="http://xyz.com/the_company" />

and omit the company column from the query.
Received on Tuesday, 3 August 2010 21:48:45 UTC