Re: Comments on the SQL-based proposal from Souri Das on 2010-08-10 (public-rdb2rdf-wg@w3.org from August 2010)

From: Souri Das <Souripriya.Das@oracle.com>
Date: Tue, 10 Aug 2010 11:59:59 -0400
To: RDB2RDF WG <public-rdb2rdf-wg@w3.org>
Message-ID: <4C61777F.8010409@Oracle.com>
Very good points from Richard. I agree with most of the issues. Please 
see my comments inline. -- Souri.

Richard Cyganiak wrote:
> All,
>
> Here is a collection of comments on the mapping language proposal that 
> Souri presented in today's call. It's great that we now have a 
> concrete proposal on the table that we can iteratively refine. I refer 
> to the XML schema and example instances here:
>
> http://www.w3.org/2001/sw/rdb2rdf/wiki/An_XML_Schema_for_SQL-based_RDB2RDF_Mapping 
>
> http://www.w3.org/2001/sw/rdb2rdf/wiki/Revised_Example_of_SQL-Query_based_Approach 
>
>
> I have also previously studied Souri's presentation from the F2F, 
> which presented a previous version of the same idea in an abstract 
> syntax:
>
> http://www.w3.org/2001/sw/rdb2rdf/wiki/File:RDB2RDF_Map_viaSQL_F2F_at_SemTech2010.pdf 
>
>
> I grouped my comments into three categories:
>
> 1. Non-functional improvements, that is, issues where I believe that 
> the same effect could be achieved in a better way;
>
> 2. Proposals for adding expressivity, to make sure that the language 
> covers all of RDF and is suitable for some real-world mapping tasks 
> that currently are impossible or very hard to achieve;
>
> 3. Proposals related to authoring convenience, that is, cases where 
> maintainability and usability can be improved by adding some syntactic 
> sugar or other optional features.
>
> As a further comment, I don't believe that XML is the right choice for 
> the surface syntax of this language, and I think that the mapping 
> itself should be expressed as an RDF graph. But that is a separate 
> discussion, and I will for now assume that an RDF representation of 
> the mapping itself could be derived easily at a later time.
>
> Cheers,
> Richard
>
>
>
> 1. Non-functional improvements
> ==============================
>
> 1.1. Don't enclose generated URIs in pointy brackets
> ----------------------------------------------------
> The proposal demands the generation of URIs enclosed in pointy 
> brackets, for example:
>
> SELECT '<xyz.com/emp/' || empno || '>' AS empURI
>
> This is unnecessary, the following works just as well:
>
> SELECT 'xyz.com/emp/' || empno AS empURI
>
That is fine with me. Since any such properties will be marked with 
attribute RDFTerm="IRI", the specification reader will know that the 
value should be a URI/IRI.
>
> 1.2. Name of InstURIPropertyMap
> -------------------------------
> The element defines instance URIs, but has nothing to do with 
> properties, it should be renamed to reflect that.
>
I agree. It identifies the subject, not any property. I propose changing 
it from InstURIPropertyMap to either InstURISubjectMap or just 
InstanceURIMap.
>
> 1.3. Issues with GraphURIPropertyMap
> ------------------------------------
> It appears that GraphURIPropertyName is unnecessary; just providing a 
> way of generating a graph name URI is sufficient for placing triples 
> into an appropriate graph. What is the use of this property?
>
I disagree for the following reason: For "pointing from the PropertyMap 
to the GraphURIPropertyMap" (your next point, which I agree with) we 
need a GraphURIPropertyName to allow the many-to-many relationship 
between the PropertyMaps and GraphURIPropertyMaps.

> To selectively place different properties of a ClassMap in different 
> graphs, the proposal allows referencing of the PropertyMaps from the 
> GraphURIPropertyMap by specifying the RDF property name used in the 
> PropertyMap. This only works if we assume that RDF properties are 
> unique within a ClassMap. That's an unnecessary restriction. A cleaner 
> design would be to invert the direction of the reference, pointing 
> from the PropertyMap to the GraphURIPropertyMap that specifies where 
> the generated triple should reside.
>
I agree that we could allow two or more PropertyMaps to have the same 
value for their PropertyName attributes. (It gets a bit odd if two 
different PropertyMaps have the same value for their PropertyName 
attributes, but different values for their RDFTerm attributes! We could 
restrict it, but "anything goes in RDF." :-))

If we do allow this (i.e., multiple PropertyMaps with same value for 
their PropertyName attributes), then we need to have pointers from 
PropertyMaps to GraphURIPropertyMaps. But, as I noted earlier, that 
would also require having GraphURIPropertyName attributes.
>
> 1.4. Issues with KeyMaps
> ------------------------
> It appears that the proposal requires specification of all key 
> constraints in the mapping language. Often this will duplicate 
> information that can be determined automatically from the database 
> schema. This should be avoided if at all possible.
I disagree for the following reason: Unless we are using tables directly 
(using a tableName element instead of SQLDefString, as you pointed out 
in 3.3), it is difficult to expect constraint definitions to be 
available from the database for a specified SQLDefString. So, in the 
simple case, we can skip requiring KeyMaps (and many other things) in 
the mapping, but we need to allow this for the general case.
>
> It appears that the proposal works by expressing constraints (PK, 
> unique, FK) on RDF properties. It would be better to express them as 
> constraints on the columns in the result sets generated by the 
> views/queries. This is because the constraint-carrying column may 
> actually not be exposed directly in the generated RDF. For example, 
> the PK column of a table is often not mapped to a property, but 
> becomes part of the subject URI. Thus, in the emp table, the PK 
> constraint should not be expressed on the emp:empno property, but on 
> the empno (or empURI) column in the view.
This is very important issue and I'll simplify the mapping specification 
requirement based on the following arguments:

Although the XML schema that I presented allowed specification of the 
key definitions, we should avoid that because it may be complicated and 
I think it is unnecessary.

The questions are
1) what does the query writer need regarding keys to be able to use it 
in his/her query and
2) what do we really need to know about keys for SPARQL-to-SQL translation?

The mapping needs a (special) PropertyMap corresponding to a referential 
constraint so that the corresponding PropertyName can be used by the 
query writer to write graph patterns such as {... ?e   :c_ref_deptno  ?d 
...}. To translate it what we need is to know the corresponding views 
and the join condition: FROM emp e1, dept d1 and WHERE ... "e1.deptno = 
d1.deptno". Note that actual constraint could use expressions (requiring 
complex condition such as, e1.deptno+1 = d1.deptno-1).

So, all we need to have in the mapping specification are:
1) the PropertyMap corr to the referntial constraint and
2) a template for the join condition: one or more of:  
<child_table>.<expression-using-child-table-cols> = 
<paraent_table>.<expression-using-parent-table-cols>

We also need the cols used in the expressions to be projected in the 
SELECT-lists of the respective (parent and child) tables.

So, in summary, we do not need to know the complete definitions of keys, 
we just need the PropertyMap (for query writer to use in his/her query) 
for the referential constraint(s) and a template for the SPARQL-to-SQL 
translator to use when translating use of such a property in the SPARQL 
query. However, the columns used in the template must be projected from 
the SQLDefString defintiions.
>
>
> 2. Adding expressivity
> ======================
>
> 2.1. Allow blank nodes as subjects
> ----------------------------------
> The proposal only allows URIs in the subject positions of triples. RDF 
> also allows blank nodes. This could be addressed by allowing an 
> InstBNodeMap with one or more DB columns in place of the 
> InstURIPropertyMap.
I agree. But, we will have to assume that conceptually a different bNode 
label is used for each row.
>
>
> 2.2. Allow language tags for literals
> -------------------------------------
> The proposal does not allow the specification of language tags on 
> literals. This could be addressed by adding an attribute to the 
> PropertyMap. It is worth exploring wether computed language tags, 
> whose values come from a view column rather than from a constant 
> attribute value, should be supported as well.
I agree. Specifying an attribute (e.g., Language="en-us") will allow all 
values to be associated with a constant language tag. However, the 
computed language-tag case can be handled without any special provision 
simply be having the value, computed for the DBColumnName, to have a 
@<lang-tag> component as part of the value. (I am assuming that computed 
language tag is not restricted to a predefined set, so it is okay to 
have say xy-abc as a language-tag even though there is no such language.)
>
>
> 2.3. Allow custom datatypes for literals
> ----------------------------------------
> The proposal doesn't yet fully describe how datatypes would be 
> handled, but the intention seems to be that the datatype is determined 
> by the expression that defines the view column. There language should 
> support overriding the datatype with an arbitrary datatype URI. It 
> would be the mapping author's responsibility to ensure that the view 
> column contains values that are legal lexical forms for that datatype. 
> The datatype could be specified in an attribute of the PropertyMap. It 
> is worth exploring wether computed datatypes should be allowed.
Yes, that was the assumption, basically to avoid having to check whether 
somebody is associating a database string-typed column value "John" to 
an xsd:integer datatype. But, there are some benefits relaxing this as 
long as a validation step is required for any mapping specification. 
Regarding computed datatype, I am not in favor because static validation 
is not possible.
>
>
> 2.4. Computed property URIs
> ---------------------------
> It is worth exploring wether computed property URIs should be allowed. 
> In addition to the ProperyName attribute, which provides a constant 
> property URI, there could be a PropertyColumnName that refers to a 
> view column which contains computed property URIs. A use case is a 
> table with an entity-attribute-value schema, which is reasonably 
> common in complex database schemas.
We can have an element ComputedPropertyMap with PropertyDBColumnName as 
an attribute instead of PropertyName.

<ComputedPropertyMap PropertyDBColumnName="attr" DBColumnName="value" />

>
>
> 2.5. Lookup tables
> ------------------
> The proposal doesn't support the use of lookup tables. Often, a 
> database column contains a short alphanumeric code whose meaning is 
> described in some documentation outside of the database. If this is to 
> be mapped into “nice” RDF, with proper meaningful URIs instead cryptic 
> code, then a lookup table has to be used. If that table is not part of 
> the database, and we cannot provide it as part of the database, then 
> the lookup table has to be provided as part of the mapping. A sketch:
>
> <LookupTable id="expense-types">
>   <Entry code="04" value="http://example.org/expense-types/airfare" />
>   <Entry code="05" 
> value="http://example.org/expense-types/bus-train-ferry" />
>   <Entry code="24" 
> value="http://example.org/expense-types/subsistence" />
>   <Entry code="53" 
> value="http://example.org/expense-types/hospitality" />
> </LookupTable>
>
> <ClassMap ...>
>    <SQLdefString>SELECT ... empno ...</SQLdefString>
> </ClassMap>
>
> <PropertyMap DBColumnName="exptype" LookupTable="expense-types" 
> PropertyName="ex:expenseType" />
>
> In theory, this can already been done by writing a lot of nested IFF 
> statements in the view definition, but it appears to be a reasonably 
> common case that warrants special syntax.
This can be done by use of CASE ... END (or decode) statement in the 
SELECT clause. But, if the number of items requiring such mapping is 
large, then it may become too verbose. So, I agree that if the mapping 
language allows specifying such a mapping of values, one or more 
PropertyMaps could use it by having an optional attribute such as 
ValueMapping="lookupTable1". However, if the table is big, using that in 
the SPARQL-to-SQL translation will be difficult and may create a large 
query.
>
>
> 3. Usability and authoring considerations
> =========================================
>
> 3.1. Support for URI templates
> ------------------------------
> While it is possible to construct pretty much arbitrary URIs in SQL by 
> means of concatenation and string functions, SQL is not always the 
> best approach. There should be a way of constructing URIs from view 
> columns outside of the SQL query. Reasons include: i) The same 
> URI-generating expression is often needed multiple times throughout a 
> mapping, which leads to redundancy and imposes maintenance costs. ii) 
> URI generation often involves string escaping, case changes etc, which 
> quickly leads to unwieldy SQL expressions. iii) Standard SQL functions 
> may not cover all needs for URL character escaping. A proposal would 
> be to use a syntax inspired by the URI Templates draft [1] to declare 
> URI patterns at the beginning of the file. The patterns can then be 
> referenced from throughout the file. This would be optional; SQL 
> expressions could still be used for URI construction.
>
> [1] http://www.ietf.org/id/draft-gregorio-uritemplate-04.txt
>
> Example:
>
> <URITemplate id="employee" value="http://example.com/emp/{empno}" />
>
> <ClassMap ...>
>    <SQLdefString>SELECT ... empno ...</SQLdefString>
>    <InstURIPropertyMap template="employee" />
>    ...
> </ClassMap>
>
It seems fine to me.
>
> 3.2. Multiple maps for a single class
> -------------------------------------
> I believe that the proposal assumes that the value of @ClassName on 
> the ClassMap element is unique throughout In many schemas there might 
> be multiple tables whose records represent instances of the same 
> domain vocabulary class. For example, a mapping that generates contact 
> information from a university database might map records in the STAFF 
> table and in the STUDENT table to the same class foaf:Person. This 
> *can* be achieved with a single ClassMap, by forming a SQL UNION of 
> the two tables after appropriate transformations. But this can be 
> awkward because the two tables could have quite different columns, 
> e.g., staff have OFFICE_NO and CONSULTATION_HOURS, while students have 
> MAJOR and FACEBOOK_ID. For authoring convenience it should be allowed 
> to specify two separate ClassMaps, rather than shoehorning everything 
> into a single view. (At that point, ClassMap should probably be 
> renamed to ViewMap or QueryMap or something similar, because it then 
> represents the mapping from a single SQL view/query, and no longer the 
> mapping to a single class.)
>
> I note that both 3.1 and 3.2 are helpful to support scenarios where an 
> ontologist prepares the mapping, and a DBA fills in the SQL queries.
>
>
Not sure. Need to spend some more time thinking about this.
> 3.3. Referring to a table instead of a query/view
> -------------------------------------------------
> To make the simple case simple, there should be a <tableName> element 
> that could be used in place of <SQLdefString>. The following markup
>
>    <tableName>EMP</tableName>
>
> would be a shorthand for the following:
>
>    <SQLdefString>SELECT * FROM EMP</SQLdefString>
>
I agree.
>
> 3.4. Allow constant values in place of column names
> ---------------------------------------------------
> There are reasonably frequent cases where some constant value is to be 
> added to all resources generated by a ClassMap. For example, in the 
> case of the emp table, a triple of the shape
>
>     <http://xyz.com/emp/1234> emp:employeeOf 
> <http://xyz.com/the_company> .
>
> might be added to each employee resource. Note that only the subject 
> of the triple depends on the query result; predicate and object are 
> the same for all triples. This could be achieved by having
>
>     SELECT ... '<http://xyz.com/the_company>' AS company
>
> in the query, along with a PropertyMap
>
>     <PropertyMap PropertyName="emp:employeeOf" RDFTerm="IRI"
>                  DBColumnName="company"/>
>
> but this seems awkward and inefficient; a better alternative relying 
> on syntactic sugar would be
>
>     <PropertyMap PropertyName="emp:employeeOf" RDFTerm="IRI"
>                  ConstantValue="http://xyz.com/the_company" />
>
> and omit the company column from the query.
I agree.
Received on Tuesday, 10 August 2010 16:00:31 UTC