The Issue of Translatability (forward from Orri Erling)

Colleagues

The below contains some highlights of the conversation of March 23,
2010.  I then expand on some of these issues and finish with a
description of what ought to be proven and what restrictions will be
practically necessary for dealing with the impedance mismatch of SPARQL
and SQL.

Regards

Orri






The Issue of Translatability

Can an arbitrary SPARQL query be translated into SQL, (1) against a
triple table, and (2) against a relational schema?

Souripriya affirms so. Orri seconds, but after the experience of doing
both for some years, qualifies the statement with certain specifics.

Does such translation and its possibility require formal proof?

Souripriya declares that such a proof is trivial. The grounds for this
are that since SQL is accepted as expressive enough, the mapping itself
consists of column renaming only, and column renaming is so simple that
it does not require formal proof. Any schema alignment is expressed as
SQL views/derived tables/scalar subqueries, putting all the complexity
on the SQL side.

We note that as soon as there are expressions in the mapping, these
expressions need defined formal semantics.  But if the expressions are
on the SQL side, then such semantics is not part of the mapping.
 
Eric points out that if one transforms an RDB into XML, then applies
XSLT to that, and then XQuery to the result, it can be hard in practice,
even though often possible in theory, to take the XQuery and map it into SQL
over the original table. So, by analogy, if the mapping is very oblique,
passing through many SQL complexities, it will be hard to implement a
SPARQL to SQL translation that can make good SQL.

Ahmed says that if all you do is query one database, you may just as
well stay with SQL. The mapping will be needed in cases of integration
between multiple, semantically similar but structurally different
sources. Given this, one again needs to establish the semantics of the
mapping.

Eric asks Richard if one can write queries in D2RQ that will not map
into a single SQL statement.
Richard says that this is easy; there will be cases where multiple SQL
queries are made and combined in the D2RQ layer. Orri notes that
Virtuoso always makes one SQL statement for one SPARQL query, but that
the SQL in question uses proprietary extensions for cases like
variables in the predicate position. Setting aside variables in
predicate position, SQL with no or little proprietary extension
results.

Lee asks what the difference between SQL style and RDF style is. Orri
says that it is one of syntax flavor. Internally, things will be RDF
style, but the RDF style can be inferred from an SQL-looking outward
syntax, and an SQL-looking outward syntax is likely to be better
accepted by a SQL public.

***

The following is commentary by Orri on the above excerpts of the
conversation.

The key issue seems to be whether the mapping reduces to column renaming
and identifying columns as subject/predicate/object/graph and indicating
whether a string column is considered a string or an IRI.  We note that
the columns can be expressions and that the where clause may have
arbitrary SQL.  This is pretty much Virtuoso RDF Views.

As a shorthand, we can then make multicolumn selects and pick one column
as subject and other columns as values of fixed predicates.  This gives
us Triplify.

We could argue that we need a proof of the possibility of transforming
arbitrary SPARQL into SQL via such a mapping. We especially note that
many views/select statements may produce the same
subject/predicate/object. Thus any SQL generator needs to know when to
make a union. For performance, it will need to know when not to make
unions. Dealing with unions, and not generating union terms which are
always empty at run time, is the principal difficulty of mapping in
OpenLink's experience.

However, the matter of dealing with a triple potentially coming from
multiple sources is, in the strict sense, outside the scope of the
mapping WG and is an implementation optimization. On the other hand, we
could argue that the standard being developed ought to support such
optimization where possible, or at least ought not to make such
optimization harder than it needs to be.
We note that the primary use case is integration, i.e. similar looking
triples from many sources.

It has been pointed out that the WG's agenda is to make a mapping
language, and that this mapping language does not need to be concerned
with implementation, specifically not with the ways in which SPARQL is
translated to SQL. This is technically true. We note that in our
experience, some hints have proven necessary. These are necessary
because there is information which is not expressed in the RDB schemas
but is still relevant for optimizing away unions. The mapping author can
know that domains are disjoint even though the RDB schemas do not say
so, for example.  Thus a mechanism for expressing such meta information
is needed in practice.  Whether it can be agreed upon in the WG is a
different matter.


Souripriya's drastic simplification is to cast aside such considerations
and to say that all semantics will be in SQL and the mapping is rename
only, plus saying which column is the subject, which is an IRI and which
a scalar.

Under such a scheme, putting said meta information in the XML (or
other) glue remains possible. In triple-centric models said
information goes into the vocabulary.

If the SQL in the mappings is very complex, then the problem of SPARQL
to SQL is similar to reuse of materialized views, for example.  One
needs to compare queries and determine if one subsumes the other.  
If the SQL in the mappings is simple, then a conversion into a
triple-oriented  declaration system like D2RQ or Virtuoso is almost
trivial.


***

Let us consider what precisely we should prove in order to say something
about the expressibility of SPARQL in SQL through a mapping.
We will first consider a fairly powerful mapping, i.e. Virtuoso quad
maps. We will then restrict this to make the matter more manageable.

With quad maps, a mapping is expressed as a collection of SQL select
statements  of exactly 4 columns each.  The columns are the G, S, P and
O of an RDF quad (triple plus graph).  If there is a row in any of the
select statements constituting the mapping, the quad is considered to
exist and is thus retrievable by any SPARQL query that would find it if
the quad physically existed in a quad store.  The columns are SQL scalar
expressions.  We declare whether the O is a literal or IRI.  Of each
IRI, we can specify an IRI class.  Of the IRI classes we may assert
subclass and disjointness. The where and from clauses of the selects
are de facto unrestricted, since the quad map may refer to a view.
Virtuoso expands the views; hence, if the tables are in another DBMS,
the other DBMS can be read-only.
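As an illustration, such a quad map can be sketched as an ordinary SQL
select of exactly four columns. The table, IRI prefixes and data below
are hypothetical, and SQLite stands in for the mapped DBMS:

```python
import sqlite3

# Hypothetical PERSON table standing in for a mapped relational source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO person VALUES (?, ?)",
                 [(1, "Alice"), (2, "Bob")])

# One quad map: a select of exactly four columns, G, S, P and O.
# The S column is a SQL expression minting an IRI from the primary key;
# a row in this select means the corresponding quad exists.
quad_map = """
SELECT 'http://example.org/people'        AS g,
       'http://example.org/person/' || id AS s,
       'http://xmlns.com/foaf/0.1/name'   AS p,
       name                               AS o
FROM person
"""
quads = conn.execute(quad_map).fetchall()
```

Any SPARQL query that would find these quads in a physical quad store
can then be answered by querying this select instead.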

We note that a Triplify or Oracle proposal style syntax where a constant
predicate is assigned to columns of a multi-column select with an AS or
similar is trivially transformable into the above quad maps.  This is
why we do not particularly worry about syntax.

We are now to prove that any SPARQL 1.1 query may be transformed into
SQL via such maps.

We encounter the following issues:

We query select ?o from <people> where { <person1> foaf:name ?o }

Suppose that multiple quad maps match the <people> graph and the
foaf:name predicate. Suppose the expression in one of them is an
NVARCHAR and in another a VARCHAR. The single result column has two
data types. Logic needs to exist to apply the right cast when generating
the SQL, or the SQL engine evaluating the union must be typed at run
time. The latter is not often the case, even though Virtuoso and
probably MS SQL Server could do it. Others could do it through the use
of UDTs, which would complicate things and take space and time at run
time. UDTs are also not very standardized, even though most DBMSs
have some form of them.
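A minimal sketch of the generated union with an explicit cast to a
common supertype follows. The tables are hypothetical, and SQLite
(which accepts arbitrary type names in CAST) stands in for the DBMS:

```python
import sqlite3

# Two hypothetical tables both mapped to the <people> graph and the
# foaf:name predicate, with differently typed name columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (id INTEGER, name VARCHAR(50))")
conn.execute("CREATE TABLE contractors (id INTEGER, name NVARCHAR(50))")
conn.execute("INSERT INTO staff VALUES (1, 'Alice')")
conn.execute("INSERT INTO contractors VALUES (2, 'Bob')")

# The SPARQL pattern { <person1> foaf:name ?o } becomes a union; the
# generator casts the VARCHAR branch so the single result column has
# one type (NVARCHAR as the common supertype).
sql = """
SELECT CAST(name AS NVARCHAR(50)) AS o FROM staff WHERE id = 1
UNION ALL
SELECT name AS o FROM contractors WHERE id = 2
"""
names = {row[0] for row in conn.execute(sql)}
```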


We query select ?p ?o from <people> where { <person1> ?p ?o}

Again, multiple quad maps match and we get a union. The union has
single-column selects, one per mapped column of each table. The type
is again arbitrary, requiring typing at run time. Since the same row is
accessed for each column, the optimizer can merge these into a single
multi-column select, but then the single row that is fetched from each
table containing people will have to be presented as multiple rows in a
result set. Again, this is not impossible, through the use of a
table-valued function. Virtuoso has a special SQL extension for this but
could also do it with table-valued functions. But table-valued
functions are a vendor-specific thing, so we are outside of core SQL.
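The per-row expansion can be sketched as follows; the
column-to-predicate assignment is hypothetical. This is the effect a
table-valued function would have, unpivoting one fetched row into one
result row per mapped column:

```python
# One row fetched from a hypothetical person table.
row = {"name": "Alice", "age": 30}

# Hypothetical mapping from column names to constant predicates.
predicate_of = {"name": "foaf:name", "age": "foaf:age"}

# For { <person1> ?p ?o }, the single underlying row is presented as
# multiple (?p, ?o) bindings, one per mapped column.
bindings = [(predicate_of[col], val) for col, val in row.items()]
```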

We suppose a gender column in two person tables. One has M and F, the
other 0 and 1. Both map to an IRI representing gender; the mapping is
done by a user-defined function in each case.

We query

select count (*) from  <people> where {?p a <person> . ?p <gender>
<male> }

The SQL becomes a count over a union, where each union term selects
rows with UDF(gender) = constant.

In the general case, there should be a function index or the SQL
implementation or the SPARQL to SQL mapping should know about the
inverse function of the gender UDF.  A function index on a low
cardinality column like gender is not very practical because one would
do a table scan rather than an index lookup with such cardinality.  So
for efficiency the inverse should be  declarable and supported.  We
could further   reasonably expect a SQL implementation to transform a
count of a union into a sum  of union term counts.
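A sketch of the inverse rewrite follows; the tables and IRIs are
hypothetical, and dictionaries stand in for the UDF and its declared
inverse:

```python
# Hypothetical UDFs mapping each table's gender encoding to an IRI.
to_iri_mf = {"M": "<male>", "F": "<female>"}   # first person table
to_iri_01 = {0: "<male>", 1: "<female>"}       # second person table

# Declared inverses, so the generator can push the constant down to
# the column instead of applying the UDF to every row.
from_iri_mf = {v: k for k, v in to_iri_mf.items()}
from_iri_01 = {v: k for k, v in to_iri_01.items()}

def count_terms(gender_iri):
    # One count per union term; a SQL implementation that turns
    # count(union) into a sum of per-term counts would add these up.
    return [
        f"SELECT COUNT(*) FROM person1 "
        f"WHERE gender = '{from_iri_mf[gender_iri]}'",
        f"SELECT COUNT(*) FROM person2 "
        f"WHERE gender = {from_iri_01[gender_iri]}",
    ]

terms = count_terms("<male>")
```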

Given our multiple tables with people, let us join to department and
then group by the head of department, counting the people working under
each department head.  We suppose there are multiple EMP and DEPT tables
and that each  DEPT table has  a fk into EMP for the department head.

select ?head count (*) from <people> where { ?p a <person> . ?p
<has_dept> ?dept . ?dept <has_dept_head> ?head } group by ?head.

Each EMP table has employee IDs potentially overlapping with the other
ones. So we must not join between the wrong EMP and DEPT tables. This
is accomplished by mapping each with a distinct IRI scheme, so that the
intersection of the subject IRIs is empty.

The SQL generator must know that the intersection of the EMP IRIs from
the many EMP tables is empty, else it will generate a union where each
EMP table is joined with each DEPT table. Of these, only one will be
non-empty due to the distinct IRI schemes, but the implementation will
end up doing prohibitive extra work unless it is smart about this.
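The pruning can be sketched as follows, with hypothetical table names
and a declared IRI class per table; union terms whose subject IRI
classes are disjoint are dropped before any SQL is generated:

```python
# Hypothetical declaration: each table mints subject IRIs in one class.
iri_class = {"emp_a": "a", "dept_a": "a", "emp_b": "b", "dept_b": "b"}

def join_terms(emp_tables, dept_tables):
    terms = []
    for emp in emp_tables:
        for dept in dept_tables:
            # A cross-source join is provably empty, because the IRI
            # classes are declared disjoint; keep only same-class pairs.
            if iri_class[emp] == iri_class[dept]:
                terms.append(f"{emp} JOIN {dept}")
    return terms

# Two terms survive instead of the four a naive generator would emit.
terms = join_terms(["emp_a", "emp_b"], ["dept_a", "dept_b"])
```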

Up to now, we have not found anything impossible. But we have found
many cases which are quite awkward to map to SQL and require
features outside of the standard (UDTs, run-time typing, table-valued
functions).

We note that for mapping to be viable where it is most needed, i.e. data
integration for supporting analytics, the SQL produced and executed
cannot be substantially different from what any relational report
generator would make. DBMSs are written to deal fairly well with
such SQL, not necessarily with SQL that has UDTs in place of base data
types or table-valued functions inside large joins.

There is a big difference between feasible in theory and workable in
practice.  

Let us now look at what we ought to prohibit in order to stay within
non-extended SQL. The variable in the predicate position has to go.
Also, we must require that the types declared for O's of the same P
all be castable into a common supertype within base SQL, e.g. varchar
and nvarchar become nvarchar. With numbers we already have more of an
issue with integer, decimal and double. Depending on whether the
application wants precision or range, the correct answer is decimal or
double. Such things must be declarable in mappings, since the answer
is application dependent.

Prohibiting a variable P, we would ipso facto always have a constant in
the P position of the quad map. If a variable P were allowed there but
prohibited in the queries, we would not be free of the type casting
problems. Aside from this, we do not have any particular problem with
the G being computed.

When we require the P of the quad maps to be constant, we end up with a
mapping that is most readily expressible a la Triplify, i.e. a
multicolumn select where exactly one column is declared to be the
subject, one is declared as the graph (most often a constant), and the
others are assigned constant predicates.
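Such a mapping and its expansion into four-column quad maps can be
sketched as follows; the table, IRI prefix and predicate names are all
hypothetical:

```python
# A Triplify-style mapping: one select, one subject expression, and a
# constant predicate assigned to each remaining column.
mapping = {
    "graph": "http://example.org/people",
    "table": "person",
    "subject": "'http://example.org/person/' || id",
    "predicates": {"name": "foaf:name", "age": "foaf:age"},
}

def to_quad_maps(m):
    # One four-column quad map per predicate-bearing column, which is
    # why the choice of surface syntax does not matter much.
    return [
        f"SELECT '{m['graph']}' AS g, {m['subject']} AS s, "
        f"'{pred}' AS p, {col} AS o FROM {m['table']}"
        for col, pred in m["predicates"].items()
    ]

quad_maps = to_quad_maps(mapping)
```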

Thus we should prove that  given mappings of this sort, SPARQL 1.1 with
no variables in predicate position transforms to single SQL
statements.  

This seems to be the case, as experience demonstrates that even
unrestricted SPARQL transforms into single SQL statements, albeit
relying on run time types and other extensions.  We have not undertaken
a formal proof of this but the WG could well produce such.

Received on Friday, 26 March 2010 14:00:46 UTC