Re: RDB2RDF mapping: Do we really need any alternative to use of SQL queries with conventions and a "trivial" mapping language? from Souripriya Das on 2010-03-22 (public-rdb2rdf-wg@w3.org from March 2010)

From: Souripriya Das <SOURIPRIYA.DAS@oracle.com>
Date: Mon, 22 Mar 2010 14:44:03 -0700 (PDT)
To: <richard@cyganiak.de>
Cc: <public-rdb2rdf-wg@w3.org>
Message-ID: <1a751cd4-7063-46dc-b824-697b75117826@default>
Richard,

Thanks a lot for a very informative response. I enjoyed reading it (including the humor) and I was glad that it was long.

What you said about "triple view" makes sense. However, as I have indicated a few times during the telcons and also during our presentation:
   we do not use the "triple view" approach with SQL. 

Instead we use the "relational query" approach. Thus the resulting logical tables are not restricted to having only three (sub/pred/obj) columns. Each such logical table can have as many columns as needed. (Thus, our approach is very similar to Triplify's approach and quite different from UltraWrap's approach.)

With our approach, SPARQL-to-SQL translation generates SQL that is pretty concise -- definitely not "humongous" -- and also fairly easy to optimize (that can be accomplished even by "lesser" SQL optimizers :-)).

Thanks,
- Souri.

----- Original Message -----
From: richard@cyganiak.de
To: souripriya.das@oracle.com
Cc: public-rdb2rdf-wg@w3.org
Sent: Monday, March 22, 2010 1:40:09 PM GMT -05:00 US/Canada Eastern
Subject: Re: RDB2RDF mapping: Do we really need any alternative to use of SQL  queries with conventions and a "trivial" mapping language?

On 22 Mar 2010, at 06:03, Souripriya Das wrote:
> So far I have not seen or heard any convincing arguments to  
> establish that we need anything more than SQL and a "trivial"  
> mapping language. Before going for an alternative, we must first  
> establish the need for such an alternative.

Fair enough.

I have thought a lot about this in the previous weeks and have been  
flip-flopping on the issue. This is why I haven't written up a better  
explanation of the problem previously -- I'm not yet 100% sure what my  
own opinion on the issue is.

Anyway, I will explain it here as good as I can. This is going to be  
quite long, sorry about that.


1. Why the SQL query based approach is nice
-------------------------------------------

First, I definitely see the attraction of the SQL query based  
approach. I understand this approach as: leveraging the expressivity  
of SQL to do as much of the mapping/transformation as possible, with  
some simple glue around it that essentially turns each SQL result  
record into a few triples according to some simple rules.

It's attractive because the approach leverages existing SQL knowledge  
of mapping authors; it maximises expressivity; it means we don't have  
to specify a large chunk of the problem ourselves; it produces  
syntactically compact mappings. So, purely from an authoring point of  
view it is definitely a nicer approach than any of the proposed  
alternatives (D2RQ, Virtuoso RDF views, R2O etc).

In order to run SPARQL queries against such a mapped database, one  
would use the “triple view” approach, as detailed in Juan's work. So  
the SPARQL-to-SQL engine would create a single view in the DB which  
consists of lots of unions and in the end contains one row for each  
mapped triple, with subject, predicate and object. How to run SPARQL  
queries against such a relational structure is well-known from prior  
work on database-backed triple stores. The result is a humongous SQL  
query over a humongous view definition, but as Juan has shown, good  
SQL optimizers can simplify this into a reasonable query plan.

So here is why I argue against this approach.


2. Why the SQL query based approach fails in some cases
-------------------------------------------------------

First, I assume read-only access to the database. I cannot create  
custom views. So, to run SPARQL queries with the approach above, I'd  
have to use sub-SELECTs rather than views, which in theory should work  
just fine and should be an implementation detail.

But second, I assume that we use the query optimizer of MySQL, which  
is unable to simplify the humongous SQL query from the approach  
described above into something that runs in acceptable runtime (as I  
demonstrated in [1]).

Now if you happen to work for Oracle then you might say, “well they  
should just use a real database.” We can all chuckle about that for a  
minute and then get back to business. There are existing systems, such  
as D2RQ, that, whatever their limitations, produce decent performance  
of MySQL and other “lesser” database engines. This group *has* to  
standardise on a solution that is implementable on such engines.

So, how do we get acceptable performance on MySQL and other “lesser”  
RDBMS, if we cannot use the “triple view” or “triple subselect”  
approach?

Well, we cannot translate SPARQL queries into humongous SQL queries  
and then rely on the DB engine to simplify it so it runs in a  
reasonable time. We have to be smarter in the translation, and create  
SQL queries that are reasonably optimised straight away. I will not  
get into the details, which are complicated, but it means we can no  
longer treat the mapping's SQL queries as opaque blobs of SQL text  
that we can just pass to the DB without looking at them -- we have to  
dive into the SQL queries that define the mapping, analyse what they  
are doing, and take them apart.


3. How has this problem been solved in practice to date?
--------------------------------------------------------

Here is the “worse is better” approach to solve this problem: We can  
ask the *mapping author* to do the work for us and decompose the SQL  
query into simpler elements (join conditions, projection expressions,  
selection conditions and so on) and explain how they relate to each  
other through the structure of the mapping file. Then the SPARQL-to- 
SQL translation engine can build the optimised SQL query straight from  
these simpler SQL fragments. This is what is done in the D2RQ mapping  
language (see [2]).

It is noteworthy that, to my knowledge, *every* RDB2RDF system to date  
that supports the evaluation of SPARQL queries over mapped databases,  
and assumes read-only access to the database, has opted for an  
approach similar to this: D2RQ, OpenLink Virtuoso, SquirrelRDF, R2O.  
None of their mapping languages specify the mapping using complete SQL  
queries; all languages decompose the queries into small chunks.

To the best of my knowledge, there is *no* existing implementation  
that supports SPARQL over the mapped database, supports read-only  
access, and uses a mapping language based on the SQL query approach.  
There are implementations of the SQL query approach that allow RDF  
dumps of a mappded database (e.g., D2R Map) or resource-based linked  
data style access (e.g., Triplify). But supporting SPARQL queries over  
the mapped database is a task that is a whole lot more difficult.


4. How can we save the SQL query based approach?
------------------------------------------------

So AFAIK no one has implemented the SQL query approach to support  
SPARQL queries over mapped databases. It doesn't necessarily follow  
that it's impossible, or even a bad idea. Could we specify our  
mappings using arbitrary SQL queries, then translate SPARQL queries  
over those mappings to SQL, and still end up with reasonably optimised  
SQL queries?

If this is possible at reasonable implementation cost, then it would  
be a great way forward.

I can imagine two approaches.

First, you could develop your own custom SQL optimizer that takes the  
humongous SQL query resulting from the triple view approach and  
optimizes it to make the DB engine happy. I assert without proof that  
the implementation cost for this is prohibitive, especially because  
one has to create a different SQL optimiser for each imperfect  
database engine that one wants to support (because their native  
optimisers have different weaknesses, and because their SQL dialects  
differ).

Second approach: Do not allow arbitrary SQL queries in the mapping  
language, but only a restricted subset. Then write a SQL parser that  
is just smart enough to chop these restricted SQL queries into their  
elements (such as join conditions, projection expressions, selection  
conditions and so on).

So, while the existing implementations (D2RQ, Virtuoso, etc) ask the  
mapping author to do the job of decomposing the query into simpler  
elements as part of the process of writing a mapping, we would now  
have a parser that does the same job -- its input is a restricted SQL  
query and its output are those simpler elements.

In practice, this will not be as simple as it might sound. It appears  
that one of the design goals of SQL was to make parser implementation  
as difficult as possible. This is compounded by the many differences  
between SQL dialects.

Nevertheless, this approach seems promising, and it *might* be a way  
of supporting SPARQL queries on MySQL and other “lesser” DB engines,  
over a mapping language that uses the SQL query based approach.


5. Request for an existence proof
---------------------------------

It seems that the proponents of the SQL query based approach fall into  
two camps:

1. Those whose plan to rely on their DB engine's great optimizer for  
doing all the hard work, and don't care wether it works on other  
databases

2. Those who have not really been hit by the practicalities of  
implementing a SPARQL engine over such a mapping when no good SQL  
optimizer is available

Let me repeat that I believe that the SQL query based approach is  
better than the alternatives on almost every scale. The only problem  
is that it has not been shown that it can be implemented at reasonable  
cost in the absence of an advanced SQL optimizer. My concern is this:  
If the group standardises an approach that is only implementable on  
Oracle and SQL Server, then the group has failed. I hope that there is  
consensus on this question; if not, better bring it on the table NOW.


If there was any implementation that used the SQL query based approach  
as a mapping language,
parsed the mapping's SQL queries, and translates SPARQL queries into  
SQL queries that are significantly simpler than the humongous SQL  
queries produced by the “triple view” approach, then I'd be a whole  
lot more confident that the SQL query based approach ban be made to  
work on databases such as MySQL.

A good benchmark might be self-joins. Can you translate SPARQL queries  
over the mapped DB into SQL queries that don't contain self-joins  
(joining a table to itself on the PK)?

So, how would you solve this? Can you make the SQL based approach work  
without an awesome SQL optimizer? What if you have to support multiple  
SQL dialects?

Best,
Richard


[1] http://www.w3.org/2001/sw/rdb2rdf/wiki/PotentialSQLIssues
[2] http://www4.wiwiss.fu-berlin.de/bizer/D2RQ/spec/#specification



>
> Thanks,
> - Souri.
>
Received on Monday, 22 March 2010 21:45:59 UTC