Re: RDB2RDF mapping: Do we really need any alternative to use of SQL queries with conventions and a "trivial" mapping language? from Richard Cyganiak on 2010-03-22 (public-rdb2rdf-wg@w3.org from March 2010)

From: Richard Cyganiak <richard@cyganiak.de>
Date: Mon, 22 Mar 2010 23:47:01 +0000
To: ashok.malhotra@oracle.com
Cc: Souripriya Das <souripriya.das@oracle.com>, Public-Rdb2rdf-Wg <public-rdb2rdf-wg@w3.org>
Message-Id: <CA71FAB8-BC37-41B5-ADCD-1AA0E5BF16F9@cyganiak.de>
Ashok,

On 22 Mar 2010, at 22:03, ashok malhotra wrote:
> I think it's a poor argument to say that because MySQL does a poor  
> job with sub-selects
> we should abandon the SQL-view approach.

I did not say that the SQL query approach should be abandoned.

I also did not cite problems with sub-selects as a reason for my  
reservations about the approach.

So let's turn this around.

There are several perfectly fine RDB2RDF mapping languages and RDB2RDF  
implementations that have been around for years and that do quite well  
on those “lesser” databases. None of them use the SQL query approach.  
It would be hardly unreasonable to argue that one of those proven  
approaches should be standardised by the WG, rather than the SQL query  
based approach.

 From that point of view, the need for the SQL query based approach  
has not been conclusively shown. It is not even clear if this approach  
can be implemented efficiently on any database engines other than yours.

Best,
Richard



> As I said in my earlier note
> Pl. take a look at http://dev.mysql.com/doc/refman/5.1/en/rewriting-subqueries.html
> which explains how to rewrite subqueries as joins.
>
> Also, the focus of this WG is on mapping Relational Data to RDF, not  
> on optimizing SPARQL
> queries.  I think this is a very interesting area in which various  
> vendors will compete and some
> approaches and some databases will do better than others.
> All the best, Ashok
>
>
> Richard Cyganiak wrote:
>> On 22 Mar 2010, at 06:03, Souripriya Das wrote:
>>> So far I have not seen or heard any convincing arguments to  
>>> establish that we need anything more than SQL and a "trivial"  
>>> mapping language. Before going for an alternative, we must first  
>>> establish the need for such an alternative.
>>
>> Fair enough.
>>
>> I have thought a lot about this in the previous weeks and have been  
>> flip-flopping on the issue. This is why I haven't written up a  
>> better explanation of the problem previously -- I'm not yet 100%  
>> sure what my own opinion on the issue is.
>>
>> Anyway, I will explain it here as good as I can. This is going to  
>> be quite long, sorry about that.
>>
>>
>> 1. Why the SQL query based approach is nice
>> -------------------------------------------
>>
>> First, I definitely see the attraction of the SQL query based  
>> approach. I understand this approach as: leveraging the  
>> expressivity of SQL to do as much of the mapping/transformation as  
>> possible, with some simple glue around it that essentially turns  
>> each SQL result record into a few triples according to some simple  
>> rules.
>>
>> It's attractive because the approach leverages existing SQL  
>> knowledge of mapping authors; it maximises expressivity; it means  
>> we don't have to specify a large chunk of the problem ourselves; it  
>> produces syntactically compact mappings. So, purely from an  
>> authoring point of view it is definitely a nicer approach than any  
>> of the proposed alternatives (D2RQ, Virtuoso RDF views, R2O etc).
>>
>> In order to run SPARQL queries against such a mapped database, one  
>> would use the “triple view” approach, as detailed in Juan's work.  
>> So the SPARQL-to-SQL engine would create a single view in the DB  
>> which consists of lots of unions and in the end contains one row  
>> for each mapped triple, with subject, predicate and object. How to  
>> run SPARQL queries against such a relational structure is well- 
>> known from prior work on database-backed triple stores. The result  
>> is a humongous SQL query over a humongous view definition, but as  
>> Juan has shown, good SQL optimizers can simplify this into a  
>> reasonable query plan.
>>
>> So here is why I argue against this approach.
>>
>>
>> 2. Why the SQL query based approach fails in some cases
>> -------------------------------------------------------
>>
>> First, I assume read-only access to the database. I cannot create  
>> custom views. So, to run SPARQL queries with the approach above,  
>> I'd have to use sub-SELECTs rather than views, which in theory  
>> should work just fine and should be an implementation detail.
>>
>> But second, I assume that we use the query optimizer of MySQL,  
>> which is unable to simplify the humongous SQL query from the  
>> approach described above into something that runs in acceptable  
>> runtime (as I demonstrated in [1]).
>>
>> Now if you happen to work for Oracle then you might say, “well they  
>> should just use a real database.” We can all chuckle about that for  
>> a minute and then get back to business. There are existing systems,  
>> such as D2RQ, that, whatever their limitations, produce decent  
>> performance of MySQL and other “lesser” database engines. This  
>> group *has* to standardise on a solution that is implementable on  
>> such engines.
>>
>> So, how do we get acceptable performance on MySQL and other  
>> “lesser” RDBMS, if we cannot use the “triple view” or “triple  
>> subselect” approach?
>>
>> Well, we cannot translate SPARQL queries into humongous SQL queries  
>> and then rely on the DB engine to simplify it so it runs in a  
>> reasonable time. We have to be smarter in the translation, and  
>> create SQL queries that are reasonably optimised straight away. I  
>> will not get into the details, which are complicated, but it means  
>> we can no longer treat the mapping's SQL queries as opaque blobs of  
>> SQL text that we can just pass to the DB without looking at them --  
>> we have to dive into the SQL queries that define the mapping,  
>> analyse what they are doing, and take them apart.
>>
>>
>> 3. How has this problem been solved in practice to date?
>> --------------------------------------------------------
>>
>> Here is the “worse is better” approach to solve this problem: We  
>> can ask the *mapping author* to do the work for us and decompose  
>> the SQL query into simpler elements (join conditions, projection  
>> expressions, selection conditions and so on) and explain how they  
>> relate to each other through the structure of the mapping file.  
>> Then the SPARQL-to-SQL translation engine can build the optimised  
>> SQL query straight from these simpler SQL fragments. This is what  
>> is done in the D2RQ mapping language (see [2]).
>>
>> It is noteworthy that, to my knowledge, *every* RDB2RDF system to  
>> date that supports the evaluation of SPARQL queries over mapped  
>> databases, and assumes read-only access to the database, has opted  
>> for an approach similar to this: D2RQ, OpenLink Virtuoso,  
>> SquirrelRDF, R2O. None of their mapping languages specify the  
>> mapping using complete SQL queries; all languages decompose the  
>> queries into small chunks.
>>
>> To the best of my knowledge, there is *no* existing implementation  
>> that supports SPARQL over the mapped database, supports read-only  
>> access, and uses a mapping language based on the SQL query  
>> approach. There are implementations of the SQL query approach that  
>> allow RDF dumps of a mappded database (e.g., D2R Map) or resource- 
>> based linked data style access (e.g., Triplify). But supporting  
>> SPARQL queries over the mapped database is a task that is a whole  
>> lot more difficult.
>>
>>
>> 4. How can we save the SQL query based approach?
>> ------------------------------------------------
>>
>> So AFAIK no one has implemented the SQL query approach to support  
>> SPARQL queries over mapped databases. It doesn't necessarily follow  
>> that it's impossible, or even a bad idea. Could we specify our  
>> mappings using arbitrary SQL queries, then translate SPARQL queries  
>> over those mappings to SQL, and still end up with reasonably  
>> optimised SQL queries?
>>
>> If this is possible at reasonable implementation cost, then it  
>> would be a great way forward.
>>
>> I can imagine two approaches.
>>
>> First, you could develop your own custom SQL optimizer that takes  
>> the humongous SQL query resulting from the triple view approach and  
>> optimizes it to make the DB engine happy. I assert without proof  
>> that the implementation cost for this is prohibitive, especially  
>> because one has to create a different SQL optimiser for each  
>> imperfect database engine that one wants to support (because their  
>> native optimisers have different weaknesses, and because their SQL  
>> dialects differ).
>>
>> Second approach: Do not allow arbitrary SQL queries in the mapping  
>> language, but only a restricted subset. Then write a SQL parser  
>> that is just smart enough to chop these restricted SQL queries into  
>> their elements (such as join conditions, projection expressions,  
>> selection conditions and so on).
>>
>> So, while the existing implementations (D2RQ, Virtuoso, etc) ask  
>> the mapping author to do the job of decomposing the query into  
>> simpler elements as part of the process of writing a mapping, we  
>> would now have a parser that does the same job -- its input is a  
>> restricted SQL query and its output are those simpler elements.
>>
>> In practice, this will not be as simple as it might sound. It  
>> appears that one of the design goals of SQL was to make parser  
>> implementation as difficult as possible. This is compounded by the  
>> many differences between SQL dialects.
>>
>> Nevertheless, this approach seems promising, and it *might* be a  
>> way of supporting SPARQL queries on MySQL and other “lesser” DB  
>> engines, over a mapping language that uses the SQL query based  
>> approach.
>>
>>
>> 5. Request for an existence proof
>> ---------------------------------
>>
>> It seems that the proponents of the SQL query based approach fall  
>> into two camps:
>>
>> 1. Those whose plan to rely on their DB engine's great optimizer  
>> for doing all the hard work, and don't care wether it works on  
>> other databases
>>
>> 2. Those who have not really been hit by the practicalities of  
>> implementing a SPARQL engine over such a mapping when no good SQL  
>> optimizer is available
>>
>> Let me repeat that I believe that the SQL query based approach is  
>> better than the alternatives on almost every scale. The only  
>> problem is that it has not been shown that it can be implemented at  
>> reasonable cost in the absence of an advanced SQL optimizer. My  
>> concern is this: If the group standardises an approach that is only  
>> implementable on Oracle and SQL Server, then the group has failed.  
>> I hope that there is consensus on this question; if not, better  
>> bring it on the table NOW.
>>
>>
>> If there was any implementation that used the SQL query based  
>> approach as a mapping language,
>> parsed the mapping's SQL queries, and translates SPARQL queries  
>> into SQL queries that are significantly simpler than the humongous  
>> SQL queries produced by the “triple view” approach, then I'd be a  
>> whole lot more confident that the SQL query based approach ban be  
>> made to work on databases such as MySQL.
>>
>> A good benchmark might be self-joins. Can you translate SPARQL  
>> queries over the mapped DB into SQL queries that don't contain self- 
>> joins (joining a table to itself on the PK)?
>>
>> So, how would you solve this? Can you make the SQL based approach  
>> work without an awesome SQL optimizer? What if you have to support  
>> multiple SQL dialects?
>>
>> Best,
>> Richard
>>
>>
>> [1] http://www.w3.org/2001/sw/rdb2rdf/wiki/PotentialSQLIssues
>> [2] http://www4.wiwiss.fu-berlin.de/bizer/D2RQ/spec/#specification
>>
>>
>>
>>>
>>> Thanks,
>>> - Souri.
>>>
>>
>>
Received on Monday, 22 March 2010 23:47:37 UTC