RE: RDB2RDF mapping: Do we really need any alternative to use of SQL queries with conventions and a "trivial" mapping language? from Ezzat, Ahmed on 2010-03-23 (public-rdb2rdf-wg@w3.org from March 2010)

From: Ezzat, Ahmed <Ahmed.Ezzat@hp.com>
Date: Tue, 23 Mar 2010 14:33:22 +0000
To: "ashok.malhotra@oracle.com" <ashok.malhotra@oracle.com>, Richard Cyganiak <richard@cyganiak.de>
CC: Souripriya Das <souripriya.das@oracle.com>, Public-Rdb2rdf-Wg <public-rdb2rdf-wg@w3.org>
Message-ID: <3B7AE9BA67C72B4891EF21842246A21C8772A1E396@GVW1097EXB.americas.hpqcorp.net>
Hi All,

Let us avoid questioning motives.  I can share that sub-queries work well with many databases, including HP, Teradata, etc.
Regards,

Ahmed


-----Original Message-----
From: public-rdb2rdf-wg-request@w3.org [mailto:public-rdb2rdf-wg-request@w3.org] On Behalf Of ashok malhotra
Sent: Tuesday, March 23, 2010 5:16 AM
To: Richard Cyganiak
Cc: Souripriya Das; Public-Rdb2rdf-Wg
Subject: Re: RDB2RDF mapping: Do we really need any alternative to use of SQL queries with conventions and a "trivial" mapping language?

It's a bit of a leap to move from the performance problems of MySQL on 
subselects to
 > It is not even clear if this approach can be implemented efficiently 
on any database engines other than yours.
All the best, Ashok


Richard Cyganiak wrote:
> Ashok,
>
> On 22 Mar 2010, at 22:03, ashok malhotra wrote:
>> I think it's a poor argument to say that because MySQL does a poor 
>> job with sub-selects
>> we should abandon the SQL-view approach.
>
> I did not say that the SQL query approach should be abandoned.
>
> I also did not cite problems with sub-selects as a reason for my 
> reservations about the approach.
>
> So let's turn this around.
>
> There are several perfectly fine RDB2RDF mapping languages and RDB2RDF 
> implementations that have been around for years and that do quite well 
> on those "lesser" databases. None of them use the SQL query approach. 
> It would hardly be unreasonable to argue that one of those proven 
> approaches should be standardised by the WG, rather than the SQL query 
> based approach.
>
> From that point of view, the need for the SQL query based approach has 
> not been conclusively shown. It is not even clear if this approach can 
> be implemented efficiently on any database engines other than yours.
>
> Best,
> Richard
>
>
>
>> As I said in my earlier note
>> Please take a look at 
>> http://dev.mysql.com/doc/refman/5.1/en/rewriting-subqueries.html
>> which explains how to rewrite subqueries as joins.
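
A minimal sketch of the kind of rewrite that page describes, using Python's built-in sqlite3 module rather than MySQL. The `emp` and `dept` tables and their contents are made up for illustration; they do not come from this thread.

```python
import sqlite3

# Hypothetical schema, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE emp  (id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER);
    INSERT INTO dept VALUES (1, 'Berlin'), (2, 'Boston');
    INSERT INTO emp  VALUES (10, 'Ann', 1), (11, 'Bob', 2), (12, 'Cy', 1);
""")

# Form 1: the filter is expressed as an IN sub-select.
subquery_form = """
    SELECT name FROM emp
    WHERE dept_id IN (SELECT id FROM dept WHERE city = 'Berlin')
    ORDER BY name
"""

# Form 2: the same filter expressed as a join.
# The two forms agree here because dept.id is a primary key, so the join
# cannot duplicate emp rows; on a non-unique column it could.
join_form = """
    SELECT emp.name FROM emp
    JOIN dept ON emp.dept_id = dept.id
    WHERE dept.city = 'Berlin'
    ORDER BY emp.name
"""

rows_subquery = conn.execute(subquery_form).fetchall()
rows_join = conn.execute(join_form).fetchall()
print(rows_subquery == rows_join)
```

Whether an engine runs the two forms equally fast is exactly the point under discussion; the sketch only shows that they return the same rows.
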
>>
>> Also, the focus of this WG is on mapping Relational Data to RDF, not 
>> on optimizing SPARQL
>> queries.  I think this is a very interesting area in which various 
>> vendors will compete and some
>> approaches and some databases will do better than others.
>> All the best, Ashok
>>
>>
>> Richard Cyganiak wrote:
>>> On 22 Mar 2010, at 06:03, Souripriya Das wrote:
>>>> So far I have not seen or heard any convincing arguments to 
>>>> establish that we need anything more than SQL and a "trivial" 
>>>> mapping language. Before going for an alternative, we must first 
>>>> establish the need for such an alternative.
>>>
>>> Fair enough.
>>>
>>> I have thought a lot about this in the previous weeks and have been 
>>> flip-flopping on the issue. This is why I haven't written up a 
>>> better explanation of the problem previously -- I'm not yet 100% 
>>> sure what my own opinion on the issue is.
>>>
>>> Anyway, I will explain it here as well as I can. This is going to be 
>>> quite long, sorry about that.
>>>
>>>
>>> 1. Why the SQL query based approach is nice
>>> -------------------------------------------
>>>
>>> First, I definitely see the attraction of the SQL query based 
>>> approach. I understand this approach as: leveraging the expressivity 
>>> of SQL to do as much of the mapping/transformation as possible, with 
>>> some simple glue around it that essentially turns each SQL result 
>>> record into a few triples according to some simple rules.
>>>
>>> It's attractive because the approach leverages existing SQL 
>>> knowledge of mapping authors; it maximises expressivity; it means we 
>>> don't have to specify a large chunk of the problem ourselves; it 
>>> produces syntactically compact mappings. So, purely from an 
>>> authoring point of view it is definitely a nicer approach than any 
>>> of the proposed alternatives (D2RQ, Virtuoso RDF views, R2O etc).
>>>
>>> In order to run SPARQL queries against such a mapped database, one 
>>> would use the "triple view" approach, as detailed in Juan's work. So 
>>> the SPARQL-to-SQL engine would create a single view in the DB which 
>>> consists of lots of unions and in the end contains one row for each 
>>> mapped triple, with subject, predicate and object. How to run SPARQL 
>>> queries against such a relational structure is well-known from prior 
>>> work on database-backed triple stores. The result is a humongous SQL 
>>> query over a humongous view definition, but as Juan has shown, good 
>>> SQL optimizers can simplify this into a reasonable query plan.
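
A minimal sketch of the "triple view" idea described above, using Python's built-in sqlite3 module. The `person` table, the `person/` subject prefix, and the predicate names are assumptions made for illustration; this is not taken from Juan's work or from any of the systems named in this thread.

```python
import sqlite3

# Hypothetical source table, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    INSERT INTO person VALUES (1, 'Ann', 'ann@example.org'),
                              (2, 'Bob', 'bob@example.org');

    -- One UNION ALL branch per mapped column; every row is one triple.
    CREATE VIEW triples AS
        SELECT 'person/' || id AS s, 'name' AS p, name AS o FROM person
        UNION ALL
        SELECT 'person/' || id AS s, 'email' AS p, email AS o FROM person;
""")

# SPARQL: SELECT ?x ?n ?e WHERE { ?x :name ?n . ?x :email ?e }
# Each triple pattern becomes one scan of the view; the shared variable ?x
# becomes a join condition on the subject column.
translated = """
    SELECT t1.s AS x, t1.o AS n, t2.o AS e
    FROM triples AS t1
    JOIN triples AS t2 ON t1.s = t2.s
    WHERE t1.p = 'name' AND t2.p = 'email'
    ORDER BY x
"""
triple_rows = conn.execute(translated).fetchall()
for row in triple_rows:
    print(row)
```

With only two mapped columns the view is small; with a realistic schema it grows by one branch per mapped column, and every triple pattern in the query scans all of it unless the optimizer prunes branches.
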
>>>
>>> So here is why I argue against this approach.
>>>
>>>
>>> 2. Why the SQL query based approach fails in some cases
>>> -------------------------------------------------------
>>>
>>> First, I assume read-only access to the database. I cannot create 
>>> custom views. So, to run SPARQL queries with the approach above, I'd 
>>> have to use sub-SELECTs rather than views, which in theory should 
>>> work just fine and should be an implementation detail.
>>>
>>> But second, I assume that we use the query optimizer of MySQL, which 
>>> is unable to simplify the humongous SQL query from the approach 
>>> described above into something that runs in acceptable runtime (as I 
>>> demonstrated in [1]).
>>>
>>> Now if you happen to work for Oracle then you might say, "well they 
>>> should just use a real database." We can all chuckle about that for 
>>> a minute and then get back to business. There are existing systems, 
>>> such as D2RQ, that, whatever their limitations, produce decent 
>>> performance on MySQL and other "lesser" database engines. This group 
>>> *has* to standardise on a solution that is implementable on such 
>>> engines.
>>>
>>> So, how do we get acceptable performance on MySQL and other "lesser" 
>>> RDBMS, if we cannot use the "triple view" or "triple subselect" 
>>> approach?
>>>
>>> Well, we cannot translate SPARQL queries into humongous SQL queries 
>>> and then rely on the DB engine to simplify it so it runs in a 
>>> reasonable time. We have to be smarter in the translation, and 
>>> create SQL queries that are reasonably optimised straight away. I 
>>> will not get into the details, which are complicated, but it means 
>>> we can no longer treat the mapping's SQL queries as opaque blobs of 
>>> SQL text that we can just pass to the DB without looking at them -- 
>>> we have to dive into the SQL queries that define the mapping, 
>>> analyse what they are doing, and take them apart.
>>>
>>>
>>> 3. How has this problem been solved in practice to date?
>>> --------------------------------------------------------
>>>
>>> Here is the "worse is better" approach to solve this problem: We can 
>>> ask the *mapping author* to do the work for us and decompose the SQL 
>>> query into simpler elements (join conditions, projection 
>>> expressions, selection conditions and so on) and explain how they 
>>> relate to each other through the structure of the mapping file. Then 
>>> the SPARQL-to-SQL translation engine can build the optimised SQL 
>>> query straight from these simpler SQL fragments. This is what is 
>>> done in the D2RQ mapping language (see [2]).
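
A rough sketch of the decomposed style described above. The `ClassMap` structure and `build_sql` function are hypothetical and are not the D2RQ syntax or API; they only illustrate how an engine could assemble one flat query when the mapping author supplies the fragments separately.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ClassMap:
    """Hypothetical mapping fragment set for one class of resources."""
    table: str                      # table the resources come from
    key: str                        # column that identifies a resource
    properties: Dict[str, str]      # predicate -> projection expression
    condition: Optional[str] = None  # optional selection condition


def build_sql(class_map: ClassMap, predicates: List[str]) -> str:
    """Assemble a single SELECT from the fragments for the given predicates.

    All predicates belong to the same ClassMap, so they can be projected
    from one scan of the table instead of one sub-select per predicate.
    """
    columns = [class_map.key] + [class_map.properties[p] for p in predicates]
    sql = "SELECT " + ", ".join(columns) + " FROM " + class_map.table
    if class_map.condition:
        sql += " WHERE " + class_map.condition
    return sql


# Made-up mapping for a made-up table.
person_map = ClassMap(
    table="person",
    key="person.id",
    properties={"name": "person.name", "email": "person.email"},
    condition="person.active = 1",
)

# Corresponds to the SPARQL patterns { ?x :name ?n . ?x :email ?e }.
sql = build_sql(person_map, ["name", "email"])
print(sql)
```

The translator never has to parse SQL here: it only concatenates fragments whose roles it already knows, which is what makes this style cheap to implement.
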
>>>
>>> It is noteworthy that, to my knowledge, *every* RDB2RDF system to 
>>> date that supports the evaluation of SPARQL queries over mapped 
>>> databases, and assumes read-only access to the database, has opted 
>>> for an approach similar to this: D2RQ, OpenLink Virtuoso, 
>>> SquirrelRDF, R2O. None of their mapping languages specify the 
>>> mapping using complete SQL queries; all languages decompose the 
>>> queries into small chunks.
>>>
>>> To the best of my knowledge, there is *no* existing implementation 
>>> that supports SPARQL over the mapped database, supports read-only 
>>> access, and uses a mapping language based on the SQL query approach. 
>>> There are implementations of the SQL query approach that allow RDF 
>>> dumps of a mapped database (e.g., D2R Map) or resource-based linked 
>>> data style access (e.g., Triplify). But supporting SPARQL queries 
>>> over the mapped database is a task that is a whole lot more difficult.
>>>
>>>
>>> 4. How can we save the SQL query based approach?
>>> ------------------------------------------------
>>>
>>> So AFAIK no one has implemented the SQL query approach to support 
>>> SPARQL queries over mapped databases. It doesn't necessarily follow 
>>> that it's impossible, or even a bad idea. Could we specify our 
>>> mappings using arbitrary SQL queries, then translate SPARQL queries 
>>> over those mappings to SQL, and still end up with reasonably 
>>> optimised SQL queries?
>>>
>>> If this is possible at reasonable implementation cost, then it would 
>>> be a great way forward.
>>>
>>> I can imagine two approaches.
>>>
>>> First, you could develop your own custom SQL optimizer that takes 
>>> the humongous SQL query resulting from the triple view approach and 
>>> optimizes it to make the DB engine happy. I assert without proof 
>>> that the implementation cost for this is prohibitive, especially 
>>> because one has to create a different SQL optimiser for each 
>>> imperfect database engine that one wants to support (because their 
>>> native optimisers have different weaknesses, and because their SQL 
>>> dialects differ).
>>>
>>> Second approach: Do not allow arbitrary SQL queries in the mapping 
>>> language, but only a restricted subset. Then write a SQL parser that 
>>> is just smart enough to chop these restricted SQL queries into their 
>>> elements (such as join conditions, projection expressions, selection 
>>> conditions and so on).
>>>
>>> So, while the existing implementations (D2RQ, Virtuoso, etc) ask the 
>>> mapping author to do the job of decomposing the query into simpler 
>>> elements as part of the process of writing a mapping, we would now 
>>> have a parser that does the same job -- its input is a restricted 
>>> SQL query and its output is those simpler elements.
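
A toy sketch of such a parser, assuming a very small restricted subset: a single `SELECT ... FROM ... WHERE ...` with comma-separated tables, conditions joined by `AND`, and no parentheses at all. The subset, the `decompose` function, and the example query are assumptions made for illustration; a real SQL dialect needs far more than a regular expression.

```python
import re

# Grammar of the assumed subset: SELECT <list> FROM <tables> [WHERE <conds>]
RESTRICTED_SELECT = re.compile(
    r"^\s*SELECT\s+(?P<projection>.+?)\s+FROM\s+(?P<tables>.+?)"
    r"(?:\s+WHERE\s+(?P<where>.+?))?\s*;?\s*$",
    re.IGNORECASE | re.DOTALL,
)

# A condition that equates two qualified columns is treated as a join.
JOIN_CONDITION = re.compile(r"^\w+\.\w+\s*=\s*\w+\.\w+$")


def decompose(sql):
    """Split a restricted SELECT into projections, tables, joins, selections."""
    match = RESTRICTED_SELECT.match(sql)
    # Parentheses would mean sub-selects or function calls: not in the subset.
    if match is None or "(" in sql:
        raise ValueError("query is outside the restricted subset")
    where = match.group("where") or ""
    conditions = [
        c.strip()
        for c in re.split(r"\s+AND\s+", where, flags=re.IGNORECASE)
        if c.strip()
    ]
    return {
        "projections": [p.strip() for p in match.group("projection").split(",")],
        "tables": [t.strip() for t in match.group("tables").split(",")],
        "joins": [c for c in conditions if JOIN_CONDITION.match(c)],
        "selections": [c for c in conditions if not JOIN_CONDITION.match(c)],
    }


parts = decompose(
    "SELECT emp.name, dept.city FROM emp, dept "
    "WHERE emp.dept_id = dept.id AND dept.city = 'Berlin'"
)
print(parts)
```

Even this toy version breaks on string literals containing ` AND ` or a comma, which hints at how quickly the parsing effort grows.
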
>>>
>>> In practice, this will not be as simple as it might sound. It 
>>> appears that one of the design goals of SQL was to make parser 
>>> implementation as difficult as possible. This is compounded by the 
>>> many differences between SQL dialects.
>>>
>>> Nevertheless, this approach seems promising, and it *might* be a way 
>>> of supporting SPARQL queries on MySQL and other "lesser" DB engines, 
>>> over a mapping language that uses the SQL query based approach.
>>>
>>>
>>> 5. Request for an existence proof
>>> ---------------------------------
>>>
>>> It seems that the proponents of the SQL query based approach fall 
>>> into two camps:
>>>
>>> 1. Those who plan to rely on their DB engine's great optimizer for 
>>> doing all the hard work, and don't care whether it works on other 
>>> databases
>>>
>>> 2. Those who have not really been hit by the practicalities of 
>>> implementing a SPARQL engine over such a mapping when no good SQL 
>>> optimizer is available
>>>
>>> Let me repeat that I believe that the SQL query based approach is 
>>> better than the alternatives on almost every scale. The only problem 
>>> is that it has not been shown that it can be implemented at 
>>> reasonable cost in the absence of an advanced SQL optimizer. My 
>>> concern is this: If the group standardises an approach that is only 
>>> implementable on Oracle and SQL Server, then the group has failed. I 
>>> hope that there is consensus on this question; if not, better bring 
>>> it to the table NOW.
>>>
>>>
>>> If there was any implementation that used the SQL query based 
>>> approach as a mapping language,
>>> parsed the mapping's SQL queries, and translated SPARQL queries into 
>>> SQL queries that are significantly simpler than the humongous SQL 
>>> queries produced by the "triple view" approach, then I'd be a whole 
>>> lot more confident that the SQL query based approach can be made to 
>>> work on databases such as MySQL.
>>>
>>> A good benchmark might be self-joins. Can you translate SPARQL 
>>> queries over the mapped DB into SQL queries that don't contain 
>>> self-joins (joining a table to itself on the PK)?
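
To make the self-join benchmark concrete, here is a small sketch using Python's built-in sqlite3 module. The `person` table is made up, and the sketch assumes that a NULL column value produces no triple; neither comes from this thread. The first query is what a naive per-pattern translation with sub-SELECTs yields; the second is the self-join-free query a smarter translator would aim for.

```python
import sqlite3

# Hypothetical table, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    INSERT INTO person VALUES (1, 'Ann', 'ann@example.org'),
                              (2, 'Bob', 'bob@example.org'),
                              (3, 'Cy',  NULL);
""")

# SPARQL: SELECT ?n ?e WHERE { ?x :name ?n . ?x :email ?e }
# Naive translation: one sub-SELECT per triple pattern, joined on the
# subject, which joins person to itself on its primary key.
naive = """
    SELECT n.o AS name, e.o AS email
    FROM (SELECT id AS s, name  AS o FROM person WHERE name  IS NOT NULL) AS n
    JOIN (SELECT id AS s, email AS o FROM person WHERE email IS NOT NULL) AS e
      ON n.s = e.s
    ORDER BY name
"""

# Target translation: both patterns draw their subject from the same table
# and key, so one scan suffices. The NULL checks preserve the assumption
# that a NULL value yields no triple.
simplified = """
    SELECT name, email FROM person
    WHERE name IS NOT NULL AND email IS NOT NULL
    ORDER BY name
"""

naive_rows = conn.execute(naive).fetchall()
simplified_rows = conn.execute(simplified).fetchall()
print(naive_rows == simplified_rows)
```

Getting from the first query to the second requires knowing that both sub-SELECTs read the same table and that the join column is its key, which is the analysis the translator would have to perform on the mapping's SQL.
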
>>>
>>> So, how would you solve this? Can you make the SQL based approach 
>>> work without an awesome SQL optimizer? What if you have to support 
>>> multiple SQL dialects?
>>>
>>> Best,
>>> Richard
>>>
>>>
>>> [1] http://www.w3.org/2001/sw/rdb2rdf/wiki/PotentialSQLIssues
>>> [2] http://www4.wiwiss.fu-berlin.de/bizer/D2RQ/spec/#specification
>>>
>>>
>>>
>>>>
>>>> Thanks,
>>>> - Souri.
>>>>
>>>
>>>
>
>
Received on Tuesday, 23 March 2010 14:34:24 UTC