Re: RDB2RDF mapping: Do we really need any alternative to use of SQL queries with conventions and a "trivial" mapping language? from ashok malhotra on 2010-03-22 (public-rdb2rdf-wg@w3.org from March 2010)

From: ashok malhotra <ashok.malhotra@oracle.com>
Date: Mon, 22 Mar 2010 15:03:36 -0700
To: Richard Cyganiak <richard@cyganiak.de>
CC: Souripriya Das <souripriya.das@oracle.com>, Public-Rdb2rdf-Wg <public-rdb2rdf-wg@w3.org>
Message-ID: <4BA7E938.10209@oracle.com>
I think it's a poor argument to say that because MySQL does a poor job 
with sub-selects
we should abandon the SQL-view approach.  As I said in my earlier note,
please take a look at 
http://dev.mysql.com/doc/refman/5.1/en/rewriting-subqueries.html
which explains how to rewrite subqueries as joins.
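
As a minimal sketch of that kind of rewrite: the emp/dept tables and
their rows are made up for illustration, and SQLite (via Python) stands
in for MySQL. An IN sub-select and the join form return the same rows
here because dept.id is a primary key, so the join cannot duplicate
employees.

```python
import sqlite3


def build_demo_db():
    """Create a throwaway in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dept (id INTEGER PRIMARY KEY, city TEXT);
        CREATE TABLE emp (id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER);
        INSERT INTO dept VALUES (1, 'Berlin'), (2, 'Boston');
        INSERT INTO emp VALUES (1, 'Ann', 1), (2, 'Bob', 2), (3, 'Cy', 1);
    """)
    return conn


# Sub-select form: the kind of query a weak optimizer may handle badly.
SUBQUERY_FORM = """
    SELECT name FROM emp
    WHERE dept_id IN (SELECT id FROM dept WHERE city = 'Berlin')
"""

# Join form: equivalent because dept.id is unique.
JOIN_FORM = """
    SELECT e.name FROM emp e
    JOIN dept d ON e.dept_id = d.id
    WHERE d.city = 'Berlin'
"""


def run(conn, sql):
    """Return the first column of every result row, sorted for comparison."""
    return sorted(row[0] for row in conn.execute(sql))
```

Running both strings through run() on the demo database should give
the same list of names.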

Also, the focus of this WG is on mapping Relational Data to RDF, not on 
optimizing SPARQL
queries.  I think this is a very interesting area in which various 
vendors will compete and some
approaches and some databases will do better than others.
All the best, Ashok


Richard Cyganiak wrote:
> On 22 Mar 2010, at 06:03, Souripriya Das wrote:
>> So far I have not seen or heard any convincing arguments to establish 
>> that we need anything more than SQL and a "trivial" mapping language. 
>> Before going for an alternative, we must first establish the need for 
>> such an alternative.
>
> Fair enough.
>
> I have thought a lot about this in the previous weeks and have been 
> flip-flopping on the issue. This is why I haven't written up a better 
> explanation of the problem previously -- I'm not yet 100% sure what my 
> own opinion on the issue is.
>
> Anyway, I will explain it here as well as I can. This is going to be 
> quite long, sorry about that.
>
>
> 1. Why the SQL query based approach is nice
> -------------------------------------------
>
> First, I definitely see the attraction of the SQL query based 
> approach. I understand this approach as: leveraging the expressivity 
> of SQL to do as much of the mapping/transformation as possible, with 
> some simple glue around it that essentially turns each SQL result 
> record into a few triples according to some simple rules.
>
> It's attractive because the approach leverages existing SQL knowledge 
> of mapping authors; it maximises expressivity; it means we don't have 
> to specify a large chunk of the problem ourselves; it produces 
> syntactically compact mappings. So, purely from an authoring point of 
> view it is definitely a nicer approach than any of the proposed 
> alternatives (D2RQ, Virtuoso RDF views, R2O etc).
>
> In order to run SPARQL queries against such a mapped database, one 
> would use the “triple view” approach, as detailed in Juan's work. So 
> the SPARQL-to-SQL engine would create a single view in the DB which 
> consists of lots of unions and in the end contains one row for each 
> mapped triple, with subject, predicate and object. How to run SPARQL 
> queries against such a relational structure is well-known from prior 
> work on database-backed triple stores. The result is a humongous SQL 
> query over a humongous view definition, but as Juan has shown, good 
> SQL optimizers can simplify this into a reasonable query plan.
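
To make the "triple view" idea concrete, here is a small sketch. The
person table, the ex:name and ex:city predicates, and the 'person/'
subject prefix are all assumptions for illustration, and SQLite stands
in for a real server. The view has one UNION ALL branch per mapped
column, and the SPARQL pattern { ?s ex:name ?n . ?s ex:city ?c }
becomes a join of the view with itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    INSERT INTO person VALUES (1, 'Ann', 'Berlin'), (2, 'Bob', 'Boston');

    -- One row per mapped triple: subject, predicate, object.
    CREATE VIEW triples AS
        SELECT 'person/' || id AS s, 'ex:name' AS p, name AS o FROM person
        UNION ALL
        SELECT 'person/' || id AS s, 'ex:city' AS p, city AS o FROM person;
""")

# Each triple pattern gets its own alias of the view; the shared
# variable ?s turns into the join condition.
PATTERN_SQL = """
    SELECT t1.o AS name, t2.o AS city
    FROM triples t1
    JOIN triples t2 ON t1.s = t2.s
    WHERE t1.p = 'ex:name' AND t2.p = 'ex:city'
"""

results = sorted(conn.execute(PATTERN_SQL).fetchall())
```

With many tables and columns the view grows by one branch per mapped
column, and every triple pattern in a query adds another copy of it,
which is where the very large SQL comes from.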
>
> So here is why I argue against this approach.
>
>
> 2. Why the SQL query based approach fails in some cases
> -------------------------------------------------------
>
> First, I assume read-only access to the database. I cannot create 
> custom views. So, to run SPARQL queries with the approach above, I'd 
> have to use sub-SELECTs rather than views, which in theory should work 
> just fine and should be an implementation detail.
>
> But second, I assume that we use the query optimizer of MySQL, which 
> is unable to simplify the humongous SQL query from the approach 
> described above into something that runs in acceptable runtime (as I 
> demonstrated in [1]).
>
> Now if you happen to work for Oracle then you might say, “well they 
> should just use a real database.” We can all chuckle about that for a 
> minute and then get back to business. There are existing systems, such 
> as D2RQ, that, whatever their limitations, produce decent performance 
> on MySQL and other “lesser” database engines. This group *has* to 
> standardise on a solution that is implementable on such engines.
>
> So, how do we get acceptable performance on MySQL and other “lesser” 
> RDBMS, if we cannot use the “triple view” or “triple subselect” approach?
>
> Well, we cannot translate SPARQL queries into humongous SQL queries 
> and then rely on the DB engine to simplify it so it runs in a 
> reasonable time. We have to be smarter in the translation, and create 
> SQL queries that are reasonably optimised straight away. I will not 
> get into the details, which are complicated, but it means we can no 
> longer treat the mapping's SQL queries as opaque blobs of SQL text 
> that we can just pass to the DB without looking at them -- we have to 
> dive into the SQL queries that define the mapping, analyse what they 
> are doing, and take them apart.
>
>
> 3. How has this problem been solved in practice to date?
> --------------------------------------------------------
>
> Here is the “worse is better” approach to solve this problem: We can 
> ask the *mapping author* to do the work for us and decompose the SQL 
> query into simpler elements (join conditions, projection expressions, 
> selection conditions and so on) and explain how they relate to each 
> other through the structure of the mapping file. Then the 
> SPARQL-to-SQL translation engine can build the optimised SQL query 
> straight from these simpler SQL fragments. This is what is done in the 
> D2RQ mapping language (see [2]).
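
A rough sketch of what a decomposed mapping buys the translator. The
PropertyBridge structure, the MAPPING table and translate_star() are
hypothetical and are not D2RQ syntax; they only illustrate that once
the engine knows which table and column back each predicate, it can
emit compact SQL directly instead of relying on the optimizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PropertyBridge:
    """Hypothetical mapping fragment: one predicate backed by one column."""
    predicate: str
    table: str
    column: str


# Hypothetical decomposed mapping for a single 'person' table.
MAPPING = {
    "ex:name": PropertyBridge("ex:name", "person", "name"),
    "ex:city": PropertyBridge("ex:city", "person", "city"),
}


def translate_star(predicates):
    """Translate a star-shaped pattern (one shared subject) to SQL.

    Predicates backed by the same table become projections of a single
    scan of that table, so no join is generated.
    """
    bridges = [MAPPING[p] for p in predicates]
    tables = {b.table for b in bridges}
    if len(tables) != 1:
        raise NotImplementedError("this sketch handles one table only")
    table = tables.pop()
    columns = ", ".join(b.column for b in bridges)
    # A NULL column yields no triple, so such rows cannot match.
    not_null = " AND ".join(f"{b.column} IS NOT NULL" for b in bridges)
    return f"SELECT {columns} FROM {table} WHERE {not_null}"
```

For example, translate_star(["ex:name", "ex:city"]) builds one SELECT
over person with no join at all.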
>
> It is noteworthy that, to my knowledge, *every* RDB2RDF system to date 
> that supports the evaluation of SPARQL queries over mapped databases, 
> and assumes read-only access to the database, has opted for an 
> approach similar to this: D2RQ, OpenLink Virtuoso, SquirrelRDF, R2O. 
> None of their mapping languages specify the mapping using complete SQL 
> queries; all languages decompose the queries into small chunks.
>
> To the best of my knowledge, there is *no* existing implementation 
> that supports SPARQL over the mapped database, supports read-only 
> access, and uses a mapping language based on the SQL query approach. 
> There are implementations of the SQL query approach that allow RDF 
> dumps of a mapped database (e.g., D2R Map) or resource-based linked 
> data style access (e.g., Triplify). But supporting SPARQL queries over 
> the mapped database is a task that is a whole lot more difficult.
>
>
> 4. How can we save the SQL query based approach?
> ------------------------------------------------
>
> So AFAIK no one has implemented the SQL query approach to support 
> SPARQL queries over mapped databases. It doesn't necessarily follow 
> that it's impossible, or even a bad idea. Could we specify our 
> mappings using arbitrary SQL queries, then translate SPARQL queries 
> over those mappings to SQL, and still end up with reasonably optimised 
> SQL queries?
>
> If this is possible at reasonable implementation cost, then it would 
> be a great way forward.
>
> I can imagine two approaches.
>
> First, you could develop your own custom SQL optimizer that takes the 
> humongous SQL query resulting from the triple view approach and 
> optimizes it to make the DB engine happy. I assert without proof that 
> the implementation cost for this is prohibitive, especially because 
> one has to create a different SQL optimiser for each imperfect 
> database engine that one wants to support (because their native 
> optimisers have different weaknesses, and because their SQL dialects 
> differ).
>
> Second approach: Do not allow arbitrary SQL queries in the mapping 
> language, but only a restricted subset. Then write a SQL parser that 
> is just smart enough to chop these restricted SQL queries into their 
> elements (such as join conditions, projection expressions, selection 
> conditions and so on).
>
> So, while the existing implementations (D2RQ, Virtuoso, etc) ask the 
> mapping author to do the job of decomposing the query into simpler 
> elements as part of the process of writing a mapping, we would now 
> have a parser that does the same job -- its input is a restricted SQL 
> query and its output is those simpler elements.
>
> In practice, this will not be as simple as it might sound. It appears 
> that one of the design goals of SQL was to make parser implementation 
> as difficult as possible. This is compounded by the many differences 
> between SQL dialects.
>
> Nevertheless, this approach seems promising, and it *might* be a way 
> of supporting SPARQL queries on MySQL and other “lesser” DB engines, 
> over a mapping language that uses the SQL query based approach.
>
>
> 5. Request for an existence proof
> ---------------------------------
>
> It seems that the proponents of the SQL query based approach fall into 
> two camps:
>
> 1. Those who plan to rely on their DB engine's great optimizer for 
> doing all the hard work, and don't care whether it works on other 
> databases
>
> 2. Those who have not really been hit by the practicalities of 
> implementing a SPARQL engine over such a mapping when no good SQL 
> optimizer is available
>
> Let me repeat that I believe that the SQL query based approach is 
> better than the alternatives on almost every scale. The only problem 
> is that it has not been shown that it can be implemented at reasonable 
> cost in the absence of an advanced SQL optimizer. My concern is this: 
> If the group standardises an approach that is only implementable on 
> Oracle and SQL Server, then the group has failed. I hope that there is 
> consensus on this question; if not, better bring it to the table NOW.
>
>
> If there were any implementation that used the SQL query based 
> approach as a mapping language, parsed the mapping's SQL queries, and 
> translated SPARQL queries into SQL queries that are significantly 
> simpler than the humongous SQL queries produced by the “triple view” 
> approach, then I'd be a whole lot more confident that the SQL query 
> based approach can be made to work on databases such as MySQL.
>
> A good benchmark might be self-joins. Can you translate SPARQL queries 
> over the mapped DB into SQL queries that don't contain self-joins 
> (joining a table to itself on the PK)?
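
To illustrate the self-join benchmark with a sketch (the person table
and its rows are assumptions, and SQLite stands in for MySQL): a naive
translation gives each triple pattern its own alias of the same table
and joins the aliases on the primary key, while the desired output
reads the table once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    INSERT INTO person VALUES (1, 'Ann', 'Berlin'), (2, 'Bob', 'Boston');
""")

# Naive translation: one alias per triple pattern, joined on the PK.
WITH_SELF_JOIN = """
    SELECT p1.name, p2.city
    FROM person p1 JOIN person p2 ON p1.id = p2.id
    WHERE p1.name IS NOT NULL AND p2.city IS NOT NULL
"""

# Target translation: the same answer from a single scan.
WITHOUT_SELF_JOIN = """
    SELECT name, city FROM person
    WHERE name IS NOT NULL AND city IS NOT NULL
"""


def rows(sql):
    """Run a query and return its rows sorted for comparison."""
    return sorted(conn.execute(sql).fetchall())
```

The two queries are equivalent because a join on the primary key pairs
every row only with itself; the question is whether the translator can
produce the second form without help from the optimizer.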
>
> So, how would you solve this? Can you make the SQL based approach work 
> without an awesome SQL optimizer? What if you have to support multiple 
> SQL dialects?
>
> Best,
> Richard
>
>
> [1] http://www.w3.org/2001/sw/rdb2rdf/wiki/PotentialSQLIssues
> [2] http://www4.wiwiss.fu-berlin.de/bizer/D2RQ/spec/#specification
>
>
>
>>
>> Thanks,
>> - Souri.
>>
>
>
Received on Monday, 22 March 2010 22:04:52 UTC