[Fwd: [Fwd: Re: RDB2RDF mapping: Do we really need any alternative to use of SQL queries with conventions and a "trivial" mapping language?]]

To address some of the questions raised over the last couple of days 
about query translation, we have added a Part 3 showing a SPARQL graph 
pattern and its corresponding SQL translation, which uses the 
SQLdefString queries in a WITH clause.

    * Example of SQL-Query based Approach
      <http://www.w3.org/2001/sw/rdb2rdf/wiki/Example_of_SQL-Query_based_Approach>

          o Part 1: Schema => RDB Schema, RDB2RDF Mapping, and generated
            RDF Schema
            <http://www.w3.org/2001/sw/rdb2rdf/wiki/Example_of_SQL-Query_based_Approach_%28Part_1:_Schema%29:_RDB_Schema%2C_RDB2RDF_Mapping%2C_and_generated_RDF_Schema>

          o Part 2: Data => RDB (relational) Data and corresponding
            (virtual) RDF Graphs
            <http://www.w3.org/2001/sw/rdb2rdf/wiki/Example_of_SQL-Query_based_Approach_%28Part_2:_Data%29:_RDB_%28relational%29_Data_and_corresponding_%28virtual%29_RDF_Graphs>

          o Part 3: Query => SPARQL Query and its SQL translation
            <http://www.w3.org/2001/sw/rdb2rdf/wiki/Example_of_SQL-Query_based_Approach_%28Part_3:_Query%29:_SPARQL_Query_and_its_SQL_translation>


Thanks,
Souri & Seema

-------- Original Message --------
Subject: 	[Fwd: Re: RDB2RDF mapping: Do we really need any alternative 
to use of SQL queries with conventions and a "trivial" mapping language?]
Resent-Date: 	Tue, 23 Mar 2010 14:20:18 +0000
Resent-From: 	public-rdb2rdf-wg@w3.org
Date: 	Tue, 23 Mar 2010 10:19:38 -0400
From: 	Souri Das <Souripriya.Das@oracle.com>
To: 	RDB2RDF WG <public-rdb2rdf-wg@w3.org>



Forgot to cc the WG ...

-------- Original Message --------
Subject: 	Re: RDB2RDF mapping: Do we really need any alternative to use 
of SQL queries with conventions and a "trivial" mapping language?
Date: 	Tue, 23 Mar 2010 10:07:07 -0400
From: 	Souri Das <Souripriya.Das@Oracle.com>
To: 	Richard Cyganiak <richard@cyganiak.de>
References: 	<2e02e1af-40b2-4b2f-aec6-1f8057dafc8e@default> 
<82E10043-E8CC-4DDA-8690-95312F626330@cyganiak.de>



Richard,

Here are my answers to your questions:

> ... if you wanted to use a query like 'select ... from emp where ...' 
instead of just the table emp, how would your implementation handle this?

> Q1>> Would it create a view for 'select ... from emp where ...'?
No.

> Q2>> Would it repeat the 'select ... from emp where ...' several times 
as sub-selects?
No. The generated SQL uses a WITH clause:
WITH <subq> AS (select ... from emp where ...) SELECT ... FROM <subq> t 
... WHERE ...
The optimizer works fine with the WITH clause. That is, it is still able 
to see the underlying tables and the relevant indexes for use in the 
execution plan.
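
To make the shape concrete, here is a purely illustrative sketch (the 
WHERE DEPTNO = 10 restriction and the column list are assumptions for 
the example, not part of any actual mapping): if the SQLdefString 
restricted EMP to one department, the manager/employee BGP quoted later 
in this thread could translate to:

WITH subq AS (SELECT EMPNO, ENAME, MGR FROM EMP WHERE DEPTNO = 10)
SELECT T0.ENAME AS MNAME, T1.ENAME AS ENAME, T0.EMPNO AS ENO
FROM subq T0, subq T1
WHERE T0.EMPNO = T1.MGR

The SQLdefString appears exactly once, in the WITH clause, and subq is 
referenced twice; it is never repeated as an inline sub-select.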

> Q3>> Would it parse the 'select ... from emp where ...' into its 
individual projection clauses and selection conditions in order to 
flatten the subqueries and eliminate unnecessary joins?
In the prototype we do not parse the SQLdefString (i.e., 'select ... 
from emp where ...'). However, the SPARQL-to-SQL translation analyzes 
the SPARQL graph pattern and generates joins only when necessary (as I 
showed in the simple example, where, instead of generating a 4-way 
join, we are able to generate a 2-way self-join of the EMP table).

> The point of my long message was that the first isn't an option when 
you can't modify the database, the second isn't an option when you're on 
a “lesser” database, and the third to my knowledge has not been shown to 
be implementable to date. Would be great to hear that you have 
implemented the third and to learn some details.

Thanks,
- Souri.

Richard Cyganiak wrote:
> Souri,
>
> On 23 Mar 2010, at 03:19, Souripriya Das wrote:
>> This is from an old prototype where the table emp (not a query like 
>> 'select ... from emp') has been mapped to RDF.
>
> Thanks, looks good so far. Now if you wanted to use a query like 
> 'select ... from emp where ...' instead of just the table emp, how 
> would your implementation handle this?
>
> Would it create a view for 'select ... from emp where ...'?
>
> Would it repeat the 'select ... from emp where ...' several times as 
> sub-selects?
>
> Would it parse the 'select ... from emp where ...' into its individual 
> projection clauses and selection conditions in order to flatten the 
> subqueries and eliminate unnecessary joins?
>
> The point of my long message was that the first isn't an option when 
> you can't modify the database, the second isn't an option when you're 
> on a “lesser” database, and the third to my knowledge has not been 
> shown to be implementable to date. Would be great to hear that you 
> have implemented the third and to learn some details.
>
> Thanks,
> Richard
>
>
>>
>> The following graph-pattern (BGP)
>>
>> { ?x  <http://im/USER1/EMP/EMPNO>  ?eno .
>>   ?x  <http://im/USER1/EMP/ENAME>  ?mname .
>>   ?y  <http://im/USER1/EMP/MGR>    ?eno .
>>   ?y  <http://im/USER1/EMP/ENAME>  ?ename }
>>
>> was translated to
>>
>> SELECT
>> ...
>> FROM (
>> SELECT   T0.ENAME as MNAME,T1.ENAME as ENAME,T0.EMPNO as ENO
>> FROM EMP T0, EMP T1
>> WHERE T0.EMPNO = T1.MGR
>> ) R
>> WHERE 1=1
>>
>> Thanks,
>> - Souri.
>>
>> ----- Original Message -----
>> From: richard@cyganiak.de
>> To: SOURIPRIYA.DAS@oracle.com
>> Cc: public-rdb2rdf-wg@w3.org
>> Sent: Monday, March 22, 2010 6:59:00 PM GMT -05:00 US/Canada Eastern
>> Subject: Re: RDB2RDF mapping: Do we really need any alternative to 
>> use of SQL queries with conventions and a "trivial" mapping language?
>>
>> Souri,
>>
>> On 22 Mar 2010, at 21:44, Souripriya Das wrote:
>>> What you said about "triple view" makes sense. However, as I have
>>> indicated a few times during the telcons and also during our
>>> presentation:
>>>  we do not use the "triple view" approach with SQL.
>>
>> I'm glad to hear that ;-)
>>
>> I think you misread some of my message because I didn't sufficiently
>> clarify my terms. We have to be careful about whether we are talking
>> about approaches to writing down RDB-to-RDF mappings, or about
>> approaches to implementing SPARQL-to-SQL translations over such
>> mappings. These are almost orthogonal issues.
>>
>> When I talk about UltraWrap's “triple view” approach, I am talking
>> about an approach to implementing SPARQL-to-SQL translation. After
>> all, that's what UltraWrap is all about -- it doesn't have a mapping
>> language at all AFAICT.
>>
>> One can generate a triple view from any RDB-to-RDF mapping
>> specification, no matter what language the mapping is expressed in
>> (well, if it doesn't exceed the expressivity of SQL).
>>
>>> Instead we use the "relational query" approach. Thus the resulting
>>> logical tables are not restricted to having only three (sub/pred/
>>> obj) columns. Each such logical table can have as many columns as
>>> needed.
>>
>> I do understand how you are writing down RDB-to-RDF mappings for
>> your system. But I do not know how your system implements SPARQL-to-
>> SQL translation.
>>
>>> (Thus, our approach is very similar to Triplify's approach and quite
>>> different from UltraWrap's approach.)
>>
>> Do you mean, similar/different in terms of the mapping specification
>> or in terms of the implementation of SPARQL-to-SQL rewriting? I mean,
>> UltraWrap has no mapping specification, and Triplify has no SPARQL-to-
>> SQL rewriting, so I'm confused by your statement.
>>
>>> With our approach, SPARQL-to-SQL translation generates SQL that is
>>> pretty concise -- definitely not "humongous" -- and also fairly easy
>>> to optimize (that can be accomplished even by "lesser" SQL
>>> optimizers :-)).
>>
>> Can I see some documentation about your approach to SPARQL-to-SQL
>> rewriting?
>>
>> Especially: Do you create views from the SQL queries in the mapping?
>>
>> And do you parse the SQL queries in the mapping or does your system
>> treat them as opaque?
>>
>> Is your system able to generate SQL queries that do not have self-
>> joins over the “logical tables”?
>>
>> Cheers,
>> Richard
>>
>>
>>
>>>
>>> Thanks,
>>> - Souri.
>>>
>>> ----- Original Message -----
>>> From: richard@cyganiak.de
>>> To: souripriya.das@oracle.com
>>> Cc: public-rdb2rdf-wg@w3.org
>>> Sent: Monday, March 22, 2010 1:40:09 PM GMT -05:00 US/Canada Eastern
>>> Subject: Re: RDB2RDF mapping: Do we really need any alternative to
>>> use of SQL  queries with conventions and a "trivial" mapping language?
>>>
>>> On 22 Mar 2010, at 06:03, Souripriya Das wrote:
>>>> So far I have not seen or heard any convincing arguments to
>>>> establish that we need anything more than SQL and a "trivial"
>>>> mapping language. Before going for an alternative, we must first
>>>> establish the need for such an alternative.
>>>
>>> Fair enough.
>>>
>>> I have thought a lot about this in the previous weeks and have been
>>> flip-flopping on the issue. This is why I haven't written up a better
>>> explanation of the problem previously -- I'm not yet 100% sure what my
>>> own opinion on the issue is.
>>>
>>> Anyway, I will explain it here as well as I can. This is going to be
>>> quite long, sorry about that.
>>>
>>>
>>> 1. Why the SQL query based approach is nice
>>> -------------------------------------------
>>>
>>> First, I definitely see the attraction of the SQL query based
>>> approach. I understand this approach as: leveraging the expressivity
>>> of SQL to do as much of the mapping/transformation as possible, with
>>> some simple glue around it that essentially turns each SQL result
>>> record into a few triples according to some simple rules.
>>>
>>> It's attractive because the approach leverages existing SQL knowledge
>>> of mapping authors; it maximises expressivity; it means we don't have
>>> to specify a large chunk of the problem ourselves; it produces
>>> syntactically compact mappings. So, purely from an authoring point of
>>> view it is definitely a nicer approach than any of the proposed
>>> alternatives (D2RQ, Virtuoso RDF views, R2O etc).
>>>
>>> In order to run SPARQL queries against such a mapped database, one
>>> would use the “triple view” approach, as detailed in Juan's work. So
>>> the SPARQL-to-SQL engine would create a single view in the DB which
>>> consists of lots of unions and in the end contains one row for each
>>> mapped triple, with subject, predicate and object. How to run SPARQL
>>> queries against such a relational structure is well-known from prior
>>> work on database-backed triple stores. The result is a humongous SQL
>>> query over a humongous view definition, but as Juan has shown, good
>>> SQL optimizers can simplify this into a reasonable query plan.
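>>>
>>> (As a purely illustrative sketch -- the view name, the URI
>>> construction, the string-concatenation syntax and the casts are
>>> assumptions, not taken from any actual implementation -- such a
>>> triple view over the EMP table used elsewhere in this thread could
>>> look like:
>>>
>>> CREATE VIEW triples (s, p, o) AS
>>>   SELECT 'http://im/USER1/EMP/' || empno,
>>>          'http://im/USER1/EMP/EMPNO', TO_CHAR(empno) FROM emp
>>>   UNION ALL
>>>   SELECT 'http://im/USER1/EMP/' || empno,
>>>          'http://im/USER1/EMP/ENAME', ename FROM emp
>>>   UNION ALL
>>>   SELECT 'http://im/USER1/EMP/' || empno,
>>>          'http://im/USER1/EMP/MGR', TO_CHAR(mgr) FROM emp;
>>>
>>> A SPARQL BGP with n triple patterns then becomes an n-way self-join
>>> over this view, which is exactly the query shape the optimizer has
>>> to untangle.)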
>>>
>>> So here is why I argue against this approach.
>>>
>>>
>>> 2. Why the SQL query based approach fails in some cases
>>> -------------------------------------------------------
>>>
>>> First, I assume read-only access to the database. I cannot create
>>> custom views. So, to run SPARQL queries with the approach above, I'd
>>> have to use sub-SELECTs rather than views, which in theory should work
>>> just fine and should be an implementation detail.
>>>
>>> But second, I assume that we use the query optimizer of MySQL, which
>>> is unable to simplify the humongous SQL query from the approach
>>> described above into something that runs in acceptable runtime (as I
>>> demonstrated in [1]).
>>>
>>> Now if you happen to work for Oracle then you might say, “well they
>>> should just use a real database.” We can all chuckle about that for a
>>> minute and then get back to business. There are existing systems, such
>>> as D2RQ, that, whatever their limitations, produce decent performance
>>> on MySQL and other “lesser” database engines. This group *has* to
>>> standardise on a solution that is implementable on such engines.
>>>
>>> So, how do we get acceptable performance on MySQL and other “lesser”
>>> RDBMS, if we cannot use the “triple view” or “triple subselect”
>>> approach?
>>>
>>> Well, we cannot translate SPARQL queries into humongous SQL queries
>>> and then rely on the DB engine to simplify it so it runs in a
>>> reasonable time. We have to be smarter in the translation, and create
>>> SQL queries that are reasonably optimised straight away. I will not
>>> get into the details, which are complicated, but it means we can no
>>> longer treat the mapping's SQL queries as opaque blobs of SQL text
>>> that we can just pass to the DB without looking at them -- we have to
>>> dive into the SQL queries that define the mapping, analyse what they
>>> are doing, and take them apart.
>>>
>>>
>>> 3. How has this problem been solved in practice to date?
>>> --------------------------------------------------------
>>>
>>> Here is the “worse is better” approach to solve this problem: We can
>>> ask the *mapping author* to do the work for us and decompose the SQL
>>> query into simpler elements (join conditions, projection expressions,
>>> selection conditions and so on) and explain how they relate to each
>>> other through the structure of the mapping file. Then the SPARQL-to-
>>> SQL translation engine can build the optimised SQL query straight from
>>> these simpler SQL fragments. This is what is done in the D2RQ mapping
>>> language (see [2]).
>>>
>>> It is noteworthy that, to my knowledge, *every* RDB2RDF system to date
>>> that supports the evaluation of SPARQL queries over mapped databases,
>>> and assumes read-only access to the database, has opted for an
>>> approach similar to this: D2RQ, OpenLink Virtuoso, SquirrelRDF, R2O.
>>> None of their mapping languages specify the mapping using complete SQL
>>> queries; all languages decompose the queries into small chunks.
>>>
>>> To the best of my knowledge, there is *no* existing implementation
>>> that supports SPARQL over the mapped database, supports read-only
>>> access, and uses a mapping language based on the SQL query approach.
>>> There are implementations of the SQL query approach that allow RDF
>>> dumps of a mapped database (e.g., D2R Map) or resource-based linked
>>> data style access (e.g., Triplify). But supporting SPARQL queries over
>>> the mapped database is a task that is a whole lot more difficult.
>>>
>>>
>>> 4. How can we save the SQL query based approach?
>>> ------------------------------------------------
>>>
>>> So AFAIK no one has implemented the SQL query approach to support
>>> SPARQL queries over mapped databases. It doesn't necessarily follow
>>> that it's impossible, or even a bad idea. Could we specify our
>>> mappings using arbitrary SQL queries, then translate SPARQL queries
>>> over those mappings to SQL, and still end up with reasonably optimised
>>> SQL queries?
>>>
>>> If this is possible at reasonable implementation cost, then it would
>>> be a great way forward.
>>>
>>> I can imagine two approaches.
>>>
>>> First, you could develop your own custom SQL optimizer that takes the
>>> humongous SQL query resulting from the triple view approach and
>>> optimizes it to make the DB engine happy. I assert without proof that
>>> the implementation cost for this is prohibitive, especially because
>>> one has to create a different SQL optimiser for each imperfect
>>> database engine that one wants to support (because their native
>>> optimisers have different weaknesses, and because their SQL dialects
>>> differ).
>>>
>>> Second approach: Do not allow arbitrary SQL queries in the mapping
>>> language, but only a restricted subset. Then write a SQL parser that
>>> is just smart enough to chop these restricted SQL queries into their
>>> elements (such as join conditions, projection expressions, selection
>>> conditions and so on).
>>>
>>> So, while the existing implementations (D2RQ, Virtuoso, etc) ask the
>>> mapping author to do the job of decomposing the query into simpler
>>> elements as part of the process of writing a mapping, we would now
>>> have a parser that does the same job -- its input is a restricted SQL
>>> query and its output are those simpler elements.
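>>>
>>> (To make this concrete with a purely illustrative sketch -- the
>>> table and column names below are only examples, not from any actual
>>> mapping -- such a parser would take a restricted mapping query like
>>>
>>>   SELECT e.empno, e.ename, d.dname
>>>   FROM emp e JOIN dept d ON e.deptno = d.deptno
>>>   WHERE e.sal > 1000
>>>
>>> and emit the same elements a D2RQ author writes down by hand: the
>>> projection expressions (e.empno, e.ename, d.dname), the source
>>> tables (emp, dept), the join condition (e.deptno = d.deptno), and
>>> the selection condition (e.sal > 1000).)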
>>>
>>> In practice, this will not be as simple as it might sound. It appears
>>> that one of the design goals of SQL was to make parser implementation
>>> as difficult as possible. This is compounded by the many differences
>>> between SQL dialects.
>>>
>>> Nevertheless, this approach seems promising, and it *might* be a way
>>> of supporting SPARQL queries on MySQL and other “lesser” DB engines,
>>> over a mapping language that uses the SQL query based approach.
>>>
>>>
>>> 5. Request for an existence proof
>>> ---------------------------------
>>>
>>> It seems that the proponents of the SQL query based approach fall into
>>> two camps:
>>>
>>> 1. Those who plan to rely on their DB engine's great optimizer for
>>> doing all the hard work, and don't care whether it works on other
>>> databases
>>>
>>> 2. Those who have not really been hit by the practicalities of
>>> implementing a SPARQL engine over such a mapping when no good SQL
>>> optimizer is available
>>>
>>> Let me repeat that I believe that the SQL query based approach is
>>> better than the alternatives on almost every scale. The only problem
>>> is that it has not been shown that it can be implemented at reasonable
>>> cost in the absence of an advanced SQL optimizer. My concern is this:
>>> If the group standardises an approach that is only implementable on
>>> Oracle and SQL Server, then the group has failed. I hope that there is
>>> consensus on this question; if not, better to put it on the table NOW.
>>>
>>>
>>> If there were any implementation that used the SQL query based
>>> approach as its mapping language, parsed the mapping's SQL queries,
>>> and translated SPARQL queries into SQL queries that are significantly
>>> simpler than the humongous SQL queries produced by the “triple view”
>>> approach, then I'd be a whole lot more confident that the SQL query
>>> based approach can be made to work on databases such as MySQL.
>>>
>>> A good benchmark might be self-joins. Can you translate SPARQL queries
>>> over the mapped DB into SQL queries that don't contain self-joins
>>> (joining a table to itself on the PK)?
>>>
>>> So, how would you solve this? Can you make the SQL based approach work
>>> without an awesome SQL optimizer? What if you have to support multiple
>>> SQL dialects?
>>>
>>> Best,
>>> Richard
>>>
>>>
>>> [1] http://www.w3.org/2001/sw/rdb2rdf/wiki/PotentialSQLIssues
>>> [2] http://www4.wiwiss.fu-berlin.de/bizer/D2RQ/spec/#specification
>>>
>>>
>>>
>>>>
>>>> Thanks,
>>>> - Souri.
>>>>
>>>
>>>
>>
>
>
