Re: RDB2RDF mapping: Do we really need any alternative to use of SQL queries with conventions and a "trivial" mapping language? from Souripriya Das on 2010-03-23 (public-rdb2rdf-wg@w3.org from March 2010)

From: Souripriya Das <SOURIPRIYA.DAS@oracle.com>
Date: Mon, 22 Mar 2010 20:19:41 -0700 (PDT)
To: <richard@cyganiak.de>
Cc: <public-rdb2rdf-wg@w3.org>
Message-ID: <2e02e1af-40b2-4b2f-aec6-1f8057dafc8e@default>
Richard,

This is from an old prototype where the table emp (not a query like 'select ... from emp') has been mapped to RDF.

The following graph-pattern (BGP)

{ ?x  <http://im/USER1/EMP/EMPNO> ?eno .  ?x  <http://im/USER1/EMP/ENAME> ?mname .
  ?y  <http://im/USER1/EMP/MGR>   ?eno .  ?y <http://im/USER1/EMP/ENAME> ?ename }

was translated to

SELECT 
...
FROM (
SELECT   T0.ENAME as MNAME,T1.ENAME as ENAME,T0.EMPNO as ENO
FROM EMP T0, EMP T1 
 WHERE T0.EMPNO = T1.MGR
) R
WHERE 1=1 

Thanks,
- Souri.

----- Original Message -----
From: richard@cyganiak.de
To: SOURIPRIYA.DAS@oracle.com
Cc: public-rdb2rdf-wg@w3.org
Sent: Monday, March 22, 2010 6:59:00 PM GMT -05:00 US/Canada Eastern
Subject: Re: RDB2RDF mapping: Do we really need any alternative to use of SQL queries with conventions and a "trivial" mapping language?

Souri,

On 22 Mar 2010, at 21:44, Souripriya Das wrote:
> What you said about "triple view" makes sense. However, as I have  
> indicated a few times during the telcons and also during our  
> presentation:
>   we do not use the "triple view" approach with SQL.

I'm glad to hear that ;-)

I think you misread some of my message because I didn't sufficiently  
clarify my terms. We have to be careful wether we are talking about  
approaches to writing down RDB-to-RDF mappings, or about approaches to  
implement SPARQL-to-SQL translations over such mappings. These are  
almost orthogonal issues.

When I talk about UltraWrap's “triple view” approach, then I talk  
about an approach to implementing SPARQL-to-SQL translation. After  
all, that's what UltraWrap is all about -- it doesn't have a mapping  
language at all AFAICT.

One can generate a triple view from any RDB-to-RDF mapping  
specification, no matter what language the mapping is expressed in  
(well, if it doesn't exceed the expressivity of SQL).

> Instead we use the "relational query" approach. Thus the resulting  
> logical tables are not restricted to having only three (sub/pred/ 
> obj) columns. Each such logical table can have as many columns as  
> needed.

I do well understand how you are writing down RDB-to-RDF mappings for  
your system. But I do not know how your system implements SPARQL-to- 
SQL translation.

> (Thus, our approach is very similar to Triplify's approach and quite  
> different from UltraWrap's approach.)

Do you mean, similar/different in terms of the mapping specification  
or in terms of the implementation of SPARQL-to-SQL rewriting? I mean,  
UltraWrap has no mapping specification, and Triplify has no SPARQL-to- 
SQL rewriting, so I'm confused by your statement.

> With our approach, SPARQL-to-SQL translation generates SQL that is  
> pretty concise -- definitely not "humongous" -- and also fairly easy  
> to optimize (that can be accomplished even by "lesser" SQL  
> optimizers :-)).

Can I see some documentation about your approach to SPARQL-to-SQL  
rewriting?

Especially: Do you create views from the SQL queries in the mapping?

And do you parse the SQL queries in the mapping or does your system  
treat them as opaque?

Is your system able to generate SQL queries that do not have self- 
joins over the “logical tables”?

Cheers,
Richard



>
> Thanks,
> - Souri.
>
> ----- Original Message -----
> From: richard@cyganiak.de
> To: souripriya.das@oracle.com
> Cc: public-rdb2rdf-wg@w3.org
> Sent: Monday, March 22, 2010 1:40:09 PM GMT -05:00 US/Canada Eastern
> Subject: Re: RDB2RDF mapping: Do we really need any alternative to  
> use of SQL  queries with conventions and a "trivial" mapping language?
>
> On 22 Mar 2010, at 06:03, Souripriya Das wrote:
>> So far I have not seen or heard any convincing arguments to
>> establish that we need anything more than SQL and a "trivial"
>> mapping language. Before going for an alternative, we must first
>> establish the need for such an alternative.
>
> Fair enough.
>
> I have thought a lot about this in the previous weeks and have been
> flip-flopping on the issue. This is why I haven't written up a better
> explanation of the problem previously -- I'm not yet 100% sure what my
> own opinion on the issue is.
>
> Anyway, I will explain it here as good as I can. This is going to be
> quite long, sorry about that.
>
>
> 1. Why the SQL query based approach is nice
> -------------------------------------------
>
> First, I definitely see the attraction of the SQL query based
> approach. I understand this approach as: leveraging the expressivity
> of SQL to do as much of the mapping/transformation as possible, with
> some simple glue around it that essentially turns each SQL result
> record into a few triples according to some simple rules.
>
> It's attractive because the approach leverages existing SQL knowledge
> of mapping authors; it maximises expressivity; it means we don't have
> to specify a large chunk of the problem ourselves; it produces
> syntactically compact mappings. So, purely from an authoring point of
> view it is definitely a nicer approach than any of the proposed
> alternatives (D2RQ, Virtuoso RDF views, R2O etc).
>
> In order to run SPARQL queries against such a mapped database, one
> would use the “triple view” approach, as detailed in Juan's work. So
> the SPARQL-to-SQL engine would create a single view in the DB which
> consists of lots of unions and in the end contains one row for each
> mapped triple, with subject, predicate and object. How to run SPARQL
> queries against such a relational structure is well-known from prior
> work on database-backed triple stores. The result is a humongous SQL
> query over a humongous view definition, but as Juan has shown, good
> SQL optimizers can simplify this into a reasonable query plan.
>
> So here is why I argue against this approach.
>
>
> 2. Why the SQL query based approach fails in some cases
> -------------------------------------------------------
>
> First, I assume read-only access to the database. I cannot create
> custom views. So, to run SPARQL queries with the approach above, I'd
> have to use sub-SELECTs rather than views, which in theory should work
> just fine and should be an implementation detail.
>
> But second, I assume that we use the query optimizer of MySQL, which
> is unable to simplify the humongous SQL query from the approach
> described above into something that runs in acceptable runtime (as I
> demonstrated in [1]).
>
> Now if you happen to work for Oracle then you might say, “well they
> should just use a real database.” We can all chuckle about that for a
> minute and then get back to business. There are existing systems, such
> as D2RQ, that, whatever their limitations, produce decent performance
> of MySQL and other “lesser” database engines. This group *has* to
> standardise on a solution that is implementable on such engines.
>
> So, how do we get acceptable performance on MySQL and other “lesser”
> RDBMS, if we cannot use the “triple view” or “triple subselect”
> approach?
>
> Well, we cannot translate SPARQL queries into humongous SQL queries
> and then rely on the DB engine to simplify it so it runs in a
> reasonable time. We have to be smarter in the translation, and create
> SQL queries that are reasonably optimised straight away. I will not
> get into the details, which are complicated, but it means we can no
> longer treat the mapping's SQL queries as opaque blobs of SQL text
> that we can just pass to the DB without looking at them -- we have to
> dive into the SQL queries that define the mapping, analyse what they
> are doing, and take them apart.
>
>
> 3. How has this problem been solved in practice to date?
> --------------------------------------------------------
>
> Here is the “worse is better” approach to solve this problem: We can
> ask the *mapping author* to do the work for us and decompose the SQL
> query into simpler elements (join conditions, projection expressions,
> selection conditions and so on) and explain how they relate to each
> other through the structure of the mapping file. Then the SPARQL-to-
> SQL translation engine can build the optimised SQL query straight from
> these simpler SQL fragments. This is what is done in the D2RQ mapping
> language (see [2]).
>
> It is noteworthy that, to my knowledge, *every* RDB2RDF system to date
> that supports the evaluation of SPARQL queries over mapped databases,
> and assumes read-only access to the database, has opted for an
> approach similar to this: D2RQ, OpenLink Virtuoso, SquirrelRDF, R2O.
> None of their mapping languages specify the mapping using complete SQL
> queries; all languages decompose the queries into small chunks.
>
> To the best of my knowledge, there is *no* existing implementation
> that supports SPARQL over the mapped database, supports read-only
> access, and uses a mapping language based on the SQL query approach.
> There are implementations of the SQL query approach that allow RDF
> dumps of a mappded database (e.g., D2R Map) or resource-based linked
> data style access (e.g., Triplify). But supporting SPARQL queries over
> the mapped database is a task that is a whole lot more difficult.
>
>
> 4. How can we save the SQL query based approach?
> ------------------------------------------------
>
> So AFAIK no one has implemented the SQL query approach to support
> SPARQL queries over mapped databases. It doesn't necessarily follow
> that it's impossible, or even a bad idea. Could we specify our
> mappings using arbitrary SQL queries, then translate SPARQL queries
> over those mappings to SQL, and still end up with reasonably optimised
> SQL queries?
>
> If this is possible at reasonable implementation cost, then it would
> be a great way forward.
>
> I can imagine two approaches.
>
> First, you could develop your own custom SQL optimizer that takes the
> humongous SQL query resulting from the triple view approach and
> optimizes it to make the DB engine happy. I assert without proof that
> the implementation cost for this is prohibitive, especially because
> one has to create a different SQL optimiser for each imperfect
> database engine that one wants to support (because their native
> optimisers have different weaknesses, and because their SQL dialects
> differ).
>
> Second approach: Do not allow arbitrary SQL queries in the mapping
> language, but only a restricted subset. Then write a SQL parser that
> is just smart enough to chop these restricted SQL queries into their
> elements (such as join conditions, projection expressions, selection
> conditions and so on).
>
> So, while the existing implementations (D2RQ, Virtuoso, etc) ask the
> mapping author to do the job of decomposing the query into simpler
> elements as part of the process of writing a mapping, we would now
> have a parser that does the same job -- its input is a restricted SQL
> query and its output are those simpler elements.
>
> In practice, this will not be as simple as it might sound. It appears
> that one of the design goals of SQL was to make parser implementation
> as difficult as possible. This is compounded by the many differences
> between SQL dialects.
>
> Nevertheless, this approach seems promising, and it *might* be a way
> of supporting SPARQL queries on MySQL and other “lesser” DB engines,
> over a mapping language that uses the SQL query based approach.
>
>
> 5. Request for an existence proof
> ---------------------------------
>
> It seems that the proponents of the SQL query based approach fall into
> two camps:
>
> 1. Those whose plan to rely on their DB engine's great optimizer for
> doing all the hard work, and don't care wether it works on other
> databases
>
> 2. Those who have not really been hit by the practicalities of
> implementing a SPARQL engine over such a mapping when no good SQL
> optimizer is available
>
> Let me repeat that I believe that the SQL query based approach is
> better than the alternatives on almost every scale. The only problem
> is that it has not been shown that it can be implemented at reasonable
> cost in the absence of an advanced SQL optimizer. My concern is this:
> If the group standardises an approach that is only implementable on
> Oracle and SQL Server, then the group has failed. I hope that there is
> consensus on this question; if not, better bring it on the table NOW.
>
>
> If there was any implementation that used the SQL query based approach
> as a mapping language,
> parsed the mapping's SQL queries, and translates SPARQL queries into
> SQL queries that are significantly simpler than the humongous SQL
> queries produced by the “triple view” approach, then I'd be a whole
> lot more confident that the SQL query based approach ban be made to
> work on databases such as MySQL.
>
> A good benchmark might be self-joins. Can you translate SPARQL queries
> over the mapped DB into SQL queries that don't contain self-joins
> (joining a table to itself on the PK)?
>
> So, how would you solve this? Can you make the SQL based approach work
> without an awesome SQL optimizer? What if you have to support multiple
> SQL dialects?
>
> Best,
> Richard
>
>
> [1] http://www.w3.org/2001/sw/rdb2rdf/wiki/PotentialSQLIssues
> [2] http://www4.wiwiss.fu-berlin.de/bizer/D2RQ/spec/#specification
>
>
>
>>
>> Thanks,
>> - Souri.
>>
>
>
Received on Tuesday, 23 March 2010 03:22:04 UTC