- From: ashok malhotra <ashok.malhotra@oracle.com>
- Date: Mon, 22 Mar 2010 15:03:36 -0700
- To: Richard Cyganiak <richard@cyganiak.de>
- CC: Souripriya Das <souripriya.das@oracle.com>, Public-Rdb2rdf-Wg <public-rdb2rdf-wg@w3.org>
I think it's a poor argument to say that because MySQL does a poor job with sub-selects we should abandon the SQL-view approach. As I said in my earlier note Pl. take a look at http://dev.mysql.com/doc/refman/5.1/en/rewriting-subqueries.html which explains how to rewrite subqueries as joins. Also, the focus of this WG is on mapping Relational Data to RDF, not on optimizing SPARQL queries. I think this is a very interesting area in which various vendors will compete and some approaches and some databases will do better than others. All the best, Ashok Richard Cyganiak wrote: > On 22 Mar 2010, at 06:03, Souripriya Das wrote: >> So far I have not seen or heard any convincing arguments to establish >> that we need anything more than SQL and a "trivial" mapping language. >> Before going for an alternative, we must first establish the need for >> such an alternative. > > Fair enough. > > I have thought a lot about this in the previous weeks and have been > flip-flopping on the issue. This is why I haven't written up a better > explanation of the problem previously -- I'm not yet 100% sure what my > own opinion on the issue is. > > Anyway, I will explain it here as good as I can. This is going to be > quite long, sorry about that. > > > 1. Why the SQL query based approach is nice > ------------------------------------------- > > First, I definitely see the attraction of the SQL query based > approach. I understand this approach as: leveraging the expressivity > of SQL to do as much of the mapping/transformation as possible, with > some simple glue around it that essentially turns each SQL result > record into a few triples according to some simple rules. > > It's attractive because the approach leverages existing SQL knowledge > of mapping authors; it maximises expressivity; it means we don't have > to specify a large chunk of the problem ourselves; it produces > syntactically compact mappings. So, purely from an authoring point of > view it is definitely a nicer approach than any of the proposed > alternatives (D2RQ, Virtuoso RDF views, R2O etc). > > In order to run SPARQL queries against such a mapped database, one > would use the “triple view” approach, as detailed in Juan's work. So > the SPARQL-to-SQL engine would create a single view in the DB which > consists of lots of unions and in the end contains one row for each > mapped triple, with subject, predicate and object. How to run SPARQL > queries against such a relational structure is well-known from prior > work on database-backed triple stores. The result is a humongous SQL > query over a humongous view definition, but as Juan has shown, good > SQL optimizers can simplify this into a reasonable query plan. > > So here is why I argue against this approach. > > > 2. Why the SQL query based approach fails in some cases > ------------------------------------------------------- > > First, I assume read-only access to the database. I cannot create > custom views. So, to run SPARQL queries with the approach above, I'd > have to use sub-SELECTs rather than views, which in theory should work > just fine and should be an implementation detail. > > But second, I assume that we use the query optimizer of MySQL, which > is unable to simplify the humongous SQL query from the approach > described above into something that runs in acceptable runtime (as I > demonstrated in [1]). > > Now if you happen to work for Oracle then you might say, “well they > should just use a real database.” We can all chuckle about that for a > minute and then get back to business. There are existing systems, such > as D2RQ, that, whatever their limitations, produce decent performance > of MySQL and other “lesser” database engines. This group *has* to > standardise on a solution that is implementable on such engines. > > So, how do we get acceptable performance on MySQL and other “lesser” > RDBMS, if we cannot use the “triple view” or “triple subselect” approach? > > Well, we cannot translate SPARQL queries into humongous SQL queries > and then rely on the DB engine to simplify it so it runs in a > reasonable time. We have to be smarter in the translation, and create > SQL queries that are reasonably optimised straight away. I will not > get into the details, which are complicated, but it means we can no > longer treat the mapping's SQL queries as opaque blobs of SQL text > that we can just pass to the DB without looking at them -- we have to > dive into the SQL queries that define the mapping, analyse what they > are doing, and take them apart. > > > 3. How has this problem been solved in practice to date? > -------------------------------------------------------- > > Here is the “worse is better” approach to solve this problem: We can > ask the *mapping author* to do the work for us and decompose the SQL > query into simpler elements (join conditions, projection expressions, > selection conditions and so on) and explain how they relate to each > other through the structure of the mapping file. Then the > SPARQL-to-SQL translation engine can build the optimised SQL query > straight from these simpler SQL fragments. This is what is done in the > D2RQ mapping language (see [2]). > > It is noteworthy that, to my knowledge, *every* RDB2RDF system to date > that supports the evaluation of SPARQL queries over mapped databases, > and assumes read-only access to the database, has opted for an > approach similar to this: D2RQ, OpenLink Virtuoso, SquirrelRDF, R2O. > None of their mapping languages specify the mapping using complete SQL > queries; all languages decompose the queries into small chunks. > > To the best of my knowledge, there is *no* existing implementation > that supports SPARQL over the mapped database, supports read-only > access, and uses a mapping language based on the SQL query approach. > There are implementations of the SQL query approach that allow RDF > dumps of a mappded database (e.g., D2R Map) or resource-based linked > data style access (e.g., Triplify). But supporting SPARQL queries over > the mapped database is a task that is a whole lot more difficult. > > > 4. How can we save the SQL query based approach? > ------------------------------------------------ > > So AFAIK no one has implemented the SQL query approach to support > SPARQL queries over mapped databases. It doesn't necessarily follow > that it's impossible, or even a bad idea. Could we specify our > mappings using arbitrary SQL queries, then translate SPARQL queries > over those mappings to SQL, and still end up with reasonably optimised > SQL queries? > > If this is possible at reasonable implementation cost, then it would > be a great way forward. > > I can imagine two approaches. > > First, you could develop your own custom SQL optimizer that takes the > humongous SQL query resulting from the triple view approach and > optimizes it to make the DB engine happy. I assert without proof that > the implementation cost for this is prohibitive, especially because > one has to create a different SQL optimiser for each imperfect > database engine that one wants to support (because their native > optimisers have different weaknesses, and because their SQL dialects > differ). > > Second approach: Do not allow arbitrary SQL queries in the mapping > language, but only a restricted subset. Then write a SQL parser that > is just smart enough to chop these restricted SQL queries into their > elements (such as join conditions, projection expressions, selection > conditions and so on). > > So, while the existing implementations (D2RQ, Virtuoso, etc) ask the > mapping author to do the job of decomposing the query into simpler > elements as part of the process of writing a mapping, we would now > have a parser that does the same job -- its input is a restricted SQL > query and its output are those simpler elements. > > In practice, this will not be as simple as it might sound. It appears > that one of the design goals of SQL was to make parser implementation > as difficult as possible. This is compounded by the many differences > between SQL dialects. > > Nevertheless, this approach seems promising, and it *might* be a way > of supporting SPARQL queries on MySQL and other “lesser” DB engines, > over a mapping language that uses the SQL query based approach. > > > 5. Request for an existence proof > --------------------------------- > > It seems that the proponents of the SQL query based approach fall into > two camps: > > 1. Those whose plan to rely on their DB engine's great optimizer for > doing all the hard work, and don't care wether it works on other > databases > > 2. Those who have not really been hit by the practicalities of > implementing a SPARQL engine over such a mapping when no good SQL > optimizer is available > > Let me repeat that I believe that the SQL query based approach is > better than the alternatives on almost every scale. The only problem > is that it has not been shown that it can be implemented at reasonable > cost in the absence of an advanced SQL optimizer. My concern is this: > If the group standardises an approach that is only implementable on > Oracle and SQL Server, then the group has failed. I hope that there is > consensus on this question; if not, better bring it on the table NOW. > > > If there was any implementation that used the SQL query based approach > as a mapping language, > parsed the mapping's SQL queries, and translates SPARQL queries into > SQL queries that are significantly simpler than the humongous SQL > queries produced by the “triple view” approach, then I'd be a whole > lot more confident that the SQL query based approach ban be made to > work on databases such as MySQL. > > A good benchmark might be self-joins. Can you translate SPARQL queries > over the mapped DB into SQL queries that don't contain self-joins > (joining a table to itself on the PK)? > > So, how would you solve this? Can you make the SQL based approach work > without an awesome SQL optimizer? What if you have to support multiple > SQL dialects? > > Best, > Richard > > > [1] http://www.w3.org/2001/sw/rdb2rdf/wiki/PotentialSQLIssues > [2] http://www4.wiwiss.fu-berlin.de/bizer/D2RQ/spec/#specification > > > >> >> Thanks, >> - Souri. >> > >
Received on Monday, 22 March 2010 22:04:52 UTC