- From: Ezzat, Ahmed <Ahmed.Ezzat@hp.com>
- Date: Tue, 23 Mar 2010 14:33:22 +0000
- To: "ashok.malhotra@oracle.com" <ashok.malhotra@oracle.com>, Richard Cyganiak <richard@cyganiak.de>
- CC: Souripriya Das <souripriya.das@oracle.com>, Public-Rdb2rdf-Wg <public-rdb2rdf-wg@w3.org>
Hi All,

Let us avoid questioning motives. I can share that sub-queries work well with many databases, including HP, Teradata, etc.

Regards,
Ahmed

-----Original Message-----
From: public-rdb2rdf-wg-request@w3.org [mailto:public-rdb2rdf-wg-request@w3.org] On Behalf Of ashok malhotra
Sent: Tuesday, March 23, 2010 5:16 AM
To: Richard Cyganiak
Cc: Souripriya Das; Public-Rdb2rdf-Wg
Subject: Re: RDB2RDF mapping: Do we really need any alternative to use of SQL queries with conventions and a "trivial" mapping language?

It's a bit of a leap to move from the performance problems of MySQL on subselects to

> It is not even clear if this approach can be implemented efficiently on any database engines other than yours.

All the best, Ashok

Richard Cyganiak wrote:
> Ashok,
>
> On 22 Mar 2010, at 22:03, ashok malhotra wrote:
>> I think it's a poor argument to say that because MySQL does a poor job with sub-selects we should abandon the SQL-view approach.
>
> I did not say that the SQL query approach should be abandoned.
>
> I also did not cite problems with sub-selects as a reason for my reservations about the approach.
>
> So let's turn this around.
>
> There are several perfectly fine RDB2RDF mapping languages and RDB2RDF implementations that have been around for years and that do quite well on those "lesser" databases. None of them use the SQL query approach. It would hardly be unreasonable to argue that one of those proven approaches should be standardised by the WG, rather than the SQL query based approach.
>
> From that point of view, the need for the SQL query based approach has not been conclusively shown. It is not even clear if this approach can be implemented efficiently on any database engines other than yours.
>
> Best,
> Richard
>
>> As I said in my earlier note, please take a look at http://dev.mysql.com/doc/refman/5.1/en/rewriting-subqueries.html, which explains how to rewrite subqueries as joins.
>>
>> Also, the focus of this WG is on mapping Relational Data to RDF, not on optimizing SPARQL queries. I think this is a very interesting area in which various vendors will compete, and some approaches and some databases will do better than others.
>>
>> All the best, Ashok
>>
>> Richard Cyganiak wrote:
>>> On 22 Mar 2010, at 06:03, Souripriya Das wrote:
>>>> So far I have not seen or heard any convincing arguments to establish that we need anything more than SQL and a "trivial" mapping language. Before going for an alternative, we must first establish the need for such an alternative.
>>>
>>> Fair enough.
>>>
>>> I have thought a lot about this in the previous weeks and have been flip-flopping on the issue. This is why I haven't written up a better explanation of the problem previously -- I'm not yet 100% sure what my own opinion on the issue is.
>>>
>>> Anyway, I will explain it here as well as I can. This is going to be quite long, sorry about that.
>>>
>>> 1. Why the SQL query based approach is nice
>>> -------------------------------------------
>>>
>>> First, I definitely see the attraction of the SQL query based approach. I understand this approach as: leveraging the expressivity of SQL to do as much of the mapping/transformation as possible, with some simple glue around it that essentially turns each SQL result record into a few triples according to some simple rules.
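>>>
>>> For illustration only -- a minimal sketch of what such a mapping rule could look like, assuming a hypothetical EMP(empno, ename, deptno) table and a made-up convention that the first column names the subject and the remaining column aliases name the predicates:
>>>
>>>     -- Each result row is turned into triples by convention, e.g.
>>>     --   <emp/{id}>  ex:name  "{name}" .
>>>     --   <emp/{id}>  ex:dept  <dept/{dept}> .
>>>     SELECT empno  AS id,
>>>            ename  AS name,
>>>            deptno AS dept
>>>     FROM   EMP;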
>>>
>>> It's attractive because the approach leverages existing SQL knowledge of mapping authors; it maximises expressivity; it means we don't have to specify a large chunk of the problem ourselves; it produces syntactically compact mappings. So, purely from an authoring point of view it is definitely a nicer approach than any of the proposed alternatives (D2RQ, Virtuoso RDF views, R2O, etc.).
>>>
>>> In order to run SPARQL queries against such a mapped database, one would use the "triple view" approach, as detailed in Juan's work. So the SPARQL-to-SQL engine would create a single view in the DB which consists of lots of unions and in the end contains one row for each mapped triple, with subject, predicate and object. How to run SPARQL queries against such a relational structure is well-known from prior work on database-backed triple stores. The result is a humongous SQL query over a humongous view definition, but as Juan has shown, good SQL optimizers can simplify this into a reasonable query plan.
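>>>
>>> For illustration -- a cut-down sketch of such a triple view, again assuming the hypothetical EMP(empno, ename, deptno) table and made-up predicate names:
>>>
>>>     -- One row per mapped triple: (subject, predicate, object).
>>>     CREATE VIEW triples (s, p, o) AS
>>>       SELECT 'emp/' || empno, 'ex:name', ename             FROM EMP
>>>       UNION ALL
>>>       SELECT 'emp/' || empno, 'ex:dept', 'dept/' || deptno FROM EMP;
>>>       -- ...plus one UNION ALL branch per mapped column of every other table.
>>>
>>> A SPARQL basic graph pattern with n triple patterns then translates, naively, into an n-way join of this view with itself -- the humongous query that the optimizer is asked to collapse.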
>>>
>>> So here is why I argue against this approach.
>>>
>>> 2. Why the SQL query based approach fails in some cases
>>> -------------------------------------------------------
>>>
>>> First, I assume read-only access to the database. I cannot create custom views. So, to run SPARQL queries with the approach above, I'd have to use sub-SELECTs rather than views, which in theory should work just fine and should be an implementation detail.
>>>
>>> But second, I assume that we use the query optimizer of MySQL, which is unable to simplify the humongous SQL query from the approach described above into something that runs in acceptable time (as I demonstrated in [1]).
>>>
>>> Now, if you happen to work for Oracle, then you might say, "Well, they should just use a real database." We can all chuckle about that for a minute and then get back to business. There are existing systems, such as D2RQ, that, whatever their limitations, produce decent performance on MySQL and other "lesser" database engines. This group *has* to standardise on a solution that is implementable on such engines.
>>>
>>> So, how do we get acceptable performance on MySQL and other "lesser" RDBMSs, if we cannot use the "triple view" or "triple subselect" approach?
>>>
>>> Well, we cannot translate SPARQL queries into humongous SQL queries and then rely on the DB engine to simplify them so they run in a reasonable time. We have to be smarter in the translation, and create SQL queries that are reasonably optimised straight away. I will not get into the details, which are complicated, but it means we can no longer treat the mapping's SQL queries as opaque blobs of SQL text that we can just pass to the DB without looking at them -- we have to dive into the SQL queries that define the mapping, analyse what they are doing, and take them apart.
>>>
>>> 3. How has this problem been solved in practice to date?
>>> --------------------------------------------------------
>>>
>>> Here is the "worse is better" approach to solving this problem: we can ask the *mapping author* to do the work for us and decompose the SQL query into simpler elements (join conditions, projection expressions, selection conditions and so on), and explain how they relate to each other through the structure of the mapping file. Then the SPARQL-to-SQL translation engine can build the optimised SQL query straight from these simpler SQL fragments. This is what is done in the D2RQ mapping language (see [2]).
>>>
>>> It is noteworthy that, to my knowledge, *every* RDB2RDF system to date that supports the evaluation of SPARQL queries over mapped databases, and assumes read-only access to the database, has opted for an approach similar to this: D2RQ, OpenLink Virtuoso, SquirrelRDF, R2O. None of their mapping languages specify the mapping using complete SQL queries; all of them decompose the queries into small chunks.
>>>
>>> To the best of my knowledge, there is *no* existing implementation that supports SPARQL over the mapped database, supports read-only access, and uses a mapping language based on the SQL query approach. There are implementations of the SQL query approach that allow RDF dumps of a mapped database (e.g., D2R Map) or resource-based, linked-data-style access (e.g., Triplify). But supporting SPARQL queries over the mapped database is a task that is a whole lot more difficult.
>>>
>>> 4. How can we save the SQL query based approach?
>>> ------------------------------------------------
>>>
>>> So, AFAIK, no one has implemented the SQL query approach to support SPARQL queries over mapped databases. It doesn't necessarily follow that it's impossible, or even a bad idea. Could we specify our mappings using arbitrary SQL queries, then translate SPARQL queries over those mappings to SQL, and still end up with reasonably optimised SQL queries?
>>>
>>> If this is possible at reasonable implementation cost, then it would be a great way forward.
>>>
>>> I can imagine two approaches.
>>>
>>> First, you could develop your own custom SQL optimizer that takes the humongous SQL query resulting from the triple view approach and optimizes it to make the DB engine happy. I assert without proof that the implementation cost for this is prohibitive, especially because one has to create a different SQL optimiser for each imperfect database engine that one wants to support (because their native optimisers have different weaknesses, and because their SQL dialects differ).
>>>
>>> Second approach: Do not allow arbitrary SQL queries in the mapping language, but only a restricted subset. Then write a SQL parser that is just smart enough to chop these restricted SQL queries into their elements (such as join conditions, projection expressions, selection conditions and so on); a sketch of what such a decomposition could look like follows at the end of this section.
>>>
>>> So, while the existing implementations (D2RQ, Virtuoso, etc.) ask the mapping author to do the job of decomposing the query into simpler elements as part of the process of writing a mapping, we would now have a parser that does the same job -- its input is a restricted SQL query and its output is those simpler elements.
>>>
>>> In practice, this will not be as simple as it might sound. It appears that one of the design goals of SQL was to make parser implementation as difficult as possible. This is compounded by the many differences between SQL dialects.
>>>
>>> Nevertheless, this approach seems promising, and it *might* be a way of supporting SPARQL queries on MySQL and other "lesser" DB engines, over a mapping language that uses the SQL query based approach.
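>>>
>>> To make the second approach concrete -- a rough sketch, over a hypothetical EMP/DEPT schema, of a restricted mapping query and the elements a parser would have to recover from it:
>>>
>>>     -- Restricted mapping query: a single SELECT, no subqueries.
>>>     SELECT e.empno, e.ename, d.dname
>>>     FROM   EMP e, DEPT d
>>>     WHERE  e.deptno = d.deptno
>>>       AND  e.status = 'ACTIVE';
>>>
>>>     -- Elements extracted by the parser (roughly what D2RQ asks the
>>>     -- mapping author to write down by hand):
>>>     --   projection expressions: e.empno, e.ename, d.dname
>>>     --   join condition:         e.deptno = d.deptno
>>>     --   selection condition:    e.status = 'ACTIVE'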
>>>
>>> 5. Request for an existence proof
>>> ---------------------------------
>>>
>>> It seems that the proponents of the SQL query based approach fall into two camps:
>>>
>>> 1. Those who plan to rely on their DB engine's great optimizer for doing all the hard work, and don't care whether it works on other databases
>>>
>>> 2. Those who have not really been hit by the practicalities of implementing a SPARQL engine over such a mapping when no good SQL optimizer is available
>>>
>>> Let me repeat that I believe the SQL query based approach is better than the alternatives on almost every scale. The only problem is that it has not been shown that it can be implemented at reasonable cost in the absence of an advanced SQL optimizer. My concern is this: if the group standardises an approach that is only implementable on Oracle and SQL Server, then the group has failed. I hope that there is consensus on this question; if not, we had better bring it to the table NOW.
>>>
>>> If there were any implementation that used the SQL query based approach as a mapping language, parsed the mapping's SQL queries, and translated SPARQL queries into SQL queries that are significantly simpler than the humongous SQL queries produced by the "triple view" approach, then I'd be a whole lot more confident that the SQL query based approach can be made to work on databases such as MySQL.
>>>
>>> A good benchmark might be self-joins. Can you translate SPARQL queries over the mapped DB into SQL queries that don't contain self-joins (joining a table to itself on the PK)?
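>>>
>>> To illustrate what I mean -- a minimal sketch, reusing the hypothetical EMP table and triple view from above: for the SPARQL pattern { ?e ex:name ?n . ?e ex:dept ?d }, the naive triple-view translation produces a self-join,
>>>
>>>     SELECT t1.o AS n, t2.o AS d
>>>     FROM   triples t1, triples t2
>>>     WHERE  t1.p = 'ex:name'
>>>       AND  t2.p = 'ex:dept'
>>>       AND  t1.s = t2.s;   -- self-join on the subject, i.e. on the PK
>>>
>>> whereas a translator that understands the mapping could emit a single-table query such as
>>>
>>>     SELECT ename AS n, deptno AS d FROM EMP;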
>>>
>>> So, how would you solve this? Can you make the SQL based approach work without an awesome SQL optimizer? What if you have to support multiple SQL dialects?
>>>
>>> Best,
>>> Richard
>>>
>>> [1] http://www.w3.org/2001/sw/rdb2rdf/wiki/PotentialSQLIssues
>>> [2] http://www4.wiwiss.fu-berlin.de/bizer/D2RQ/spec/#specification
>>>
>>>> Thanks,
>>>> - Souri.

Received on Tuesday, 23 March 2010 14:34:24 UTC