- From: Orri Erling <erling@xs4all.nl>
- Date: Wed, 25 Mar 2009 11:40:22 +0100
- To: "'SPARQL Working Group'" <public-rdf-dawg@w3.org>
- Message-Id: <200903251041.n2PAf0iJ022287@smtp-vbr1.xs4all.nl>
Following are some clarifications for the SPARQL parameters question. The rationale for having parameters is the possibility of skipping query compilation if the same parametrized query is to be executed multiple times with different parameters. In all SQL CLI's (call level interface) this is a given. Also, most CLI's have array parameters, i.e. passing multiple sets of parameter bindings in a single client-server message. The array parameter question can be addressed in the conntext of the SPARQL protocol by the HTTP 1.1 pipelining Experience demonstrates that with simple workloads consisting of short queries, such as the Berlin benchmark, up to 40% of server CPU ttime is spent compiling the queries. Most of query compilation time is in turn taken by deciding the join order, which is typically done by generating candidate plans and by running these through a cost model. This is in principle n! in complexity, where n is the number of joined tables, e.g. triple patterns for SPARQL. While real optimizers do better than n!, this is still an exppensive operation. The implementations in existence now (Virtuoso, Arq) use the variable syntax for parameters. It may be argued that this is suboptimal, as it does not allow signalling an error when a query intended for running with parameters is invoked with too few parameters. This is trivial to rectify by introducing a special syntactic form for parameters, e.g. ?:xx. This proposal was criticized for having potentially unforeseen implication for optimization. In SQL, this is not so much of an issue since the tables and columns referenced in the query are always explicit. Even in SQL,long running queries should preferentially be written with literals and not parameters. Consider: select sum (l_extendedprice) from lineitem where l_returnflag = 'r' and l_deliverydate >?; If there is an index on delivery date and the value for delivery date is selective, this should go by the index and if the date is not selective this should go as a full table scan. This is why parameters are discouraged for this type of query, see TPC H. With SPARQL, if you have select ?x where {?s a ?c . ?s ?p ?name} and ?class and ?p and ?name are parameters, the optimizer has no very firm grounds for join order. Since there is one object given, ?name, the 2nd pattern is probably the more selective. We note that there are generally many rdf:type triples but that the cardinality of the object with these triples is less than with triples in general. We will note that a query of this sort is next to unworkable with on the fly SPARQL to SQL mapping but then it would be so even ifwithout parameters, if ?s ?p and ?name were not bound in the invoking context. The use case for parameters is a short lookup query where the classes and predicates that we expect to be literals are enough to give sufficient cardinality information for optimization. All the Berlin benchmark queries are examples of such. Without introduction of parameters, it is possible to recycle SPARQL query plans by keeping a cache of query parse treeswith blanks for literals. When another query comes in where the literals are different, the same query plan could be reused, just by plugging in the values in the executable form of the query. This is more difficult to implement and slightly slower at run time than explicit parameters. More importantly, this could lead to complex misunderstandings when a literal did have significant impact on execution plan. Consider select sum (?ep) where { ?l a tpch:lineitem . ?l tpch:l_extendedprice ?ep . ?l tpch:l_returnflag "r". ?l tpch:l_deliverydate ?d . filter (?d > "2009-1-1"^^xsd:date)}. Suppose that the data were stored as physical triples. There would be two places to begin the join, one with returnflag and the other with delivery date. We presume that there are relatively few returned items. Hence the optimizer, seeing the "r"would begin with this. If this were "n" for not returned, the date would be the better starting point. If an automatic reuse of plans were to take place, we would lose the advantage given by knowing the literal at compile time for the second execution of a query with different literals. Needless to say, we would not consider variables in predicate or class positions as substitutable in query reuse. So, when one puts a parameter in a query, one de facto states that the run time value will have little impact on cardinality for optimization purposes. This is the difference between automatic reuse and explicit parameters. Further, automatic reuse is more code and takes longer at run time, even though over 90% of compiler time goes to deciding the join order. One could further finesse an automatic reuse system by making a literal substitutable only if it were seen from the predicate that it did in fact have a fairly even distribution and if the comparison were equality. Thus the query for the value of returned items would not be a candidate since the date is a filter with > and the return flag does not have an even distribution. Further, plan reuse should only be considered for queries where the actual execution time ended up being short, let us say less than 50 times the query compilation time and under a second of real time. To summarize, automatic reuse can have, in spite of all the above measures, some surprising effects on optimizeability. I would say that a scheme like outlined above would make automatic reuse relatively safe though. As a developer of online applications and DBMS's for such, I still would definitely use parameters where these did fit, as per SQL usage. I would also not wish the optimizer to change execution plans for queries that repeated all the time. I would in fact go as far as to making some statistics fixed so as to keep the plan stable. Oracle for example has means for this in SQL. The rationale is not to have the best plan but to have predictable response. The scenario to avoid is having an online portal suddenly lose performance because of some shift in cardinalities leading to the system switching to a bad plan. Having to debug such in a live situation should be avoided. Parameters is a way of making things more predictable by explicitly identifying the places where even distribution is expected. In online applications, the developer ought to know this. For analytics and ad hoc querying, parameters are not an issue to begin with. Orri
Received on Wednesday, 25 March 2009 10:41:36 UTC