On Parameters from Orri Erling on 2009-03-25 (public-rdf-dawg@w3.org from January to March 2009)

From: Orri Erling <erling@xs4all.nl>
Date: Wed, 25 Mar 2009 11:40:22 +0100
To: "'SPARQL Working Group'" <public-rdf-dawg@w3.org>
Message-Id: <200903251041.n2PAf0iJ022287@smtp-vbr1.xs4all.nl>
            

 

 

 

Following are some clarifications for the SPARQL parameters question.

 

The rationale for having parameters is the possibility of skipping query
compilation if the same parametrized query is to be executed multiple times
with different parameters.

 

In all SQL CLI's (call level interface) this is a given.   Also, most CLI's
have array  parameters, i.e. passing  multiple sets of parameter bindings in
a single client-server message.  The array parameter question can be
addressed in the  conntext of the SPARQL protocol by the HTTP 1.1 pipelining

 

Experience demonstrates that with simple workloads consisting of short
queries, such as the Berlin benchmark, up to 40% of server CPU ttime is
spent compiling the queries.  Most of  query compilation time is in turn
taken by deciding the join order,  which is typically done by generating
candidate plans and by running these through a cost model.  This is in
principle n! in complexity, where n is the number of joined tables, e.g.
triple patterns for SPARQL.  While real optimizers do better than n!, this
is still an exppensive operation.

 

 

The implementations in existence now (Virtuoso, Arq)  use the variable
syntax for parameters.  It may be argued that this is suboptimal, as it does
not allow signalling an error when a query intended for running with
parameters is invoked with too few parameters.  This is trivial to rectify
by introducing a special syntactic form for parameters, e.g. ?:xx.

 

This proposal was criticized for having potentially unforeseen implication
for optimization.  In SQL, this is not so much of an issue since the tables
and columns referenced in the query are always explicit.  Even in SQL,long
running queries should preferentially be written with literals and not
parameters. Consider:

 

select sum (l_extendedprice) from lineitem where l_returnflag = 'r' and
l_deliverydate >?;

If there is an index on delivery date and the value for delivery date is
selective, this should go by the index and if the date is not selective this
should go as a full table scan.  This is why parameters are discouraged for
this type of query, see TPC H.

 

 

With SPARQL, if you have 

 

select ?x where {?s a ?c . ?s ?p ?name}

 

and ?class and ?p and ?name are parameters, the optimizer has no very firm
grounds for join order.  Since there is one object  given, ?name, the 2nd
pattern is probably the more selective.  We note that there are generally
many rdf:type triples but that the cardinality of the object with these
triples is less than with triples in general.

 

We will note that a query of this sort is next to unworkable with on the fly
SPARQL to SQL  mapping but then it would be so even ifwithout parameters, if
?s ?p and ?name were not bound in the invoking context.

 

The use case for parameters is a short lookup query where the classes and
predicates that we expect to be literals are enough to give sufficient
cardinality information for optimization.  All the Berlin benchmark queries
are examples of such.

 

 

Without introduction of parameters, it is possible to recycle SPARQL query
plans by keeping a cache of query parse treeswith blanks for literals.  When
another query comes in where the literals are different, the same query plan
could be reused, just by plugging in the values in the executable form of
the query.  This is more difficult to implement and slightly slower at run
time than explicit parameters.  More importantly, this could lead to complex
misunderstandings when a literal did have significant impact on execution
plan.

 

Consider 

 

select sum (?ep) where { ?l a tpch:lineitem . ?l tpch:l_extendedprice ?ep .
?l tpch:l_returnflag "r". ?l tpch:l_deliverydate ?d . filter (?d >
"2009-1-1"^^xsd:date)}.

Suppose that the data were stored as physical triples.  There would be two
places to begin the join, one with returnflag  and the other with delivery
date.

 

We presume that there are relatively few returned items.  Hence the
optimizer, seeing the "r"would begin with this.  If this were "n" for not
returned, the date would be the better starting point.  If an automatic
reuse of plans were to take place, we would lose the advantage given by
knowing the literal at compile time for the second execution of a query with
different literals.

 

Needless to say, we would not consider variables in predicate or class
positions as substitutable in query reuse.

 

So, when one puts a parameter in a query, one de facto states that the run
time value will have little impact on cardinality for optimization purposes.
This is the difference between automatic reuse and explicit parameters.
Further, automatic reuse is more code and takes longer at run time, even
though over 90% of compiler time goes to deciding the join order.

 

One could further finesse an automatic reuse system by making a literal
substitutable  only if it were seen from the predicate that it did in fact
have a fairly even distribution and if the comparison were equality.  Thus
the query for the value of returned items would not be a candidate since the
date is a filter with > and the return flag does not have an even
distribution.  Further, plan reuse should only be considered for queries
where the actual execution time ended up being short, let us say less than
50 times the query compilation time and  under a second of real time.

 

To summarize, automatic reuse can have, in spite of all the above measures,
some surprising effects on optimizeability. I would say that a scheme like
outlined above would make automatic reuse relatively safe though.

 

As a developer of online applications and DBMS's for such, I still would
definitely use parameters where these did fit, as per SQL usage. 

I would also not wish the optimizer to change execution plans for queries
that repeated all the time.  I would in fact go as far as to making some
statistics fixed so as to keep the plan stable.  Oracle for example has
means for this in SQL.  The rationale is not to have the best plan but to
have predictable response.  The scenario to avoid is having an online portal
suddenly lose performance because of some shift in cardinalities leading to
the system switching to a bad plan.  Having to debug such in a live
situation should be avoided.  Parameters is a way of making things more
predictable by explicitly identifying the places where even distribution is
expected.  In online applications, the developer ought  to  know this.

 

For analytics and ad hoc querying, parameters are not an issue to begin
with.

 

 

Orri
Received on Wednesday, 25 March 2009 10:41:36 UTC