Re: On Parameters from Steve Harris on 2009-03-25 (public-rdf-dawg@w3.org from January to March 2009)

From: Steve Harris <steve.harris@garlik.com>
Date: Wed, 25 Mar 2009 12:03:03 +0000
To: SPARQL Working Group <public-rdf-dawg@w3.org>
Message-Id: <1E9DDE44-BF9F-4748-8AE3-970D24D1B24A@garlik.com>
First of all, I think that parameterised queries are a good idea in  
general, though really only to provide reliable escaping - something  
that SPARQL client libraries could do just as well, as in early ODBC  
implementations.

However, my impression from this discussion is that we are is a long  
way from having enough real world implementation experience to make a  
successful standardisation attempt. There is a very large burden for  
future implementers from us attempting to standardise this and getting  
it wrong.

Without even having implemented this scheme I see numerous issues, for  
example:

* How does this interact with new protocol extensions, such as the  
format=<mime-type> extension that a few people seem to have  
implemented, with queries such as SELECT ?x WHERE { ?x :uses ? 
format }. Do we ban variables like "?query", have reserved words? None  
of the options seem very palatable.

* How are the processors supposed to determine the type of the  
arguments? eg is x=1 the integer "1"^^xsd:integer, the xsd:string/ 
plain literal "1", or the local URI <1>, and so on. ...&x=%221%22%5E%5E 
%3Chttp%3A//www.w3.org/2001/XMLSchema%23integer%3E seems somewhat  
clumsy.

* If this is supposed to be about helping the optimiser then there are  
more appropriate features. Something closer to stored procedures would  
be much easier to handle, have fewer nasty corner cases, better  
opportunities for optimiser hinting, and lower ambiguity.

Consider the pathological (and common) parameterised query:
SELECT * WHERE { ?s ?p ?o }

I can call this with ?p=a, ?o=foaf:Person, or ?s=<http://plugin.org.uk/swh.xrdf#me 
 >, or any number of perfectly reasonable options, but all of which  
have wildly different optimisation strategies. All are perfectly  
sensible queries, given that the user has access to a working URI  
encoder, and no working SPARQL constant escaper (a very common  
situation currently).

All in all this all feels a lot like research, the implementation  
experience from SQL transport layers does not carry over (the  
syntactic and protocol differences in SQL are significant, ODBC for  
example does not standardise the protocol), and I'm highly unconvinced  
that we can do a good job of standardising it, at this point.

As a general note, I personally don't find the "SQL does it, so should  
we" viewpoint to be very persuasive. SPARQL is not SQL, and many  
things in SQL are questionable. I think we should be looking to learn  
from their mistakes, as well as the things they did well.

- Steve

On 25 Mar 2009, at 10:40, Orri Erling wrote:

>
>
>
>
> Following are some clarifications for the SPARQL parameters question.
>
> The rationale for having parameters is the possibility of skipping  
> query compilation if the same parametrized query is to be executed  
> multiple times with different parameters.
>
> In all SQL CLI's (call level interface) this is a given.   Also,  
> most CLI's have array  parameters, i.e. passing  multiple sets of  
> parameter bindings in a single client-server message.  The array  
> parameter question can be addressed in the  conntext of the SPARQL  
> protocol by the HTTP 1.1 pipelining
>
> Experience demonstrates that with simple workloads consisting of  
> short queries, such as the Berlin benchmark, up to 40% of server CPU  
> ttime is spent compiling the queries.  Most of  query compilation  
> time is in turn taken by deciding the join order,  which is  
> typically done by generating  candidate plans and by running these  
> through a cost model.  This is in principle n! in complexity, where  
> n is the number of joined tables, e.g. triple patterns for SPARQL.   
> While real optimizers do better than n!, this is still an exppensive  
> operation.
>
>
> The implementations in existence now (Virtuoso, Arq)  use the  
> variable syntax for parameters.  It may be argued that this is  
> suboptimal, as it does not allow signalling an error when a query  
> intended for running with parameters is invoked with too few  
> parameters.  This is trivial to rectify by introducing a special  
> syntactic form for parameters, e.g. ?:xx.
>
> This proposal was criticized for having potentially unforeseen  
> implication for optimization.  In SQL, this is not so much of an  
> issue since the tables and columns referenced in the query are  
> always explicit.  Even in SQL,long running queries should  
> preferentially be written with literals and not parameters. Consider:
>
> select sum (l_extendedprice) from lineitem where l_returnflag = 'r'  
> and l_deliverydate >?;
> If there is an index on delivery date and the value for delivery  
> date is selective, this should go by the index and if the date is  
> not selective this should go as a full table scan.  This is why  
> parameters are discouraged for this type of query, see TPC H.
>
>
> With SPARQL, if you have
>
> select ?x where {?s a ?c . ?s ?p ?name}
>
> and ?class and ?p and ?name are parameters, the optimizer has no  
> very firm grounds for join order.  Since there is one object   
> given, ?name, the 2nd pattern is probably the more selective.  We  
> note that there are generally many rdf:type triples but that the  
> cardinality of the object with these triples is less than with  
> triples in general.
>
> We will note that a query of this sort is next to unworkable with on  
> the fly SPARQL to SQL  mapping but then it would be so even  
> ifwithout parameters, if ?s ?p and ?name were not bound in the  
> invoking context.
>
> The use case for parameters is a short lookup query where the  
> classes and predicates that we expect to be literals are enough to  
> give sufficient cardinality information for optimization.  All the  
> Berlin benchmark queries are examples of such.
>
>
> Without introduction of parameters, it is possible to recycle SPARQL  
> query plans by keeping a cache of query parse treeswith blanks for  
> literals.  When another query comes in where the literals are  
> different, the same query plan could be reused, just by plugging in  
> the values in the executable form of the query.  This is more  
> difficult to implement and slightly slower at run time than explicit  
> parameters.  More importantly, this could lead to complex  
> misunderstandings when a literal did have significant impact on  
> execution plan.
>
> Consider
>
> select sum (?ep) where { ?l a tpch:lineitem . ?l  
> tpch:l_extendedprice ?ep . ?l tpch:l_returnflag "r". ?l  
> tpch:l_deliverydate ?d . filter (?d > "2009-1-1"^^xsd:date)}.
> Suppose that the data were stored as physical triples.  There would  
> be two places to begin the join, one with returnflag  and the other  
> with delivery date.
>
> We presume that there are relatively few returned items.  Hence the  
> optimizer, seeing the "r"would begin with this.  If this were "n"  
> for not returned, the date would be the better starting point.  If  
> an automatic reuse of plans were to take place, we would lose the  
> advantage given by knowing the literal at compile time for the  
> second execution of a query with different literals.
>
> Needless to say, we would not consider variables in predicate or  
> class positions as substitutable in query reuse.
>
> So, when one puts a parameter in a query, one de facto states that  
> the run time value will have little impact on cardinality for  
> optimization purposes.  This is the difference between automatic  
> reuse and explicit parameters.  Further, automatic reuse is more  
> code and takes longer at run time, even though over 90% of compiler  
> time goes to deciding the join order.
>
> One could further finesse an automatic reuse system by making a  
> literal substitutable  only if it were seen from the predicate that  
> it did in fact have a fairly even distribution and if the comparison  
> were equality.  Thus the query for the value of returned items would  
> not be a candidate since the date is a filter with > and the return  
> flag does not have an even distribution.  Further, plan reuse should  
> only be considered for queries where the actual execution time ended  
> up being short, let us say less than 50 times the query compilation  
> time and  under a second of real time.
>
> To summarize, automatic reuse can have, in spite of all the above  
> measures, some surprising effects on optimizeability. I would say  
> that a scheme like outlined above would make automatic reuse  
> relatively safe though.
>
> As a developer of online applications and DBMS's for such, I still  
> would definitely use parameters where these did fit, as per SQL usage.
> I would also not wish the optimizer to change execution plans for  
> queries that repeated all the time.  I would in fact go as far as to  
> making some statistics fixed so as to keep the plan stable.  Oracle  
> for example has means for this in SQL.  The rationale is not to have  
> the best plan but to have predictable response.  The scenario to  
> avoid is having an online portal suddenly lose performance because  
> of some shift in cardinalities leading to the system switching to a  
> bad plan.  Having to debug such in a live situation should be  
> avoided.  Parameters is a way of making things more predictable by  
> explicitly identifying the places where even distribution is   
> expected.  In online applications, the developer ought  to  know this.
>
> For analytics and ad hoc querying, parameters are not an issue to  
> begin with.
>
>
> Orri
>
>
>
>
>
>

-- 
Steve Harris
Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
+44(0)20 8973 2465  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10  
9AD
Received on Wednesday, 25 March 2009 12:03:40 UTC