RE: On Parameters from Seaborne, Andy on 2009-03-25 (public-rdf-dawg@w3.org from January to March 2009)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Wed, 25 Mar 2009 15:20:05 +0000
To: Steve Harris <steve.harris@garlik.com>, SPARQL Working Group <public-rdf-dawg@w3.org>
Message-ID: <B6CF1054FDC8B845BF93A6645D19BEA3628D9A721A@GVW1118EXC.americas.hpqcorp.net>
I also feel nervous that the parameters feature is being driven by optimization concerns and an appeal to the analogy with SQL.  I'd feel more comfortable if it were driven by some application-writer added value; that is, an external advantage, not an implementer's and tuning advantage.  So far, such a use case has not come to light.

Orri:
> > Without introduction of parameters, it is possible to recycle SPARQL
> > query plans by keeping a cache of query parse trees with blanks for
> > literals.  When another query comes in where the literals are
> > different, the same query plan could be reused, just by plugging in
> > the values in the executable form of the query.  This is more
> > difficult to implement and slightly slower at run time than explicit
> > parameters.  More importantly, this could lead to complex
> > misunderstandings when a literal did have significant impact on
> > execution plan.

I don't know how hard or easy it would be to make this work in practice, but it looks quite reasonable: it does not need query parameters, yet it delivers the benefits of amortized query plans.  This is one reason that it might be too early to use optimization as a driving reason for parameters.
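
A minimal sketch of the plan cache Orri describes, assuming a purely textual approach: each quoted literal is replaced by a numbered slot and the resulting template is the cache key.  The names blank_literals, PlanCache and compile_plan are hypothetical and do not come from any existing engine; the regular expression covers only simple quoted literals, not the full SPARQL grammar.  The sketch also ignores the caveat raised above, namely that a literal may itself change which plan is best.

```python
import re

# Simplified pattern for a quoted literal with an optional datatype or
# language tag. This is an assumption for illustration, not the SPARQL grammar.
LITERAL = re.compile(
    r'"(?:[^"\\]|\\.)*"(?:\^\^(?:<[^>]*>|[^\s.;,)}]+)|@[A-Za-z0-9-]+)?'
)


def blank_literals(query):
    """Return (template, literals): the query with each literal replaced by a slot."""
    literals = []

    def slot(match):
        literals.append(match.group(0))
        return f"$lit{len(literals) - 1}"

    template = LITERAL.sub(slot, query)
    # Collapse whitespace so that formatting differences do not defeat the cache.
    return " ".join(template.split()), literals


class PlanCache:
    """Cache of plans keyed by the literal-free query template."""

    def __init__(self, compile_plan):
        self._compile_plan = compile_plan  # hypothetical expensive planning step
        self._plans = {}
        self.compilations = 0

    def lookup(self, query):
        template, literals = blank_literals(query)
        if template not in self._plans:
            self._plans[template] = self._compile_plan(template)
            self.compilations += 1
        # The caller plugs the literals back into the cached plan.
        return self._plans[template], literals
```

Two queries that differ only in their literals then share one compiled plan, and the extracted literals are what would be plugged into the executable form.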

----

If we do spec this feature, I'd like to see both mandatory and optional parameters.  Optional parameters can be done using the existing named variables; the advantage of optional parameters is that they do occur when results from one query are being used to drive another.  The added checking of mandatory parameters, using variables with a distinguished name (?:name was suggested), is, I feel, sufficiently valuable to justify the additional speccing.
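
A minimal sketch of how the two kinds of parameter could behave, assuming textual substitution on the query string: a ?:name parameter must be bound or the call is rejected, while an ordinary ?name variable is replaced only if a binding happens to be supplied.  The names bind_parameters and MissingParameter are hypothetical, and bindings are assumed to be already serialised SPARQL terms.  A real processor would presumably substitute on the parsed query rather than on text, so that variables in the SELECT clause or text inside literals and IRIs are not touched.

```python
import re

# "?:name" marks a mandatory parameter, "?name" an ordinary variable.
PARAM = re.compile(r'\?(:?)([A-Za-z_]\w*)')


class MissingParameter(ValueError):
    """Raised when a mandatory ?:name parameter has no binding."""


def bind_parameters(query, bindings):
    """Substitute bindings (name -> serialised SPARQL term) into the query text."""
    missing = sorted(
        {name for colon, name in PARAM.findall(query)
         if colon and name not in bindings}
    )
    if missing:
        raise MissingParameter(f"unbound mandatory parameters: {missing}")

    def substitute(match):
        # Unbound ordinary variables are left as they are.
        return bindings.get(match.group(2), match.group(0))

    return PARAM.sub(substitute, query)
```

For example, binding only "class" in SELECT ?s WHERE { ?s a ?:class . ?s ?p ?name } leaves ?name as a variable, whereas omitting "class" raises MissingParameter.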

This is not a backward compatible feature.  How much do we want to smooth the transition for SPARQL/2010 [*] clients querying SPARQL/2008 servers?

 Andy

[*] We need a better name!  And a URI.

> -----Original Message-----
> From: public-rdf-dawg-request@w3.org [mailto:public-rdf-dawg-
> request@w3.org] On Behalf Of Steve Harris
> Sent: 25 March 2009 12:03
> To: SPARQL Working Group
> Subject: Re: On Parameters
> 
> First of all, I think that parameterised queries are a good idea in
> general, though really only to provide reliable escaping - something
> that SPARQL client libraries could do just as well, as in early ODBC
> implementations.
> 
> However, my impression from this discussion is that we are a long
> way from having enough real world implementation experience to make a
> successful standardisation attempt. There is a very large burden for
> future implementers from us attempting to standardise this and getting
> it wrong.
> 
> Without even having implemented this scheme I see numerous issues, for
> example:
> 
> * How does this interact with new protocol extensions, such as the
> format=<mime-type> extension that a few people seem to have
> implemented, with queries such as
> SELECT ?x WHERE { ?x :uses ?format }. Do we ban variables like
> "?query", or have reserved words? None of the options seem very
> palatable.
> 
> * How are the processors supposed to determine the type of the
> arguments? E.g. is x=1 the integer "1"^^xsd:integer, the xsd:string/
> plain literal "1", or the local URI <1>, and so on?
> ...&x=%221%22%5E%5E%3Chttp%3A//www.w3.org/2001/XMLSchema%23integer%3E
> seems somewhat clumsy.
> 
> * If this is supposed to be about helping the optimiser then there are
> more appropriate features. Something closer to stored procedures would
> be much easier to handle, have fewer nasty corner cases, better
> opportunities for optimiser hinting, and lower ambiguity.
> 
> Consider the pathological (and common) parameterised query:
> SELECT * WHERE { ?s ?p ?o }
> 
> I can call this with ?p=a, ?o=foaf:Person, or
> ?s=<http://plugin.org.uk/swh.xrdf#me>, or any number of perfectly
> reasonable options, but all of which
> have wildly different optimisation strategies. All are perfectly
> sensible queries, given that the user has access to a working URI
> encoder, and no working SPARQL constant escaper (a very common
> situation currently).
> 
> All in all, this feels a lot like research. The implementation
> experience from SQL transport layers does not carry over (the
> syntactic and protocol differences in SQL are significant; ODBC, for
> example, does not standardise the protocol), and I'm highly unconvinced
> that we can do a good job of standardising it at this point.
> 
> As a general note, I personally don't find the "SQL does it, so should
> we" viewpoint to be very persuasive. SPARQL is not SQL, and many
> things in SQL are questionable. I think we should be looking to learn
> from their mistakes, as well as the things they did well.
> 
> - Steve
> 
> On 25 Mar 2009, at 10:40, Orri Erling wrote:
> 
> >
> > Following are some clarifications for the SPARQL parameters question.
> >
> > The rationale for having parameters is the possibility of skipping
> > query compilation if the same parametrized query is to be executed
> > multiple times with different parameters.
> >
> > In all SQL CLIs (call-level interfaces) this is a given.  Also,
> > most CLIs have array parameters, i.e. passing multiple sets of
> > parameter bindings in a single client-server message.  The array
> > parameter question can be addressed in the context of the SPARQL
> > protocol by HTTP 1.1 pipelining.
> >
> > Experience demonstrates that with simple workloads consisting of
> > short queries, such as the Berlin benchmark, up to 40% of server CPU
> > time is spent compiling the queries.  Most of the query compilation
> > time is in turn taken by deciding the join order, which is
> > typically done by generating candidate plans and running these
> > through a cost model.  This is in principle n! in complexity, where
> > n is the number of joined tables, e.g. triple patterns for SPARQL.
> > While real optimizers do better than n!, this is still an expensive
> > operation.
> >
> >
> > The implementations in existence now (Virtuoso, ARQ) use the
> > variable syntax for parameters.  It may be argued that this is
> > suboptimal, as it does not allow signalling an error when a query
> > intended for running with parameters is invoked with too few
> > parameters.  This is trivial to rectify by introducing a special
> > syntactic form for parameters, e.g. ?:xx.
> >
> > This proposal was criticized for having potentially unforeseen
> > implications for optimization.  In SQL, this is not so much of an
> > issue since the tables and columns referenced in the query are
> > always explicit.  Even in SQL, long-running queries should
> > preferably be written with literals and not parameters.  Consider:
> >
> > select sum (l_extendedprice) from lineitem where l_returnflag = 'r'
> > and l_deliverydate > ?;
> >
> > If there is an index on delivery date and the value for delivery
> > date is selective, this should go by the index, and if the date is
> > not selective, this should go as a full table scan.  This is why
> > parameters are discouraged for this type of query; see TPC-H.
> >
> >
> > With SPARQL, if you have
> >
> > select ?s where {?s a ?class . ?s ?p ?name}
> >
> > and ?class and ?p and ?name are parameters, the optimizer has no
> > very firm grounds for choosing a join order.  Since there is one
> > object given, ?name, the 2nd pattern is probably the more selective.
> > We note that there are generally many rdf:type triples but that the
> > cardinality of the object with these triples is less than with
> > triples in general.
> >
> > We will note that a query of this sort is next to unworkable with
> > on-the-fly SPARQL-to-SQL mapping, but then it would be so even
> > without parameters, if ?s, ?p and ?name were not bound in the
> > invoking context.
> >
> > The use case for parameters is a short lookup query where the
> > classes and predicates that we expect to be literals are enough to
> > give sufficient cardinality information for optimization.  All the
> > Berlin benchmark queries are examples of such.
> >
> >
> > Without introduction of parameters, it is possible to recycle SPARQL
> > query plans by keeping a cache of query parse trees with blanks for
> > literals.  When another query comes in where the literals are
> > different, the same query plan could be reused, just by plugging in
> > the values in the executable form of the query.  This is more
> > difficult to implement and slightly slower at run time than explicit
> > parameters.  More importantly, this could lead to complex
> > misunderstandings when a literal did have significant impact on
> > execution plan.
> >
> > Consider
> >
> > select sum (?ep) where { ?l a tpch:lineitem . ?l
> > tpch:l_extendedprice ?ep . ?l tpch:l_returnflag "r" . ?l
> > tpch:l_deliverydate ?d . filter (?d > "2009-01-01"^^xsd:date) }
> >
> > Suppose that the data were stored as physical triples.  There would
> > be two places to begin the join, one with returnflag and the other
> > with delivery date.
> >
> > We presume that there are relatively few returned items.  Hence the
> > optimizer, seeing the "r", would begin with this.  If this were "n"
> > for not returned, the date would be the better starting point.  If
> > an automatic reuse of plans were to take place, we would lose the
> > advantage given by knowing the literal at compile time for the
> > second execution of a query with different literals.
> >
> > Needless to say, we would not consider variables in predicate or
> > class positions as substitutable in query reuse.
> >
> > So, when one puts a parameter in a query, one de facto states that
> > the run time value will have little impact on cardinality for
> > optimization purposes.  This is the difference between automatic
> > reuse and explicit parameters.  Further, automatic reuse is more
> > code and takes longer at run time, even though over 90% of compiler
> > time goes to deciding the join order.
> >
> > One could further finesse an automatic reuse system by making a
> > literal substitutable only if it were seen from the predicate that
> > it did in fact have a fairly even distribution and if the comparison
> > were equality.  Thus the query for the value of returned items would
> > not be a candidate since the date is a filter with > and the return
> > flag does not have an even distribution.  Further, plan reuse should
> > only be considered for queries where the actual execution time ended
> > up being short, let us say less than 50 times the query compilation
> > time and under a second of real time.
> >
> > To summarize, automatic reuse can have, in spite of all the above
> > measures, some surprising effects on optimizability.  I would say
> > that a scheme like the one outlined above would make automatic reuse
> > relatively safe, though.
> >
> > As a developer of online applications and DBMSs for such, I still
> > would definitely use parameters where these did fit, as per SQL usage.
> > I would also not wish the optimizer to change execution plans for
> > queries that repeated all the time.  I would in fact go as far as
> > making some statistics fixed so as to keep the plan stable.  Oracle
> > for example has means for this in SQL.  The rationale is not to have
> > the best plan but to have predictable response.  The scenario to
> > avoid is having an online portal suddenly lose performance because
> > of some shift in cardinalities leading to the system switching to a
> > bad plan.  Having to debug such in a live situation should be
> > avoided.  Parameters are a way of making things more predictable by
> > explicitly identifying the places where even distribution is
> > expected.  In online applications, the developer ought to know this.
> >
> > For analytics and ad hoc querying, parameters are not an issue to
> > begin with.
> >
> >
> > Orri
> >
> 
> --
> Steve Harris
> Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
> +44(0)20 8973 2465  http://www.garlik.com/
> Registered in England and Wales 535 7233 VAT # 849 0517 11
> Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10
> 9AD
>
Received on Wednesday, 25 March 2009 15:21:38 UTC