RE: On Parameters from Orri Erling on 2009-03-25 (public-rdf-dawg@w3.org from January to March 2009)

From: Orri Erling <erling@xs4all.nl>
Date: Wed, 25 Mar 2009 21:50:43 +0100
To: "'Ivan Mikhailov'" <imikhailov@openlinksw.com>, "'Seaborne, Andy'" <andy.seaborne@hp.com>
Cc: "'Steve Harris'" <steve.harris@garlik.com>, "'SPARQL Working Group'" <public-rdf-dawg@w3.org>
Message-Id: <200903252051.n2PKpaLG052640@smtp-vbr5.xs4all.nl>
All

It seems to me that the whole topic is readily amenable to experiment.  The
rest of the post is concerned with that.
As a preamble, I would say that update, expressions, aggregation, group by
and subselects are quite a bit more important than parameters.



If one makes an online service that must do the same query over and over,
like showing a dashboard, chances are that it is not done with the
SPARQL protocol.  Instead, it will use a database-specific CLI which may have
parameters, stored procedures, whatever.  How do you go about
this at Garlik?  Certainly, OpenLink applications that query RDF do not have
a web server talking to a triple store over the SPARQL protocol.

We could say with some justification that we leave such cases to vendor
CLIs, where something like the Jena query execution API, also
supported by OpenLink, is the analog of JDBC/ODBC.


Thus we are left with the question of federation.  Federation will have
situations of joining between end points.  Due to latency, it will 
be necessary to pass many requests in one round trip.  I would not consider
this an optimization but rather a sine qua non of
workable federated joins.  Even so, there will be people who say that
federated joins will not scale.  What is considered sufficient will 
depend on the use case.

As pointed out before, some end point query caching logic and HTTP/1.1
pipelining can deliver a lot of what parameters can.  

Since this discussion has already produced a concept for query reuse without
parameters, it may well be that we implement it.  The
greatest mileage would be in SPARQL benchmarks for now, and later in federation.
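
To make the idea of reuse without explicit parameters concrete, here is a
minimal sketch of one way an end point might do it.  This is an assumption
for illustration, not the concept as it was proposed on the list: constants
are stripped from the query text, and the remaining template is the key of a
plan cache.  The names normalize, PlanCache and compile_fn are hypothetical,
and the regular expression only handles IRIs in angle brackets and simple
double-quoted literals.

```python
import re

# Assumption: constants are IRIs in angle brackets or simple double-quoted
# literals without escapes. A real end point would use its own parser.
_CONSTANT = re.compile(r'<[^<>\s]*>|"[^"\\]*"')


def normalize(query_text):
    """Split a query into a constant-free template and its constants."""
    constants = []

    def _swap(match):
        constants.append(match.group(0))
        return f"$c{len(constants) - 1}"

    template = _CONSTANT.sub(_swap, query_text)
    return template, constants


class PlanCache:
    """Caches compiled plans keyed by the constant-free template."""

    def __init__(self, compile_fn):
        # compile_fn is a hypothetical hook: template text -> plan object.
        self._compile = compile_fn
        self._plans = {}
        self.misses = 0

    def plan_for(self, query_text):
        template, constants = normalize(query_text)
        plan = self._plans.get(template)
        if plan is None:
            # Only the first query of a given shape pays for compilation.
            self.misses += 1
            plan = self._compile(template)
            self._plans[template] = plan
        return plan, constants
```

Two queries that differ only in their constants then share one compiled plan,
at the price of a normalization pass over the text of every request.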

The question then becomes, how much worse is this than parameters?  Is the
penalty of reuse without explicit parameters significant in 
comparison to 100 ms of latency on a wide area network?  Is it significant
in comparison to 200 us on a LAN?  Is the cost of no reuse of 
plans at all significant when compared to wide area network latency?  If
reuse is found to be a performance factor, then what is the added 
implementation cost of reuse without parameters as opposed to explicit
parameters?

All these questions have a quantitative answer.  A quick estimate can be had
in a matter of a few hours.

It is increasingly clear to me that parameters should be viewed in the
context of being an enabler of federation, since the app server to
DBMS connection will likely go over a connected CLI which is anyway outside
the scope of the present work.  In the days of the SAG CLI (SQL
Access Group/Call Level Interface), client-server was the thing; today it
is a worldwide interoperable data infrastructure.

Interoperable federation is something where we do need the protocol, and
informally appointing Jena query executions to be the JDBC of RDF
will not save us here.

The justification of parameters is tied to the place of federation in the
SPARQL 1.1 spec.  
Federation itself should be considered in the light of use cases.  Whether
parameters or query reuse will make or break these cases is the 
real question.  Suppose the use case were querying social network data for
purposes of authorization, as in social spam filtering.  This
would exhibit a high frequency of short queries, potentially ranging over
many sources.  Such queries would also involve joins across
sources if the criteria were something like "friends of friends are allowed
to read" and if we accepted that not all knows relations were on the
same end point.

Also, things like calculating search ranks involve joining essentially
everything with everything, across partitions and servers:
regardless of the specific rank function, attributes of the referrer affect
the value of the reference.

Except for special cases where the issuer of the queries has in depth
knowledge of data colocation, federation will involve short query 
fragments sent to many places. These fragments must come in large batches to
overcome latency.  The larger the batch, the more benefit there 
is in optimizing away query compilation, since the network latency is
counted once and the query compilation time is counted per row of 
parameters.
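
The batching argument above can be written down as a small cost model.  This
is only a sketch under simplifying assumptions: one round trip per batch, a
constant compilation and execution cost per row, and no overlap between
network and processing time.  The function names are made up for
illustration.

```python
def batch_time(rows, latency_s, compile_s, execute_s, reuse_plan):
    """Wall time for one batch of `rows` parameter rows, in seconds.

    Network latency is paid once per batch. Compilation is paid once when
    the plan is reused and once per row otherwise.
    """
    compiles = 1 if reuse_plan else rows
    return latency_s + compiles * compile_s + rows * execute_s


def compile_share(rows, latency_s, compile_s, execute_s):
    """Fraction of the batch time spent compiling when no plan is reused."""
    total = batch_time(rows, latency_s, compile_s, execute_s, reuse_plan=False)
    return rows * compile_s / total
```

With the 100 ms wide area latency mentioned earlier and any fixed per-row
costs plugged in, compile_share grows with the batch size: at one row the
latency dominates, while at large batches the share approaches
compile_s / (compile_s + execute_s).  The per-row costs themselves have to
come from measurement.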


So, to bring some real world data into the question, I would ask
implementors to disclose some metric of query compilation cost vs.
execution cost for a query of the form
select ?o from <xx> where { <s> ?p ?o }
where <s> varies between executions.  As a data set, we
could have some scale of LUBM to make matters simple.
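
One way to collect such a metric is a small timing harness along the
following lines.  It is a sketch only: compile_fn and execute_fn are
hypothetical hooks that each implementor would bind to the store's own
"prepare" and "run" calls, and the subject IRIs would come from the chosen
data set.

```python
import time

# The query shape under test; the subject IRI varies between executions.
QUERY_TEMPLATE = "select ?o from <xx> where { <%s> ?p ?o }"


def measure(subjects, compile_fn, execute_fn):
    """Return (mean compile seconds, mean execute seconds) per query.

    compile_fn(query_text) -> plan and execute_fn(plan) are hypothetical
    hooks supplied by the implementor; results are consumed by execute_fn.
    """
    if not subjects:
        raise ValueError("need at least one subject IRI")
    compile_total = 0.0
    execute_total = 0.0
    for subject in subjects:
        text = QUERY_TEMPLATE % subject
        t0 = time.perf_counter()
        plan = compile_fn(text)
        t1 = time.perf_counter()
        execute_fn(plan)
        t2 = time.perf_counter()
        compile_total += t1 - t0
        execute_total += t2 - t1
    count = len(subjects)
    return compile_total / count, execute_total / count
```

The two means can then be set against the network latency of the federation
use case in question.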

Then we contrast this against the wide area network latencies in the
federation use cases that we end up tackling and we are on an objective 
footing.

These considerations would defer the decision on parameters and would couple
this with federation.  We do note that even if federation were 
not tackled in SPARQL 1.1, the presence/absence of parameters might still
impact the performance of federation by a significant amount.

We will give some numbers to back this position in the coming weeks.  It
remains to be demonstrated whether parameters are a make or break point for
federation.  We shall see.


Orri
Received on Wednesday, 25 March 2009 20:59:49 UTC