RE: On Parameters from Ezzat, Ahmed on 2009-03-25 (public-rdf-dawg@w3.org from January to March 2009)

From: Ezzat, Ahmed <Ahmed.Ezzat@hp.com>
Date: Wed, 25 Mar 2009 21:37:16 +0000
To: Orri Erling <erling@xs4all.nl>, 'Ivan Mikhailov' <imikhailov@openlinksw.com>, "Seaborne, Andy" <andy.seaborne@hp.com>
CC: 'Steve Harris' <steve.harris@garlik.com>, 'SPARQL Working Group' <public-rdf-dawg@w3.org>
Message-ID: <3B7AE9BA67C72B4891EF21842246A21C4132C30D9E@GVW1097EXB.americas.hpqcorp.net>
Hello,

I wanted to introduce myself as a new member of this WG from HP.  My background covers many areas, but SQL is a big part of my activities. Recently data integration has been attracting me, and hence the semantic web...

Here is my input on this thread:
I definitely support having an array/vector (we call it a "row set" in our database) as parameters.  In SQL it has proved useful, and almost all databases support it at the CLI level, as Orri mentioned.
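
To illustrate what row-set parameters look like at the CLI level, here is a minimal sketch using Python's built-in sqlite3 module. It is only an analogy, not our database's interface; the table, column names and sample values are made up for the example. The statement text is supplied once and bound against a whole array of parameter rows in a single call.

```python
import sqlite3

# In-memory database; the "triples" table is a hypothetical stand-in.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")

# The "row set": an array of parameter rows bound to one statement.
rows = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:knows", "ex:carol"),
    ("ex:carol", "foaf:name", "Carol"),
]

# The statement text is given once and run for every row in the array.
conn.executemany("INSERT INTO triples (s, p, o) VALUES (?, ?, ?)", rows)
conn.commit()

# A single-row parameter binding, for comparison.
count = conn.execute("SELECT COUNT(*) FROM triples").fetchone()[0]
knows = [
    r[0]
    for r in conn.execute(
        "SELECT o FROM triples WHERE p = ? ORDER BY o", ("foaf:knows",)
    )
]
```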

Regarding what are effectively prepared statements vs. query caching: in our experience, if you have a good query cache, the value of prepared statements is minimal (not worth it).  However, there are some differences between the two capabilities, at least in our MPP environment:

1.      For a prepared statement, all tables remain open after query completion, while with a query cache that is typically not true.

2.      For a prepared statement, downloading query fragments to the appropriate query execution processes is done once, while with a query cache you need to send the query fragments over IPC every time you reuse the same query plan from the cache.

In our experience, an efficient query cache is good enough...
Regards,

Ahmed


Ahmed K. Ezzat, Ph.D.
HP Fellow, Business Intelligence Software Division
Hewlett-Packard Corporation
11000 Wolf Road, Bldg 42 Upper, MS 4502, Cupertino, CA 95014-0691
Office:      Email: Ahmed.Ezzat@hp.com<mailto:Ahmed.Ezzat@hp.com> Off: 408-447-6380  Fax: 1408796-5427  Cell: 408-504-2603
Personal: Email: AhmedEzzat@aol.com<mailto:AhmedEzzat@aol.com> Tel: 408-253-5062  Fax:  408-253-6271



-----Original Message-----
From: public-rdf-dawg-request@w3.org [mailto:public-rdf-dawg-request@w3.org] On Behalf Of Orri Erling
Sent: Wednesday, March 25, 2009 1:51 PM
To: 'Ivan Mikhailov'; Seaborne, Andy
Cc: 'Steve Harris'; 'SPARQL Working Group'
Subject: RE: On Parameters



All

It seems to me that the whole topic is readily amenable to experiment.  The
rest of the post is concerned with that.
As a preamble, I would say that update, expressions, aggregation, group by
and subselects are quite a bit more important than parameters.



If one makes an online service that must do the same query over and over,
like showing a dashboard, chances are that it is not done with the
SPARQL protocol.  Instead, it will be a database specific CLI which may have
parameters, stored procedures, whatever.  How do you go about
this at Garlik?  Certainly, OpenLink applications that query RDF do not have
a web server talking to a triple store over the SPARQL protocol.

We could say with some justification that we leave such cases to vendor
CLIs, where something like the Jena query execution API, also
supported by OpenLink, is the analog of JDBC/ODBC.


Thus we are left with the question of federation.  Federation will have
situations of joining between end points.  Due to latency, it will
be necessary to pass many requests in one round trip.  I would not consider
this an optimization but rather a sine qua non prerequisite of
workable federated joins.  Even so, there will be people who say that
federated joins will not scale.  What is considered sufficient will
depend on the use case.

As pointed out before, some end point query caching logic and HTTP/1.1
pipelining can deliver a lot of what parameters can.

Since this discussion has already produced a concept for query reuse without
parameters, it may well be that we implement it.  The
greatest mileage would be in SPARQL benchmarks for now, later in federation.

The question then becomes, how much worse is this than parameters?  Is the
penalty of reuse without explicit parameters significant in
comparison to 100 ms of latency on a wide area network?  Is it significant
in comparison to 200 us on a LAN?  Is the cost of no reuse of
plans at all significant when compared to wide area network latency?  If
reuse is found to be a performance factor, then what is the added
implementation cost of reuse without parameters as opposed to explicit
parameters?

All these questions have a quantitative answer.  A quick estimate can be had
in a matter of a few hours.

It is increasingly clear to me that parameters should be viewed in the
context of being an enabler of federation, since the app server to
DBMS connection will likely go over a connected CLI which is anyway outside
the scope of the present work.  In the days of SAG CLI (SQL
Access Group/Call Level Interface), client-server was the thing; today it
is a worldwide interoperable data infrastructure.

Interoperable federation is something where we do need the protocol and
informally appointing Jena query executions to be the JDBC of RDF
will not save us here.

The justification of parameters is tied to the place of federation in the
SPARQL 1.1 spec.
Federation itself should be considered in the light of use cases.  Whether
parameters or query reuse will make or break these cases is the
real question.  Suppose the use case were querying social network data for
purposes of authorization, as in social spam filtering.  This
would exhibit a high frequency of short queries, potentially ranging over
many sources.  Such queries would also involve joins across
sources if they involved criteria like "friends of friends are allowed to
read" and if we accepted that not all knows relations were on the
same end point.

Also, things like calculating search ranks involve joining essentially
everything with everything, across partitions and servers:
regardless of the specific rank function, attributes of the referrer affect
the value of the reference.

Except for special cases where the issuer of the queries has in depth
knowledge of data colocation, federation will involve short query
fragments sent to many places. These fragments must come in large batches to
overcome latency.  The larger the batch, the more benefit there
is in optimizing away query compilation, since the network latency is
counted once and the query compilation time is counted per row of
parameters.
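
The arithmetic behind this can be sketched as a toy cost model. The function names are made up for the illustration, and any compile and execution times fed to it are placeholders to be replaced by measured figures; only the 100 ms wide area latency comes from the discussion above. Latency is charged once per batch, compilation either once (plan reused) or once per row of parameters.

```python
def batch_cost_ms(rows, latency_ms, compile_ms, exec_ms, reuse_plan):
    """Total time for one round trip carrying `rows` parameter rows."""
    # With plan reuse the fragment is compiled once; otherwise once per row.
    compiles = 1 if reuse_plan else rows
    return latency_ms + compiles * compile_ms + rows * exec_ms


def saving_fraction(rows, latency_ms, compile_ms, exec_ms):
    """Share of the batch time removed by compiling only once."""
    without = batch_cost_ms(rows, latency_ms, compile_ms, exec_ms, False)
    reused = batch_cost_ms(rows, latency_ms, compile_ms, exec_ms, True)
    return (without - reused) / without


# Placeholder figures: 100 ms WAN latency, assumed 0.5 ms compile
# and 0.1 ms execution per row. Substitute real measurements.
for n in (1, 10, 100, 1000):
    print(n, round(saving_fraction(n, 100.0, 0.5, 0.1), 3))
```

With a single row the saving is zero by construction; it grows with the batch size because the fixed latency is spread over more rows while the compilation cost is not.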


So, to bring some real world data into the question, I would ask
implementors to disclose some metric of query compilation cost vs.
execution cost for a query of the form select ?o from <xx> where { <s> ?p
?o }, where <s> varies between executions.  As a data set, we
could have some scale of LUBM to make matters simple.
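
A measurement of this kind could be sketched as follows. This is only an outline under assumptions: compile_fn and execute_fn are hypothetical hooks that each implementor would map onto their own engine's compile and execute steps, and the subject IRIs would come from the chosen LUBM scale. The harness substitutes a different <s> into the query text for each execution and accumulates compile and execution time separately.

```python
import time

# Query shape from the text; doubled braces are literal braces for str.format.
TEMPLATE = "select ?o from <{graph}> where {{ <{s}> ?p ?o }}"


def make_query(graph, subject):
    """Build the query text for one subject IRI."""
    return TEMPLATE.format(graph=graph, s=subject)


def measure(compile_fn, execute_fn, graph, subjects):
    """Return average compile and execute time in seconds per query.

    compile_fn(text) -> plan and execute_fn(plan) are hypothetical
    hooks to be supplied per engine; they are not any real API.
    """
    if not subjects:
        raise ValueError("need at least one subject")
    compile_s = 0.0
    execute_s = 0.0
    for subject in subjects:
        text = make_query(graph, subject)
        t0 = time.perf_counter()
        plan = compile_fn(text)
        t1 = time.perf_counter()
        execute_fn(plan)
        t2 = time.perf_counter()
        compile_s += t1 - t0
        execute_s += t2 - t1
    n = len(subjects)
    return {"compile_avg_s": compile_s / n, "execute_avg_s": execute_s / n}


# Toy stand-ins so the sketch runs; replace with real engine calls.
result = measure(str.split, len, "xx", ["http://example.org/s%d" % i for i in range(100)])
```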

Then we contrast this against the wide area network latencies in the
federation use cases that we end up tackling and we are on an objective
footing.

These considerations would defer the decision on parameters and would couple
this with federation.  We do note that even if federation were
not tackled in SPARQL 1.1, the presence/absence of parameters might still
impact the performance of federation by a significant amount.

We will give some numbers to back this position in the next weeks.  It is to
be demonstrated whether parameters are a make or break point for federation.
We shall see.


Orri
Received on Wednesday, 25 March 2009 21:41:39 UTC