- From: Orri Erling <erling@xs4all.nl>
- Date: Wed, 25 Mar 2009 21:50:43 +0100
- To: "'Ivan Mikhailov'" <imikhailov@openlinksw.com>, "'Seaborne, Andy'" <andy.seaborne@hp.com>
- Cc: "'Steve Harris'" <steve.harris@garlik.com>, "'SPARQL Working Group'" <public-rdf-dawg@w3.org>
All,

It seems to me that the whole topic is readily amenable to experiment. The rest of this post is concerned with that.

As a preamble, I would say that update, expressions, aggregation, group by and subselects are quite a bit more important than parameters. If one makes an online service that must run the same query over and over, like showing a dashboard, chances are that it is not done with the SPARQL protocol. Instead, it will use a database-specific CLI which may have parameters, stored procedures, whatever. How do you go about this at Garlik? Certainly, OpenLink applications that query RDF do not have a web server talking to a triple store over the SPARQL protocol. We could say with some justification that we leave such cases to vendor CLIs, where something like the Jena query execution API, also supported by OpenLink, is the analog of JDBC/ODBC.

Thus we are left with the question of federation. Federation will have situations of joining between end points. Due to latency, it will be necessary to pass many requests in one round trip. I would not consider this an optimization but rather a sine qua non of workable federated joins. Even so, there will be people who say that federated joins will not scale. What is considered sufficient will depend on the use case.

As pointed out before, some end point query caching logic and HTTP/1.1 pipelining can deliver a lot of what parameters can. Since this discussion has already produced a concept for query reuse without parameters, it may well be that we implement it. The greatest mileage would be in SPARQL benchmarks for now, later in federation.

The question then becomes, how much worse is this than parameters? Is the penalty of reuse without explicit parameters significant in comparison to 100 ms of latency on a wide area network? Is it significant in comparison to 200 µs on a LAN? Is the cost of no reuse of plans at all significant when compared to wide area network latency?
If reuse is found to be a performance factor, then what is the added implementation cost of reuse without parameters as opposed to explicit parameters? All these questions have a quantitative answer. A quick estimate can be had in a matter of a few hours.

It is increasingly clear to me that parameters should be viewed in the context of being an enabler of federation, since the app server to DBMS connection will likely go over a connected CLI, which is anyway outside the scope of the present work. In the days of the SAG CLI (SQL Access Group/Call Level Interface), client-server was the thing; today it is a worldwide interoperable data infrastructure. Interoperable federation is something where we do need the protocol, and informally appointing Jena query executions to be the JDBC of RDF will not save us here.

The justification of parameters is tied to the place of federation in the SPARQL 1.1 spec. Federation itself should be considered in the light of use cases. Whether parameters or query reuse will make or break these cases is the real question.

Suppose the use case were querying social network data for purposes of authorization, as in social spam filtering. This would exhibit a high frequency of short queries, potentially ranging over many sources. Such queries would also involve joins across sources if the criteria were of the form "friends of friends are allowed to read" and if we accept that not all "knows" relations are on the same end point. Also, things like calculating search ranks involve joining essentially everything with everything, across partitions and servers: regardless of the specific rank function, attributes of the referrer affect the value of the reference.

Except for special cases where the issuer of the queries has in-depth knowledge of data colocation, federation will involve short query fragments sent to many places. These fragments must come in large batches to overcome latency.
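
The batching argument can be put in rough numbers. The Python sketch below is only an illustrative cost model, not a measurement. The function names and the 50 µs compile and 20 µs execution figures are assumptions invented for the example; only the 100 ms WAN and 200 µs LAN latencies come from the questions raised earlier in this post.

```python
def batch_time_us(n_queries, latency_us, compile_us, exec_us, reuse_plan):
    """Time for one round trip carrying n_queries query fragments.

    Latency is paid once per batch. Compilation is paid once if the plan
    is reused, otherwise once per fragment. Execution is paid per fragment.
    """
    compiles = 1 if reuse_plan else n_queries
    return latency_us + compiles * compile_us + n_queries * exec_us


def reuse_saving(n_queries, latency_us, compile_us, exec_us):
    """Fraction of the no-reuse batch time that plan reuse would save."""
    without = batch_time_us(n_queries, latency_us, compile_us, exec_us, False)
    with_reuse = batch_time_us(n_queries, latency_us, compile_us, exec_us, True)
    return (without - with_reuse) / without


if __name__ == "__main__":
    COMPILE_US = 50  # hypothetical per-query compilation cost
    EXEC_US = 20     # hypothetical per-query execution cost
    for label, latency_us in (("WAN", 100_000), ("LAN", 200)):
        for n in (1, 100, 10_000):
            saving = reuse_saving(n, latency_us, COMPILE_US, EXEC_US)
            print(f"{label} batch={n}: plan reuse saves {saving:.1%}")
```

Real figures from implementors would replace the two made-up constants; the model itself only encodes that latency is counted once per batch while compilation, absent reuse, is counted once per fragment.
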
The larger the batch, the more benefit there is in optimizing away query compilation, since the network latency is counted once and the query compilation time is counted per row of parameters.

So, to bring some real world data into the question, I would ask implementors to disclose some metric of query compilation cost vs. execution cost for a query of the form

    select ?o from <xx> where { <s> ?p ?o }

where <s> varies between executions. As a data set, we could use some scale of LUBM to keep matters simple. Then we contrast this against the wide area network latencies in the federation use cases that we end up tackling, and we are on an objective footing.

These considerations would defer the decision on parameters and would couple it with federation. We do note that even if federation were not tackled in SPARQL 1.1, the presence or absence of parameters might still impact the performance of federation by a significant amount. We will give some numbers to back this position in the coming weeks.

It remains to be demonstrated whether parameters are a make or break point for federation. We shall see.

Orri
Received on Wednesday, 25 March 2009 20:59:49 UTC