Re: Protocol extensions for federated querying

From: Andreas Langegger <al@jku.at>
Date: Thu, 22 Oct 2009 17:35:28 +0200
Cc: Paul Gearon <gearon@ieee.org>, "public-rdf-dawg@w3.org" <public-rdf-dawg@w3.org>
Message-Id: <A0308872-0138-4300-A932-7D60C9771357@jku.at>
To: "Seaborne, Andy" <andy.seaborne@hp.com>

On Oct 22, 2009, at 5:21 PM, Seaborne, Andy wrote:
> An alternative design is to regard the bindings as initial values  
> and evaluation is anything that's equivalent to a loop taking rows  
> one at a time, substituting for variables and evaluating the query.   
> This is nearly the same as a join except when certain nested  
> optionals forms are in the query.  It’s a bit more in keeping with  
> streaming the bindings in while streaming results out.  I think this  
> is what is called a "bind join" in the Garlic (at IBM) work from  
> some time ago.

Oh yes, why not stream fully? At the moment I do row blocking with a
special OpRepeatApply that takes multiple bindings from the left to do
blocked substitution and a bind join (yes, as in IBM Garlic). Streaming
should be possible if the BINDINGS part is at the end and the query plan
can be constructed before the parser reaches the end of the query
string. However, that would require heavy changes and callbacks between
the parser and the query engine: challenging, but maybe fun to try.
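
The row-blocking idea can be sketched in plain Java. This is only an illustration, not the OpRepeatApply code: the class BlockedBindJoin, its methods, and the `remote` callback (standing in for one HTTP round trip that ships a block of bindings to an endpoint) are hypothetical names, and rows are plain maps instead of ARQ bindings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch only: all names here are assumptions, not ARQ or SemWIQ API.
final class BlockedBindJoin {

    /** Two rows are compatible if every variable they share has the same value. */
    public static boolean compatible(Map<String, String> a, Map<String, String> b) {
        for (Map.Entry<String, String> e : a.entrySet()) {
            String other = b.get(e.getKey());
            if (other != null && !other.equals(e.getValue())) {
                return false;
            }
        }
        return true;
    }

    /** Merge two compatible rows into one row carrying all bindings. */
    public static Map<String, String> merge(Map<String, String> a, Map<String, String> b) {
        Map<String, String> out = new HashMap<>(a);
        out.putAll(b);
        return out;
    }

    /**
     * Streams rows from the left side, collects them into blocks of at most
     * blockSize rows (blockSize is expected to be at least 1), and calls
     * `remote` once per block. The remote side is expected to return only
     * rows that match some binding in the block.
     */
    public static List<Map<String, String>> join(
            Iterator<Map<String, String>> left,
            int blockSize,
            Function<List<Map<String, String>>, List<Map<String, String>>> remote) {
        List<Map<String, String>> results = new ArrayList<>();
        List<Map<String, String>> block = new ArrayList<>();
        while (left.hasNext()) {
            block.add(left.next());
            // Flush when the block is full or the left input is exhausted.
            if (block.size() == blockSize || !left.hasNext()) {
                List<Map<String, String>> right = remote.apply(block);
                for (Map<String, String> l : block) {
                    for (Map<String, String> r : right) {
                        if (compatible(l, r)) {
                            results.add(merge(l, r));
                        }
                    }
                }
                block = new ArrayList<>();
            }
        }
        return results;
    }
}
```

With a block size of 1 this degenerates into a plain bind join with one round trip per left row; larger blocks trade memory for fewer round trips, which is the latency saving discussed in this thread.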

For similar reasons (separation of the ARQ parser and the query engine) I
decided to use a materialized blocking approach based on OpTable
instead of setting initial bindings when constructing the execution.
When parsing the bindings, we have to store them in the Query object
and thus materialize them anyway before the query is handed over to
the execution.

The modified parser sets the initial bindings table in Query, and then
in AlgebraGenerator I have:

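     public Op compile(Query query) {
         // Not compileElement - may need to apply simplification.
         Op pattern = compile(query.getQueryPattern()) ;
         TableN initBindings = query.getInitialBindingTable();
         if (initBindings != null)
             // The second argument was cut off in the archived message;
             // OpTable.create(initBindings) is an assumption based on the
             // OpTable approach described above.
             pattern = OpJoin.create(pattern, OpTable.create(initBindings)) ;
         Op op = compileModifiers(query, pattern) ;
         return op ;
     }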

(...OpJoin will become OpSequence after TransformJoinStrategy)


> 	Andy
>> For scalable federation over public SPARQL endpoints, however, I'm more
>> than sceptical, since I've done much research and many experiments in
>> this direction. My SemWIQ [1] mediator works only with patched
>> endpoints that support SPARQL BINDINGS and RDFStats [2]. I think
>> issuing COUNT queries beforehand may not scale well. Initial bindings
>> mainly reduce the latency of HTTP connections, but they speed up
>> federation only linearly. If there are many distributed joins,
>> even bind joins (dynamic optimization by substitution) become
>> troublesome...
>> Regards,
>> Andy
>> [1] http://semwiq.sourceforge.net
>> [2] http://rdfstats.sourceforge.net
>> On Oct 20, 2009, at 9:51 PM, Paul Gearon wrote:
>>> Hi everyone,
>>> This meets the commitment I made for ACTION-124.
>>> So far, all the comments I've seen on federated queries have been
>>> about the suggested query syntax. To date I'm in agreement with what
>>> I've seen proposed.
>>> I am also interested in extending the protocol to support federation a
>>> little better. At the moment, all queries are done as a simple request
>>> via a GET or a POST. In the case of POST, the endpoint alone is
>>> provided in the URL, and the query appears in the body.
>>> I'd like to see a form of POST that includes a SPARQL variable binding
>>> result in the body (a la http://www.w3.org/TR/rdf-sparql-XMLres/). In
>>> this way the receiving query engine can work with prebindings that are
>>> provided to it, allowing it to reduce the result that is to be
>>> streamed back to the calling engine.
>>> To give an example, I'll reference the two datasets found in 8.3 of
>>> the SPARQL Query Language document:
>>> http://www.w3.org/TR/rdf-sparql-query/#queryDataset
>>> If we make the presumption that the named graph
>>> http://example.org/foaf/aliceFoaf can be found at
>>> http://sparql.org/sparql/, then I might want to issue the following
>>> query to get the names of people whose nicknames are in the bobFoaf
>>> graph:
>>> SELECT ?nick ?name
>>> FROM <http://example.org/foaf/bobFoaf>
>>> WHERE {
>>> ?p1 foaf:nick ?nick .
>>> ?p1 foaf:mbox ?mbox
>>> SERVICE <http://sparql.org/sparql/> {
>>>  SELECT ?mbox ?name
>>>  FROM <http://example.org/foaf/aliceFoaf>
>>>  WHERE { ?p2 foaf:mbox ?mbox . ?p2 foaf:name ?name }
>>> }
>>> }
>>> The part of the query in the SERVICE block would usually return the
>>> following:
>>> <?xml version="1.0"?>
>>> <sparql xmlns="http://www.w3.org/2005/sparql-results#">
>>> <head>
>>>  <variable name="mbox"/>
>>>  <variable name="name"/>
>>> </head>
>>> <results>
>>>  <result>
>>>    <binding name="mbox"><uri>mailto:alice@work.example</uri></binding>
>>>    <binding name="name"><literal>Alice</literal></binding>
>>>  </result>
>>>  <result>
>>>    <binding name="mbox"><uri>mailto:bob@work.example</uri></binding>
>>>    <binding name="name"><literal>Bob</literal></binding>
>>>  </result>
>>> </results>
>>> </sparql>
>>> Note that this is information for both Bob and Alice. This can then be
>>> joined to the remainder of the query, which reduces the results to
>>> just Bob.
>>> However, a query engine may instead want to evaluate Bob first. This
>>> may be desirable if some COUNT queries have already been issued, and
>>> the query engine knows that the results of the SERVICE block will
>>> return a large number of results, while the local data would bind
>>> ?mbox to only a few values. In that case, the local binding of ?mbox
>>> could be sent along with the query (?p1 and ?nick are not necessary
>>> for the remote service). This could be accomplished using a POST that
>>> has the query in the URL, and the bindings in the body.
>>> POST /sparql/?query=SELECT+%3Fmbox+%3Fname+FROM+%3Chttp%3A%2F%2Fexample.org%2Ffoaf%2FaliceFoaf%3E+WHERE+%7B+%3Fp2+foaf%3Ambox+%3Fmbox+.+%3Fp2+foaf%3Aname+%3Fname+%7D HTTP/1.1
>>> Content-Length: xxxxxx
>>> Content-Type: multipart/form-data; boundary=ZpwZZc62ZXXjf0InvlrBjTWNrJSp--FL
>>> Host: sparql.org
>>> Connection: Keep-Alive
>>> User-Agent: example
>>>
>>> --ZpwZZc62ZXXjf0InvlrBjTWNrJSp--FL
>>> Content-Disposition: form-data; name="query-prebinding"
>>> Content-Type: text/plain; charset=UTF-8
>>> Content-Transfer-Encoding: 8bit
>>>
>>> <?xml version="1.0"?>
>>> <sparql xmlns="http://www.w3.org/2005/sparql-results#">
>>> <head>
>>>  <variable name="mbox"/>
>>> </head>
>>> <results>
>>>  <result>
>>>    <binding name="mbox"><uri>mailto:bob@work.example</uri></binding>
>>>  </result>
>>> </results>
>>> </sparql>
>>> --ZpwZZc62ZXXjf0InvlrBjTWNrJSp--FL--
>>> With this pre-binding, the remote query engine is able to reduce its
>>> results to just the one for Bob, thereby cutting the returned size
>>> down by nearly half.
>>> One potential issue is for very long queries that also want to be
>>> placed into the body of a POST. In that case we could simply define
>>> the names of each section (in the example above I've used a name of
>>> "query-prebinding").
>>> What do others think? Does this proposal have merit?
>>> Regards,
>>> Paul Gearon
>> http://www.langegger.at
>> ----------------------------------------------------------------------
>> Dipl.-Ing.(FH) Andreas Langegger
>> FAW - Institute for Application-oriented Knowledge Processing
>> Johannes Kepler University Linz
>> A-4040 Linz, Altenberger Straße 69

Received on Thursday, 22 October 2009 15:36:14 UTC