Using C-SPARQL with a Fuseki Endpoint "FROM" source. (was Fwd: C-SPARQL interest and questions)

Buried in the rather involved prior thread is a question regarding Fuseki and how C-SPARQL can pull in the contents of a given repository. As you mentioned (my emphasis added):

> I’m not familiar with fuseki:serviceConstruct. Does it return the entire content of the repository in RDF/XML format? As far as I know, this is the only way you can get Jena ARQ (used internally to the C-SPARQL Engine) to load a remote graph using the FROM clause.
 
That comment was in the context of me trying to use the C-SPARQL “FROM” clause, as in the example, to pull content from Fuseki:

> FROM <http://localhost:3031/ds/sparqlc>


You mentioned that ARQ pulls in the entire contents from the service, which gave me the information I needed to get this working.

To use the C-SPARQL FROM clause with Fuseki, one must first enable the correct Fuseki service by adding something like the following line to the Fuseki config:

   fuseki:serviceReadGraphStore        "get" ;

and then access it using, e.g.,

FROM <http://localhost:3031/ds/get?default>

(“default” in my case, as I’m interacting with the default graph.)
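For completeness, here is roughly the shape of the relevant service config, as I understand it (the service and dataset names are illustrative, chosen to match the URL above; the key addition is the serviceReadGraphStore line):

   <#service1>  rdf:type fuseki:Service ;
       fuseki:name                  "ds" ;
       fuseki:serviceQuery          "sparql" ;
       # Exposes HTTP GET of a graph's full contents (SPARQL Graph Store
       # Protocol); this is what ARQ needs to resolve a remote FROM <uri>:
       fuseki:serviceReadGraphStore "get" ;
       fuseki:dataset               <#dataset> .

A plain HTTP GET of http://localhost:3031/ds/get?default is a quick way to confirm the service is wired up (it should return a serialization of the entire default graph) before pointing the C-SPARQL engine at it.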

Understanding this, I also believe I understand some interesting characteristics of C-SPARQL:
- the entire contents are pulled in from the endpoint, regardless of size (!)
- the process can take quite some time, depending on the size of the endpoint’s contents and on the processing resources available (memory, CPU, network bandwidth, etc.)
- this appears to be a one-time snapshot and, if so, the store contents would be fixed at the time the C-SPARQL engine retrieves them.

Knowing this is important: a SPARQL endpoint used this way cannot be relied upon as a “live” resource. That’s probably OK, especially since this is a continuous query system and, as such, could handle updates in a streaming manner.

There are, of course, real challenges to accessing this content any other way. In fact, how you’re doing this seems consistent with what we teach our Streams SPL developers: caching is preferred over fine-grained remote queries, since every “query out” to look up external content during stream processing adds significant latency.

Questions that this provokes:
- Is it correct that the full contents are pulled from the source only once?
- When does this happen? At query registration time? The first time processing is triggered?
- For large content pulls, are stream triple arrivals held off until the source contents (from all non-stream sources) are completely read in, or must our code make provisions to wait?
- Is the cache ever refreshed (or should it be) when the source content is updated? (This could be done in the background, but it’s not a simple thing, I should think.)
- If not, I’m thinking updates could be handled by posting triples to both the endpoint and to another stream, for future reference.

That last case is what I might call a “reference window” pattern — a stream that accumulates reference contents that are never removed. For that content to be available for all subsequent queries, a (nearly) infinite window duration would be needed. This is where very large count-based tumbling windows would be very useful.
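A sketch of what I have in mind (the query name, stream URIs, and triple patterns are hypothetical; [TRIPLES n] is, if I read the C-SPARQL spec right, its physical, count-based window form, here with a deliberately huge count):

REGISTER QUERY ReferenceJoin AS
PREFIX ex: <http://example.org/>
SELECT ?item ?label
# the live stream, windowed as usual
FROM STREAM <http://example.org/updates>   [ RANGE 30m STEP 15s ]
# the "reference window": a count-based window so large that, in
# practice, accumulated reference triples are never evicted
FROM STREAM <http://example.org/reference> [ TRIPLES 1000000 ]
FROM <http://localhost:3031/ds/get?default>
WHERE {
  ?item ex:relatesTo ?ref .
  ?ref  ex:label     ?label .
}

Whether the engine would actually retain that many triples efficiently, and whether [TRIPLES n] behaves as tumbling or sliding here, are among the open questions.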

This approach brings up questions about processing triggers:
- Does processing occur when any of a query’s stream inputs completes a window step?
- If so, do all of the streams’ current windows supply triples for the query evaluation, even if no processing trigger has occurred for the other streams?


Regards, and Thanks,
Mark
 

Begin forwarded message:

> Resent-From: public-rsp@w3.org
> From: Emanuele Della Valle <emanuele.dellavalle@polimi.it>
> Subject: Re: C-SPARQL interest and questions
> Date: October 14, 2014 at 12:49:08 AM EDT
> To: Mark Feblowitz <MarkFeblowitz@comcast.net>
> Cc: Emanuele Della Valle <emanuele.dellavalle@polimi.it>, Andy Seaborne <andy@apache.org>, Marco Balduini <marco.balduini@polimi.it>, "public-rsp@w3.org" <public-rsp@w3.org>
> 
> Dear Mark,
> 
> I’ll try to answer inline.
> 
> 
> On 09 Oct 2014, at 23:23, Mark Feblowitz <MarkFeblowitz@comcast.net> wrote:
> 
>> I will send in comments on the items not working or not documented. Is there a recent language reference that is consistent with the engine’s implementation? That would be quite helpful.
> 
> No. The best material we have is the RSP tutorial: http://streamreasoning.org/rsp2014
> More about naive reasoning and background-knowledge support will appear at http://www.streamreasoning.org/events/sr4ld2014
> 
> Generally speaking, you should be able to register any SPARQL 1.1 query.
> 
>> Are the more recent builds as stable as the May 2014 release? If so, I will happily pick up that latest stable build.
> 
> As I said, we are working on it.
> 
>> As for the exception trace below. I was trying to use these FROM elements in my continuous queries:
>> 
>> REGISTER ...
>> PREFIX …
>> CONSTRUCT { … } 
>> FROM STREAM <http://km.sp.ibm.com/cqeStream>  [ RANGE 30m STEP 15s ] 
>> FROM <http://localhost:3031/km4sp/sparqlc>
>> WHERE { 
>>  { SELECT ?G
>> ...
>>  }
>> GROUP BY ?G
>> HAVING ( COUNT(DISTINCT ?L) > 5 )
>> }
>> 
>> The FROM <uri> is a Fuseki endpoint that I am hoping to use for C-SPARQL querying of “at-rest” triples.
> 
> The FROM clause specifies the RDF dataset as in SPARQL, but it does not support the NAMED clause. If http://localhost:3031/km4sp/sparqlc returns an RDF graph, then the query should work.
> 
> What does ?G match? Can you send the full query with some example data?
> 
>> The exception mentions update, but that’s not what the engine is trying to do (I don’t think). When I use a file URI, the query functions as expected. 
>> 
>> Is there something about Fuseki’s service that makes it incompatible with the May 2014 Engine and SPARQL server access?
> 
> Maybe.
> 
> 
>> Might it be fixed in a more recent build? Or am I pointing the engine at the wrong type of service?
>>  
>> <#service1>  rdf:type fuseki:Service ;
>>     fuseki:name              "km4sp" ;
>>     fuseki:serviceQuery      "sparql" ;
>>     fuseki:serviceConstruct  "sparqlc" ;
>>     fuseki:serviceUpdate     "update" ;
>>     fuseki:serviceUpload     "upload" ;
>>     fuseki:dataset           <#dataset> ;
>> 
>> I’ve tried pointing it at the Query service (“sparql”), the Update service (“update”), and the Construct service (“sparqlc”), all with the same outcome. I’ve also tried an IP address rather than “localhost”.
> 
> I’m not familiar with fuseki:serviceConstruct. Does it return the entire content of the repository in RDF/XML format? As far as I know, this is the only way you can get Jena ARQ (used internally to the C-SPARQL Engine) to load a remote graph using the FROM clause.
> 
> The new version of the engine will come with an in-memory dataset where you can store and update the background knowledge. I recommend you try this new feature. The hands-on session of http://www.streamreasoning.org/events/sr4ld2014 will include an example of how to use this feature.
> 
> 
>> As for the duplicate result issue: in the example above, each time the window closes (every 15s), the engine appears to find the criteria in the window to identify N > 5 distinct ?L for the group ?G and emits the CONSTRUCT clause, until N becomes less than 5 (the slide eventually eliminates enough ?L’s). I’m not sure whether that’s the desired behavior. It may be that there are no windowing criteria that would call for the match to happen only once.
> 
> I still do not fully get the point. Probably I need to see the entire query to understand.
> 
>> What I’m thinking of doing is either removing duplicate emitted “CONSTRUCT” clauses or amending my query criteria to allow for only a single match. 
>> 
>> As I write this, though, it does seem like a streaming/windowing FAQ that I will learn how to address as I learn more about this particular form of windowed aggregation. And I’m just learning how C-SPARQL handles its “ejection criteria” (what gets ejected from the window) and its “processing criteria” (which I now understand to be “window closing” criteria, in your words). IBM Streams and, I imagine, Esper, have different handling of each.
> 
> Indeed.
> 
> Best Regards,
> 
> Emanuele
> 

Received on Friday, 17 October 2014 16:40:40 UTC