Re: C-SPARQL Engine and handling of REGISTER, nested queries from Mark Feblowitz on 2014-10-14 (public-rsp@w3.org from October 2014)

From: Mark Feblowitz <MarkFeblowitz@comcast.net>
Date: Tue, 14 Oct 2014 16:18:06 -0400
To: Emanuele Della Valle <emanuele.dellavalle@polimi.it>
Cc: Andy Seaborne <andy@apache.org>, Marco Balduini <marco.balduini@polimi.it>, "public-rsp@w3.org" <public-rsp@w3.org>
Message-Id: <753703A9-29A3-481C-AF77-6C02977B30AE@comcast.net>
Thanks for the fast turn-around.

I'm looking at your recommended references.

Top-line comments here:

First, the chaining works as I would have expected. Thanks for the how-to pointer, as I would not have figured that out myself. And the proxy seems to prevent a good bit of manual coding.

I’m still a bit confused about the names of the streams (“UpStreamQuery” in “REGISTER STREAM UpStreamQuery AS”). Are the names used anywhere? Is the intent to eventually use them to form the FROM STREAM URI, or even to provide another form, e.g., FROM NAMED STREAM? If not, then perhaps you might want to consider the name a shorthand for the proxy business. I was thinking that I would start with a registered named query (REGISTER STREAM UpStreamQuery AS) and then subscribe to that stream in a subsequent query by, e.g., some URI with “/UpStreamQuery” appended. Or perhaps by expressing FROM NAMED STREAM UpStreamQuery…

Next: Save for a few minor differences in detail, your example looks quite similar to my SPARQL. In a separate message, I’ll send my real queries.

I’m not sure I understand the last example - the one that repeats this subquery a second time and adds a LIMIT 1.

> { SELECT ?region
>   WHERE {
>   ?infectee a Infected .
>   ?infectee livesIn ?region } LIMIT 1 }

I have tried something like the following and it seems to work for me (curly braces might be slightly different):
 
> REGISTER QUERY infecteeInRegionWithPossibleEpidemic AS
> SELECT ?region ?infectee
> FROM STREAM …
> WHERE {
>   { SELECT ?region
>     WHERE {
>       ?infectee a Infected .
>       ?infectee livesIn ?region
>     } GROUP BY ?region
>     HAVING (COUNT(?infectee) >= %%threshold%%)
>     LIMIT 1 }
> }



As my version of this does seem to deliver the results I’m after, the big question is the elided windowing part: the “FROM STREAM …”. I assume that some windowing scheme is required, at least to trigger processing. The tricky part is to get the windowing criteria “right”: to capture the circumstances described and to *not* violate the overall LIMIT 1, even if sufficient infectees are in a subsequent window when processing is triggered.

In a realistic setting I might want to have a long overall monitoring period (one month or one year) and to trigger processing for a much shorter timeframe - perhaps processing daily or hourly (not so good for demos, but good for conveying an understanding). With a 1 year window and a 1 hour slide, repeated processing would almost certainly emit several of each type of notification. Unless, that is, I was able to somehow assert a “completed” statement :(
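
One way to picture the "emit just once" requirement is as a small stateful filter sitting downstream of the query. The following is a minimal sketch only, assuming the deduplication lives in whatever code consumes the query results rather than in C-SPARQL itself; the class name, the alert tuples, and the sample passes are all hypothetical.

```python
class OncePerKey:
    """Pass each (alert_type, region) pair through at most once, ever."""

    def __init__(self):
        # Alerts already emitted by earlier processing passes.
        self._seen = set()

    def filter(self, alerts):
        """Return only the alerts that have not been emitted before."""
        fresh = []
        for alert in alerts:
            if alert not in self._seen:
                self._seen.add(alert)
                fresh.append(alert)
        return fresh


dedup = OncePerKey()

# Hypothetical results of three consecutive evaluations of a
# long-window, short-slide query: the same alert keeps reappearing.
passes = [
    [("PossibleEpidemic", "R1")],
    [("PossibleEpidemic", "R1"), ("SomeInfectees", "R2")],
    [("PossibleEpidemic", "R1"), ("SomeInfectees", "R2"), ("PossibleEpidemic", "R2")],
]

emitted = [dedup.filter(p) for p in passes]
for i, out in enumerate(emitted, start=1):
    print(f"pass {i}: {out}")
```

Each alert type is reported once per region no matter how many overlapping windows contain it. The cost is that the set of seen alerts grows without bound unless something eventually expires it.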

On the other hand, a tumbling window of 1 hour would miss an epidemic that formed over a day.
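
The trade-off between the two window shapes can be simulated outside the engine. This is a rough sketch under stated assumptions: time is measured in whole hours, a window closing at time t covers (t - width, t], every event counts as one infectee, and the data (one infectee per hour in a single region "R1") is invented for illustration. It does not model how the C-SPARQL engine actually schedules evaluations.

```python
from collections import Counter


def window_counts(events, width, slide, end_time):
    """Yield (close_time, infectee count per region) for each evaluation.

    events: list of (timestamp_in_hours, region) pairs.
    """
    t = slide
    while t <= end_time:
        in_window = [region for ts, region in events if t - width < ts <= t]
        yield t, Counter(in_window)
        t += slide


def alerts(events, width, slide, end_time, threshold):
    """Return (close_time, region) for every pass where a region reaches the threshold."""
    return [
        (t, region)
        for t, counts in window_counts(events, width, slide, end_time)
        for region, n in counts.items()
        if n >= threshold
    ]


# Hypothetical data: one new infectee per hour in region "R1" for 24 hours.
events = [(hour, "R1") for hour in range(1, 25)]

# Tumbling: width == slide == 1 hour. Sliding: 24-hour window, 1-hour slide.
tumbling = alerts(events, width=1, slide=1, end_time=48, threshold=5)
sliding = alerts(events, width=24, slide=1, end_time=48, threshold=5)

print("tumbling alerts:", len(tumbling))
print("sliding alerts:", len(sliding), "first at hour", sliding[0][0])
```

With these made-up numbers the tumbling window never sees more than one infectee at a time and stays silent, while the sliding window detects the build-up but repeats the same alert on every pass in which the count is still at or above the threshold.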

I guess this gets to the heart of stream processing, and perhaps to the heart of the adequacy of the supported windowing schemes (or my understanding of them :-S ).  

Next: As TRIPLE-based windows are buggy, I guess I’ll approximate them with very fine-grained time-based windows.

Next: If I understand correctly, counting per group seems to achieve partitioned windows. What am I missing?
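
A possible way to see where the two notions diverge is with count-based windows. The sketch below is only an illustration under assumptions: it contrasts grouping the contents of one shared window (the window builds the dataset, then the groups are evaluated on it) with keeping a separate window per region, which is the usual reading of a partitioned window. Function names and data are hypothetical. For purely time-based windows the per-group counts would be expected to coincide, since every region sees the same time span either way.

```python
from collections import Counter, defaultdict, deque


def grouped_single_window(events, size):
    """One count-based window over the whole stream, then group its contents."""
    window = events[-size:]
    return Counter(region for _, region in window)


def partitioned_windows(events, size):
    """One count-based window per region, each keeping that region's last `size` events."""
    per_region = defaultdict(lambda: deque(maxlen=size))
    for infectee, region in events:
        per_region[region].append(infectee)
    return {region: len(window) for region, window in per_region.items()}


# Hypothetical stream of (infectee, region) pairs, oldest first.
events = [
    ("p1", "R1"), ("p2", "R1"), ("p3", "R1"),
    ("p4", "R2"), ("p5", "R2"), ("p6", "R2"), ("p7", "R2"),
]

single = grouped_single_window(events, size=3)
partitioned = partitioned_windows(events, size=3)

print("group over one window:", dict(single))
print("window per region:    ", partitioned)
```

In the shared window the most recent R2 events push all of R1 out, so R1 has no group at all, whereas the per-region windows still hold three infectees for each region.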



Thanks, again,

Mark
 
On Oct 14, 2014, at 1:24 AM, Emanuele Della Valle <emanuele.dellavalle@polimi.it> wrote:

> Dear Mark,
> 
> On 13 Oct 2014, at 23:42, Mark Feblowitz <MarkFeblowitz@comcast.net> wrote:
> 
>> 
>> Emanuele, and @RSP Community - 
>> 
>> I have some questions, based on a specific item I am trying to implement in my work at IBM Research.
>> 
>> My C-SPARQL questions are: 
>> 
>> Is there a means of performing simple filtration on each triple in a stream? (I’m thinking, PHYSICAL window of size 1)
> 
> Triple-based windows are buggy.
> 
> 
>> Can a query to FILTER a stream be composed with a subquery that is also windowed (using different windowing criteria)?
> 
> No. As in SPARQL, the FROM clause cannot appear in subqueries. You can achieve the same result by composing queries in a query network: put the subquery upstream of the query that would contain it.
> 
>> Using the C-SPARQL engine, how does one take the output of a REGISTERed STREAM query and use it as input to another? Is there a special URL? Or must the user-defined result processor post a new stream, identifying a new URL?
> 
> 
> You need to use the RDFStreamFormatter. See slides 9 and 10 in http://www.streamreasoning.org/slides/2013/04/corso_dott_ifp_c-sparql.pdf
> 
> You can also check out the COMPOSABILITY test in https://github.com/streamreasoning/CSPARQL-ReadyToGoPack
> 
> 
>> For GROUPed windowed processing, is it correct to assume that the effect is as if there is a window per group?
> 
> The window creates the dataset that you evaluate the group on.
> 
>> How in general can one express a case where only one solution is emitted per group, in a GROUP BY … HAVING query?
> 
> I’m not sure this is possible in SPARQL. If it is not possible in SPARQL it is not possible in C-SPARQL. 
> 
> Are you trying to implement a partitioned window? Please check out this link http://esper.codehaus.org/tutorials/solution_patterns/solution_patterns.html#expiry-3
> 
> C-SPARQL does not support this clause, but indeed it is very useful in many cases.
> 
> 
>> That’s one solution only, not one solution per processing pass.
>> 
>> Here’s the simplified scenario: 
>> 
>> Examine a stream of arbitrary RDF triples, looking for Infectees — Persons infected by a particular virus. These infectees are grouped by Region. 
>> A triple set is to be CONSTRUCTed when there are 0 < N < threshold infectees in a given region (“SomeInfectees” alert)
>> Another triple set is to be CONSTRUCTed when N >= threshold (“PossibleEpidemic” alert) 
> 
> This appears doable in C-SPARQL. You need two queries registered on the same stream:
> 
> 	
> CONSTRUCT { [ ] someInfecteeAlertIn ?region }
> FROM STREAM …
> WHERE {
>   ?infectee a Infected .
>   ?infectee livesIn ?region
> }  GROUP BY ?region
>   HAVING (COUNT(?infectee) > 0 && COUNT(?infectee) < %%threshold%%)
> 
> REGISTER STREAM PossibleEpidemic AS
> CONSTRUCT { [ ] PossibleEpidemicIn ?region }
> FROM STREAM …
> WHERE {
>   ?infectee a Infected .
>   ?infectee livesIn ?region
> }  GROUP BY ?region
>   HAVING (COUNT(?infectee) >= %%threshold%%)
> 
> As far as I know, the IF function (http://www.w3.org/TR/sparql11-query/#func-if) cannot be used in a CONSTRUCT clause; otherwise it would have been possible to write just one query.
> 
>> The goal here is to process an arbitrary stream of triples and to emit just a single alert - ever - per group.
>> 
>> So, there are two issues here: 
>> filtering a stream and then applying windowed (?) match criteria for each expression for groups in the filtered result
>> ensuring that only one answer per expression per group results in a CONSTRUCT
> 
> Let me try to understand. The following query should pick up exactly one infectee per region with a PossibleEpidemic alert.
> 
> REGISTER QUERY infecteeInRegionWithPossibleEpidemic AS
> SELECT ?region ?infectee
> FROM STREAM …
> WHERE {
>   { SELECT ?region
>     WHERE {
>       ?infectee a Infected .
>       ?infectee livesIn ?region
>     } GROUP BY ?region
>     HAVING (COUNT(?infectee) >= %%threshold%%) }
>   { SELECT ?region
>     WHERE {
>       ?infectee a Infected .
>       ?infectee livesIn ?region } LIMIT 1 }
> }
> 
> Is this what you want?
> 
>> 
>> As for item #1, I’ve tried a few things and now understand that I need to view this as a filter part and an aggregate or join part. I am thinking about these ways to handle this:
>> 
>> register a C-SPARQL stream query (window size = 1, slide = 1) to perform the filtering, feeding it to another (window size and slide TBD); the latter query notifies by emitting a CONSTRUCTed result.
>> 
>> arbitrary triple stream --> [ PHYSICAL WINDOWed FILTER ] --> filtered triple stream --> [ PHYSICAL WINDOWed JOIN and/or AGGREGATE ] --> CONSTRUCTed triple stream
>> or 
>> compose a query whereby windowing is performed for the initial filtering and a subquery with separate windowing is performed for the aggregation/join (is this possible?)
>> 
>> In either case, the first part filters, e.g., down to a stream consisting only of infectees and the second part groups the infectees by region, counts them and emits the single respective notification. (SomeInfecteesInRegion and PotentialEpidemic). Thus, the questions above about composition of queries.
> 
> I believe I have already given you too many options. I will answer this once I have read your replies.
> 
>> As for item 2 above, the obvious question: Will "LIMIT 1" limit the query to being matched one time only (per group) or does it mean that only one CONSTRUCT will be emitted each time the processing criteria are met (that is, when the window closes)?
> 
> See my query with two subqueries above.
> 
>> If it’s the former, I’m done. If the latter, I don’t see a way to my goal.
> 
> I may have misunderstood :-(
> 
>> 
>> It’s easy to think of this procedurally: look at non-finite data until an expression is matched and stop there. Or in a stream-ish approach, match the expression and deduplicate the output stream. Or, less cleanly, asserting the notifications to an RDF store and then including in the join expression a check for a prior alert before emitting one? Only the last one seems obvious with C-SPARQL (albeit “dirty”). 
>> 
>> Is there a clean way of doing this?
> 
> Let’s see. I’m curious too. Indeed, this may require extending the language, and this is something I’m looking for :-)
> 
> Best Regards,
> 
> Emanuele
>
Received on Tuesday, 14 October 2014 20:18:38 UTC