Windowing Wishlist (was Re: C-SPARQL Engine and handling of REGISTER, nested queries) from Mark Feblowitz on 2014-11-03 (public-rsp@w3.org from November 2014)

From: Mark Feblowitz <MarkFeblowitz@comcast.net>
Date: Mon, 3 Nov 2014 12:13:47 -0500
To: Emanuele Della Valle <emanuele.dellavalle@polimi.it>
Cc: Andy Seaborne <andy@apache.org>, Marco Balduini <marco.balduini@polimi.it>, "public-rsp@w3.org" <public-rsp@w3.org>
Message-Id: <96A3D0CE-2D01-455F-ABD3-EA517E47CB34@comcast.net>
Emanuele -

I finally found a way to achieve what I was trying to, given the capabilities in the May 2013 C-SPARQL release. It is not pretty, but it does exactly what I had wanted.

Specifically, here’s what I have done:

I needed to notify *once only* when the count per group...
was within some range ( 0 < X.v <= T)
went above the threshold T of the range (X.v > T)

The only way I could figure out how to achieve this was a bit tricky: I created a loopback from the output of a query and tested for the absence of the notification. Here’s what it looks like, as a picture:



Here, the teal boxes represent the queries and the white boxes represent labels on the streams (i.e., the emitted triples). If you’d like, I can send you the C-SPARQL queries.

Note the pre-filtration, via the X Filter Query, and the two separate tests on that filtered stream (X where…). That’s for efficiency, especially as it significantly reduces the set of triples going into each downstream query. Thanks for the pointer as to how to do this.

In keeping with your recommendation to avoid count-based (“triple-based”) windows, this was done with large time-based windows and very small slides as processing triggers. The tiny slide allowed for early recognition of a threshold crossing, but was accompanied by an unpleasant side effect: there were a lot of repeat notifications. Thus the loop-back and test. There was also the risk of missing the some early notifications (if slides happens before the count is achieved).

If there is a better way to have done this, given the expressivity of May ’13 C-SPARQL or a subsequent version, I’d be more than happy to try it out.

So, on to my wish list:

First, it seems that the biggest problem is that the processing criteria (when to do the match) seem currently to be inseparable from the ejection criteria  (when to eject triples from the window). 

Without the “COMPUTED EVERY” or some other separation of processing criteria, it seems that one must use the slide or the tumble as the processing criteria. With COMPUTED EVERY being implemented, at least one might have a shot at being able to match the first instance of, e.g., a threshold crossing, before triple ejection occurs.

What would have worked better for my case is being able to specify general processing criteria separate from specifying ejection criteria, i.e., to test every second within tumbling one hour window. In general, it can be helpful to one of a number of different ways for triple arrival to trigger processing, independent of when triples are being evicted from the window.

There are other examples of various combinations of processing and eviction criteria: the ability to process every minute a snapshot average of sales receipts for one hour tumbling windows of sales (that’s different to having a 10 minute slide in a one hour window). Or perhaps you’d want to compute the average of each set of 20 triples in a tumbling one hour period. The point here is that there can be interesting combinations of count-based and time based triggering and/or eviction. Some combinations only make sense when you have the need to use them :-)  

Also quite helpful would be the prospect of being able to invoke a policy of one output per match. So, rather than having each invocation of processing result in (re)producing  every match, instead have some means of sensibly limiting matches. In the example above, it would make sense to have the arrival of a triple (specifically one that is included in the query) trigger the processing of the query against the window contents, but only in the context of that triple (so, “fix" that triple in the query and then match around it); the next triple would be different, and only emit a result if it, too, participated in the match.

I hope this makes some sense, and that it corresponds to some vocabulary that’s familiar to you. 

Much of this separation of processing criteria from ejection criteria derives from experience in using Streams’ windowing capabilities. The semantics of windowed stream operators (aggregate, join, …) address various combinations of applying each.

I’m looking around for good references for these windowing semantics. The most concise and technical (and least hyped) I’ve found so far is IBM InfoSphere Streams Version 3.2.1 Window handling <http://www-01.ibm.com/support/knowledgecenter/SSCRJU_3.2.1/com.ibm.swg.im.infosphere.streams.dev.doc/doc/windowhandling.html>  It’s written in the context of people writing their own Windowed operators (in C++) to extend SPL (Streams Processing Language), but it captures the semantics of all of Streams’ built-in windowed operators.

I hope this isn’t off-point for this forum. And I hope that at least some of this has already been addressed in your discussions.

Thanks for your indulgence and, again, for your helpful suggestions.

Regards,

Mark
 

> On Oct 14, 2014, at 1:24 AM, Emanuele Della Valle <emanuele.dellavalle@polimi.it> wrote:
> 
> Dear Mark,
> 
> On 13 Oct 2014, at 23:42, Mark Feblowitz <MarkFeblowitz@comcast.net <mailto:MarkFeblowitz@comcast.net>> wrote:
> 
>> 
>> Emanuele, and @RSP Community - 
>> 
>> I have some questions, based on a specific item I am trying to implement in my work at IBM Research.
>> 
>> My C-SPARQL questions are: 
>> 
>> Is there a means of performing simple filtration on each triple in a stream? (I’m thinking, PHYSICAL window of size 1)
> 
> Triple based windows are a buggy. 
> 
> 
>> Can a query to FILTER a stream be composed with a subquery that is also windowed (using different windowing criteria)?
> 
> no, as in SPARQL the FROM clause cannot appear in subqueries. You can achieve the same result by composing queries in a query network. You put the sub-query upstream to the query that contains it.
> 
>> Using the C-SPARQL engine, how does one take the output of a REGISTERed STREAM query and use it as input to another? Is there a special URL? Or must the user-defined result processor post a new stream, identifying a new URL?
> 
> 
> You need to use the RDFStreamFormatter. See slide 9 and 10 in  http://www.streamreasoning.org/slides/2013/04/corso_dott_ifp_c-sparql.pdf <http://www.streamreasoning.org/slides/2013/04/corso_dott_ifp_c-sparql.pdf> 
> 
> You can also check out the COMPOSABILITY test in the https://github.com/streamreasoning/CSPARQL-ReadyToGoPack <https://github.com/streamreasoning/CSPARQL-ReadyToGoPack>
> 
> 
>> For GROUPed windowed processing, is it correct to assume that the effect is as if there is a window per group?
> 
> The window creates the dataset that you evaluate the group on.
> 
>> How in general can one express a case where only one solution is emitted per group, in a GROUP BY … HAVING query?
> 
> I’m not sure this is possible in SPARQL. If it is not possible in SPARQL it is not possible in C-SPARQL. 
> 
> Are you trying to implement a partitioned window? Please check out this link http://esper.codehaus.org/tutorials/solution_patterns/solution_patterns.html#expiry-3 <http://esper.codehaus.org/tutorials/solution_patterns/solution_patterns.html#expiry-3>
> 
> C-SPARQL does not support this clause, but indeed it is very useful in many cases.
> 
> 
>> That’s one solution only, not one solution per processing pass.
>> 
>> Here’s the simplified scenario: 
>> 
>> Examine a stream of arbitrary RDF triples, looking for Infectees — Persons infected by a particular virus. These infectees are grouped by Region. 
>> A triple set  is to be CONSTRUCTed  when there are  0 < N < threshold  infectees in a given region (“SomeInfectees” alert)
>> Another triple set is to be CONSTRUCTed when N >= threshold (“PossibleEpidemic” alert) 
> 
> this appears doable in C-SPARQL. You need to queries registered on the some stream
> 
> REGISTER STREAM someInfecteeAlert AS
> CONSTRUCT { [ ] someInfecteeAlertIn ?region }
> FROM STREAM …
> WHERE {
>   ?infectee a Infected .
>   ?infectee livesIn ?region
> }  GROUP BY ?region
>   HAVING (COUNT(?infectee) > 0 && COUNT(?infectee) < %%threshold%%)
> 
> REGISTER STREAM PossibleEpidemic AS
> CONSTRUCT { [ ] PossibleEpidemicIn ?region }
> FROM STREAM …
> WHERE {
>   ?infectee a Infected .
>   ?infectee livesIn ?region
> }  GROUP BY ?region
>   HAVING (COUNT(?infectee) > %%threshold%%)
> 
> As far as I know the IF function (http://www.w3.org/TR/sparql11-query/#func-if <http://www.w3.org/TR/sparql11-query/#func-if>) cannot be used in a CONSTRUCT clause, otherwise it could have been possible to write just one query.
> 
>> The goal here is to process an arbitrary stream of triples and to emit just a single alert - ever - per group.
>> 
>> So, there are two issues here: 
>> filtering a stream and then applying windowed (?) match criteria for each expression for groups in the filtered result
>> ensuring that only one answer per expression per group results in a CONSTRUCT
> 
> Let me try to understand. The following query should pickup exactly one infectee per region with Possible Epidemic alert. 
> 
> REGISTER QUERY infecteeInRegionWithPossibleEpidemic AS
> SELECT ?region ?infectee
> FROM STREAM …
> WHERE {
>   { SELECT ?region {
>   WHERE {
>   ?infectee a Infected .
>   ?infectee livesIn ?region
>   }  GROUP BY ?region
>   HAVING (COUNT(?infectee) > %%threshold%%) }
>   { SELECT ?region {
>   WHERE {
>   ?infectee a Infected .
>   ?infectee livesIn ?region } LIMIT 1 }
> }
> 
> is this what you want?
> 
>> 
>> As for item #1, I’ve tried a few things and now understand that I need to view this as a filter part and a aggregate or join part. I am thinking about these ways to handle this: 
>> 
>> register a C-SPARQL stream query (window size = 1, slide = 1) to perform the filtering, feeding it to a another (window size and slide TBD);  the latter query notifies by emitting a CONSTRUCTed result.
>> 
>> arbitrary triple  stream —> [ PHYSICAL WINDOWed FILTER ] —>  filtered triple stream —>  [ PHYSICAL WINDOWED JOIN and or AGGREGATE  ] —> CONSTRUCTed triple stream
>> or 
>> compose a query whereby windowing is performed for the initial filtering and a subquery with separate windowing is performed for the aggregation/join (is this possible?)
>> 
>> In either case, the first part filters, e.g., down to a stream consisting only of infectees and the second part groups the infectees by region, counts them and emits the single respective notification. (SomeInfecteesInRegion and PotentialEpidemic). Thus, the questions above about composition of queries.
> 
> I believe I gave you already too many option. I leave this to answer once I read your answers.
> 
>> As for item 2 above, the obvious question: Will "LIMIT 1" limit the query to being matched one time only (per group) or does it mean that only one CONSTRUCT will be emitted each time the processing criteria are met (that is, when the window closes)?
> 
> See my query with two subqueries above.
> 
>> If it’s the former, I’m done. If the latter, I don’t see a way to my goal.
> 
> I may have misunderstood :-(
> 
>> 
>> It’s easy to think of this procedurally: look at non-finite data until an expression is matched and stop there. Or in a stream-ish approach, match the expression and deduplicate the output stream. Or, less cleanly, asserting the notifications to an RDF store and then including in the join expression a check for a prior alert before emitting one? Only the last one seems obvious with C-SPARQL (albeit “dirty”). 
>> 
>> Is there a clean way of doing this?
> 
> Let’s see. I’m curious too. Indeed this may require to extend the language and this is something I’m looking for :-)
> 
> Best Regards,
> 
> Emanuele
>
Received on Monday, 3 November 2014 17:14:22 UTC