Re: Windowing Wishlist

> On Nov 14, 2014, at 8:15 AM, Gray, Alasdair <A.J.G.Gray@hw.ac.uk> wrote:
> 
> 
> On 3 Nov 2014, at 17:31, Mark Feblowitz <MarkFeblowitz@comcast.net> wrote:
> [snip]
>> 
>> So, on to my wish list:
>> 
>> First, it seems that the biggest problem is that the processing criteria (when to do the match) seem currently to be inseparable from the ejection criteria  (when to eject triples from the window). 
>> 
>> Without the “COMPUTED EVERY” or some other separation of processing criteria, it seems that one must use the slide or the tumble as the processing criteria. With COMPUTED EVERY being implemented, at least one might have a shot at being able to match the first instance of, e.g., a threshold crossing, before triple ejection occurs.
>> 
>> What would have worked better for my case is being able to specify general processing criteria separately from specifying ejection criteria, i.e., to test every second within a tumbling one hour window. In general, it can be helpful to have any one of a number of different ways for triple arrival to trigger processing, independent of when triples are being evicted from the window.
>> 
>> There are other examples of various combinations of processing and eviction criteria: the ability to compute, every minute, a snapshot average of sales receipts over a one hour tumbling window of sales (which is different from having a 10 minute slide on a one hour window).
> 
> Could you illustrate this with a concrete example?

I thought that was concrete :-)

I’ll try to dig up some more concrete examples. (Or is it “some examples that are more concrete"?)

I’ve sent along a couple of descriptions of windowing in Streams. All of these have been implemented and all have been put to use in a variety of applications (some common, some obscure).

 It’s a fairly rich and comprehensive set of combinations of count-based and time-based trigger/processing criteria and eviction criteria. The applications are fairly diverse, allowing one to choose the combinations that fit the need. 

One can, for example, compute an average of prices for each 100 trades (count-based processing) or every 10 minutes during a 24 hour period (time-based processing) or every 100 trades within a 24 hour period. And you can separately determine whether to use a sliding window (during the last 24 hours) or a tumbling window (every day). One thing I’m not so sure about is how one would align that 24 hour period with clock time, e.g., starting from the opening time of the Madrid Stock Exchange.
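Roughly, in SPL the sliding variant of that example looks like the sketch below (reconstructed from memory, so the exact syntax may be a little off; the stream and attribute names are invented). The window clause names an eviction policy (keep 24 hours of trades) and a trigger policy (recompute every 100 arrivals), and the two vary independently:

// Sliding 24-hour window over trades, recomputed every 100 arrivals,
// grouped by symbol. Swapping the window clause for
// "tumbling, time(86400.0)" gives the every-day (tumbling) variant.
stream<rstring symbol, float64 avgPrice> AvgPrices = Aggregate(Trades) {
    window Trades : sliding, time(86400.0), count(100);
    param  groupBy : symbol;
    output AvgPrices : avgPrice = Average(price);
}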

I’m thinking that the most important challenge here is to identify an adequate subset of (combinations of) triggering/processing criteria and ejection criteria that would be suitable both to the platform under consideration (e.g., C-SPARQL or its offspring) and to the range of applications envisioned. Experience will advise, as has mine with C-SPARQL and the specific challenges I encountered.

There are some very clever applications of various types of windowing. One is a streaming lookup capability built as a windowed Join: accumulate a set of tuples in one time-based window; the other window is of size zero. Processing is triggered by the arrival of a tuple in that zero-sized window: that tuple is evaluated against the contents of the other window, and a result is emitted for each match. It is, essentially, a lookup capability for a “table” of a (potentially) moving set of values. The other (non-zero sized) window can be, e.g., a 24 hour window, or even a very large time-based or count-based window whose only intent is to accumulate values.
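In SPL that idiom looks roughly like the following (again a from-memory sketch with invented names): the reference data accumulates in a 24-hour sliding window on one port, the lookup port has a count(0) window, and each arriving lookup tuple is matched against the current contents of the reference window and then discarded.

// Streaming lookup as a windowed Join: Reference tuples are retained for
// 24 hours; each Lookup tuple triggers a match and then falls out of its
// own zero-size window, so it never accumulates.
stream<rstring key, rstring value> Matches = Join(Reference as R; Lookups as L) {
    window R : sliding, time(86400.0);
           L : sliding, count(0);
    param  match : R.key == L.key;
    output Matches : key = L.key, value = R.value;
}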

The point here is that if you have underlying mechanisms to trigger processing for time or for count, and similar mechanisms for evicting based on time or count, you have the ability to leave it to the statistician/analyst to pick what’s needed. 

In Streams/SPL there’s an odd one that I’ve never fully wrapped my head around, but that seems desirable to some analysts: the delta-based window. It deals with increasing values and allows one to process when the value crosses some delta. One example I can think of is when the time aspect of a window comes not from system time but from time as carried in the data itself. One thing I can imagine would be values being sent in batches from a satellite, but with timestamps that span a day or a week (and/or that reflect measurements that were taken much earlier, depending on the distance and the speed of light). If the desire is to have a time-based window that reflects the actual time of occurrence, one can use, e.g., a delta on the timestamp in the data that reflects, say, 24 hours. So, for example, when the delta reaches 24 hours, compute the average speed of the satellite over that span.
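A rough SPL sketch of that satellite example (from memory, names invented): the delta policy is keyed to a ts attribute carried in the data, so the window tumbles whenever the data’s own clock has advanced by 24 hours, regardless of when the tuples actually arrive.

// Tumbling window driven by the timestamp in the data itself: the window
// flushes when a tuple arrives whose ts exceeds the oldest tuple's ts by
// more than 86400 seconds (one "data day"), then the average is emitted.
stream<float64 avgSpeed> DailySpeed = Aggregate(Telemetry) {
    window Telemetry : tumbling, delta(ts, 86400.0);
    output DailySpeed : avgSpeed = Average(speed);
}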

Being a software person and not a data scientist, I can’t always imagine all of the uses for these various combinations of processing criteria and eviction criteria - I just know that very many of these are of value to the statisticians. 

> 
>> Or perhaps you’d want to compute the average of each set of 20 triples in a tumbling one hour period. The point here is that there can be interesting combinations of count-based and time based triggering and/or eviction. Some combinations only make sense when you have the need to use them :-)  
> 
> Windows based on the count of triples are very problematic, as they give non-deterministic behaviour dependent upon the order and rate of the stream.

This might be true, unless your underlying platform has mechanisms for dealing with that. As I mentioned above, if your system can count the triples as they arrive and there’s a threshold to be considered...

>> 
>> Also quite helpful would be the prospect of being able to invoke a policy of one output per match. So, rather than having each invocation of processing result in (re)producing every match, instead have some means of sensibly limiting matches. In the example above, it would make sense to have the arrival of a triple (specifically one that is included in the query) trigger the processing of the query against the window contents, but only in the context of that triple (so, “fix” that triple in the query and then match around it); the next triple would be different, and would only emit a result if it, too, participated in the match.
>> 
>> I hope this makes some sense, and that it corresponds to some vocabulary that’s familiar to you. 
>> 
>> Much of this separation of processing criteria from ejection criteria derives from experience in using Streams’ windowing capabilities. The semantics of windowed stream operators (aggregate, join, …) address various combinations of applying each.
>> 
>> I’m looking around for good references for these windowing semantics. The most concise and technical (and least hyped) I’ve found so far is IBM InfoSphere Streams Version 3.2.1 Window handling <http://www-01.ibm.com/support/knowledgecenter/SSCRJU_3.2.1/com.ibm.swg.im.infosphere.streams.dev.doc/doc/windowhandling.html>. It’s written in the context of people writing their own windowed operators (in C++) to extend SPL (Streams Processing Language), but it captures the semantics of all of Streams’ built-in windowed operators.
>> 
>> I hope this isn’t off-point for this forum. And I hope that at least some of this has already been addressed in your discussions.
> 
> These sorts of detailed requirements for windows are very much on topic and up for discussion.
> 

Thank you. Having been involved in stream computing since 2003 and in the semantic communities for even longer, I am delighted to see the two being considered together. There’s a lot of literature and experience we can draw on, and there are platforms that can provide some insight into the options that carry over from one to the other.

I’d be honored to be included in these discussions.

Best,

Mark


> Best regards
> 
> Alasdair
>> 
>> Thanks for your indulgence and, again, for your helpful suggestions.
>> 
>> Regards,
>> 
>> Mark
>>  
>> 
>>> On Oct 14, 2014, at 1:24 AM, Emanuele Della Valle <emanuele.dellavalle@polimi.it> wrote:
>>> 
>>> Dear Mark,
>>> 
>>> On 13 Oct 2014, at 23:42, Mark Feblowitz <MarkFeblowitz@comcast.net> wrote:
>>> 
>>>> 
>>>> Emanuele, and @RSP Community - 
>>>> 
>>>> I have some questions, based on a specific item I am trying to implement in my work at IBM Research.
>>>> 
>>>> My C-SPARQL questions are: 
>>>> 
>>>> Is there a means of performing simple filtration on each triple in a stream? (I’m thinking, PHYSICAL window of size 1)
>>> 
>>> Triple-based windows are buggy.
>>> 
>>> 
>>>> Can a query to FILTER a stream be composed with a subquery that is also windowed (using different windowing criteria)?
>>> 
>>> No, as in SPARQL the FROM clause cannot appear in subqueries. You can achieve the same result by composing queries in a query network: you put the sub-query upstream of the query that contains it.
>>> 
>>>> Using the C-SPARQL engine, how does one take the output of a REGISTERed STREAM query and use it as input to another? Is there a special URL? Or must the user-defined result processor post a new stream, identifying a new URL?
>>> 
>>> 
>>> You need to use the RDFStreamFormatter. See slides 9 and 10 in http://www.streamreasoning.org/slides/2013/04/corso_dott_ifp_c-sparql.pdf
>>> 
>>> You can also check out the COMPOSABILITY test in https://github.com/streamreasoning/CSPARQL-ReadyToGoPack
>>> 
>>> 
>>>> For GROUPed windowed processing, is it correct to assume that the effect is as if there is a window per group?
>>> 
>>> The window creates the dataset that you evaluate the group on.
>>> 
>>>> How in general can one express a case where only one solution is emitted per group, in a GROUP BY … HAVING query?
>>> 
>>> I’m not sure this is possible in SPARQL. If it is not possible in SPARQL it is not possible in C-SPARQL. 
>>> 
>>> Are you trying to implement a partitioned window? Please check out this link http://esper.codehaus.org/tutorials/solution_patterns/solution_patterns.html#expiry-3
>>> 
>>> C-SPARQL does not support this clause, but indeed it is very useful in many cases.
>>> 
>>> 
>>>> That’s one solution only, not one solution per processing pass.
>>>> 
>>>> Here’s the simplified scenario: 
>>>> 
>>>> Examine a stream of arbitrary RDF triples, looking for Infectees — Persons infected by a particular virus. These infectees are grouped by Region. 
>>>> A triple set  is to be CONSTRUCTed  when there are  0 < N < threshold  infectees in a given region (“SomeInfectees” alert)
>>>> Another triple set is to be CONSTRUCTed when N >= threshold (“PossibleEpidemic” alert) 
>>> 
>>> This appears doable in C-SPARQL. You need two queries registered on the same stream:
>>> 
>>> REGISTER STREAM someInfecteeAlert AS
>>> CONSTRUCT { [ ] someInfecteeAlertIn ?region }
>>> FROM STREAM …
>>> WHERE {
>>>   ?infectee a Infected .
>>>   ?infectee livesIn ?region
>>> }  GROUP BY ?region
>>>   HAVING (COUNT(?infectee) > 0 && COUNT(?infectee) < %%threshold%%)
>>> 
>>> REGISTER STREAM PossibleEpidemic AS
>>> CONSTRUCT { [ ] PossibleEpidemicIn ?region }
>>> FROM STREAM …
>>> WHERE {
>>>   ?infectee a Infected .
>>>   ?infectee livesIn ?region
>>> }  GROUP BY ?region
>>>   HAVING (COUNT(?infectee) > %%threshold%%)
>>> 
>>> As far as I know the IF function (http://www.w3.org/TR/sparql11-query/#func-if) cannot be used in a CONSTRUCT clause, otherwise it would have been possible to write just one query.
>>> 
>>>> The goal here is to process an arbitrary stream of triples and to emit just a single alert - ever - per group.
>>>> 
>>>> So, there are two issues here: 
>>>> filtering a stream and then applying windowed (?) match criteria for each expression for groups in the filtered result
>>>> ensuring that only one answer per expression per group results in a CONSTRUCT
>>> 
>>> Let me try to understand. The following query should pick up exactly one infectee per region with a Possible Epidemic alert.
>>> 
>>> REGISTER QUERY infecteeInRegionWithPossibleEpidemic AS
>>> SELECT ?region ?infectee
>>> FROM STREAM …
>>> WHERE {
>>>   { SELECT ?region
>>>     WHERE {
>>>       ?infectee a Infected .
>>>       ?infectee livesIn ?region
>>>     } GROUP BY ?region
>>>     HAVING (COUNT(?infectee) > %%threshold%%) }
>>>   { SELECT ?region ?infectee
>>>     WHERE {
>>>       ?infectee a Infected .
>>>       ?infectee livesIn ?region
>>>     } LIMIT 1 }
>>> }
>>> 
>>> is this what you want?
>>> 
>>>> 
>>>> As for item #1, I’ve tried a few things and now understand that I need to view this as a filter part and a aggregate or join part. I am thinking about these ways to handle this: 
>>>> 
>>>> register a C-SPARQL stream query (window size = 1, slide = 1) to perform the filtering, feeding it to another (window size and slide TBD); the latter query notifies by emitting a CONSTRUCTed result.
>>>> 
>>>> arbitrary triple  stream —> [ PHYSICAL WINDOWed FILTER ] —>  filtered triple stream —>  [ PHYSICAL WINDOWED JOIN and or AGGREGATE  ] —> CONSTRUCTed triple stream
>>>> or 
>>>> compose a query whereby windowing is performed for the initial filtering and a subquery with separate windowing is performed for the aggregation/join (is this possible?)
>>>> 
>>>> In either case, the first part filters, e.g., down to a stream consisting only of infectees, and the second part groups the infectees by region, counts them, and emits the single respective notification (SomeInfecteesInRegion or PotentialEpidemic). Thus the questions above about composition of queries.
>>> 
>>> I believe I have already given you too many options. I will leave this one unanswered until I read your replies.
>>> 
>>>> As for item 2 above, the obvious question: Will "LIMIT 1" limit the query to being matched one time only (per group) or does it mean that only one CONSTRUCT will be emitted each time the processing criteria are met (that is, when the window closes)?
>>> 
>>> See my query with two subqueries above.
>>> 
>>>> If it’s the former, I’m done. If the latter, I don’t see a way to my goal.
>>> 
>>> I may have misunderstood :-(
>>> 
>>>> 
>>>> It’s easy to think of this procedurally: look at non-finite data until an expression is matched and stop there. Or, in a stream-ish approach, match the expression and deduplicate the output stream. Or, less cleanly, assert the notifications to an RDF store and then include in the join expression a check for a prior alert before emitting one. Only the last one seems obvious with C-SPARQL (albeit “dirty”).
>>>> 
>>>> Is there a clean way of doing this?
>>> 
>>> Let’s see. I’m curious too. Indeed this may require extending the language, and that is something I’m looking for :-)
>>> 
>>> Best Regards,
>>> 
>>> Emanuele
>>> 
>> 
> 
> Alasdair J G Gray
> Lecturer in Computer Science, Heriot-Watt University, UK.
> Email: A.J.G.Gray@hw.ac.uk
> Web: http://www.alasdairjggray.co.uk
> ORCID: http://orcid.org/0000-0002-5711-4872
> Telephone: +44 131 451 3429
> Twitter: @gray_alasdair
