C-SPARQL Engine and handling of REGISTER, nested queries from Mark Feblowitz on 2014-10-13 (public-rsp@w3.org from October 2014)

From: Mark Feblowitz <MarkFeblowitz@comcast.net>
Date: Mon, 13 Oct 2014 17:42:28 -0400
To: Emanuele Della Valle <emanuele.dellavalle@polimi.it>
Cc: Andy Seaborne <andy@apache.org>, Marco Balduini <marco.balduini@polimi.it>, "public-rsp@w3.org" <public-rsp@w3.org>
Message-Id: <2F292241-D382-4332-A3F1-FBFA96FF163D@comcast.net>

Emanuele, and @RSP Community -

I have some questions, based on a specific item I am trying to implement in my work at IBM Research.

My C-SPARQL questions are:

1. Is there a means of performing simple filtration on each triple in a stream? (I’m thinking, PHYSICAL window of size 1)
2. Can a query to FILTER a stream be composed with a subquery that is also windowed (using different windowing criteria)?
3. Using the C-SPARQL engine, how does one take the output of a REGISTERed STREAM query and use it as input to another? Is there a special URL? Or must the user-defined result processor post a new stream, identifying a new URL?
4. For GROUPed windowed processing, is it correct to assume that the effect is as if there is a window per group?
5. How in general can one express a case where only one solution is emitted per group, in a GROUP BY … HAVING query? That’s one solution only, not one solution per processing pass.

Here’s the simplified scenario:

Examine a stream of arbitrary RDF triples, looking for Infectees — Persons infected by a particular virus. These infectees are grouped by Region.
A triple set is to be CONSTRUCTed when there are 0 < N < threshold infectees in a given region (“SomeInfectees” alert)
Another triple set is to be CONSTRUCTed when N >= threshold (“PossibleEpidemic” alert)

The goal here is to process an arbitrary stream of triples and to emit just a single alert - ever - per group.

So, there are two issues here:
1. filtering a stream and then applying windowed (?) match criteria for each expression for groups in the filtered result
2. ensuring that only one answer per expression per group results in a CONSTRUCT

As for item #1, I’ve tried a few things and now understand that I need to view this as a filter part and an aggregate or join part. I am thinking about these ways to handle it:

register a C-SPARQL stream query (window size = 1, slide = 1) to perform the filtering, feeding its output to another query (window size and slide TBD); the latter query notifies by emitting a CONSTRUCTed result.

arbitrary triple stream --> [ PHYSICAL WINDOWed FILTER ] --> filtered triple stream --> [ PHYSICAL WINDOWed JOIN and/or AGGREGATE ] --> CONSTRUCTed triple stream
or
compose a query whereby windowing is performed for the initial filtering and a subquery with separate windowing is performed for the aggregation/join (is this possible?)

In either case, the first part filters, e.g., down to a stream consisting only of infectees, and the second part groups the infectees by region, counts them, and emits the single respective notification (SomeInfectees or PossibleEpidemic). Hence the questions above about composition of queries.
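
To make the two-stage idea concrete, here is a minimal sketch in plain Python rather than C-SPARQL. It does not use the C-SPARQL engine or its API. The triple layout, the predicate ex:infectedIn, the function names, and the count-based window that simply drops a trailing partial window are all assumptions made for illustration. Note that it emits alerts once per window, which is exactly the "one solution per processing pass" behaviour that the second issue asks how to avoid.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Set, Tuple

Triple = Tuple[str, str, str]
Alert = Tuple[str, str, int]  # (region, alert kind, infectee count)

# Hypothetical predicate: (person, ex:infectedIn, region).
INFECTED_IN = "ex:infectedIn"


def filter_infectees(triples: Iterable[Triple]) -> Iterator[Triple]:
    """Stage 1: per-triple filter, like a window of size 1 and slide 1."""
    for triple in triples:
        if triple[1] == INFECTED_IN:
            yield triple


def count_windows(triples: Iterable[Triple], size: int) -> Iterator[List[Triple]]:
    """Stage 2a: count-based tumbling windows over the filtered stream.

    A trailing partial window is never emitted in this sketch.
    """
    window: List[Triple] = []
    for triple in triples:
        window.append(triple)
        if len(window) == size:
            yield window
            window = []


def alerts_per_window(windows: Iterable[List[Triple]], threshold: int) -> Iterator[Alert]:
    """Stage 2b: group by region, count distinct infectees, classify."""
    for window in windows:
        infectees: Dict[str, Set[str]] = defaultdict(set)
        for person, _, region in window:
            infectees[region].add(person)
        for region in sorted(infectees):
            n = len(infectees[region])  # always >= 1 here
            kind = "PossibleEpidemic" if n >= threshold else "SomeInfectees"
            yield (region, kind, n)
```

Chaining the stages as alerts_per_window(count_windows(filter_infectees(stream), size), threshold) mirrors the first option; a region that stays above the threshold is reported again in every window.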

As for item 2 above, the obvious question: Will "LIMIT 1" limit the query to being matched one time only (per group) or does it mean that only one CONSTRUCT will be emitted each time the processing criteria are met (that is, when the window closes)?

If it’s the former, I’m done. If the latter, I don’t see a way to my goal.

It’s easy to think of this procedurally: look at non-finite data until an expression is matched and stop there. Or, in a stream-ish approach, match the expression and deduplicate the output stream. Or, less cleanly, assert the notifications to an RDF store and then include in the join expression a check for a prior alert before emitting one. Only the last one seems obvious with C-SPARQL (albeit “dirty”).
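
The "match and deduplicate" variant could look roughly like the following Python sketch, placed downstream of the query's result stream. It is not part of C-SPARQL; the function name emit_once and the (region, alert kind) tuple layout are hypothetical. The set of seen alerts plays the same role as the prior-alert check against an RDF store, only held in memory.

```python
from typing import Iterable, Iterator, Set, Tuple

Alert = Tuple[str, str]  # (region, alert kind), an assumed layout


def emit_once(alerts: Iterable[Alert]) -> Iterator[Alert]:
    """Pass each (region, kind) alert through the first time only."""
    seen: Set[Alert] = set()
    for alert in alerts:
        if alert not in seen:
            seen.add(alert)  # remember it so later repeats are dropped
            yield alert
```

The cost of this approach is state that grows with the number of distinct (region, kind) pairs, since an alert must be remembered forever to guarantee it is never emitted again.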

Is there a clean way of doing this?

Thanks,

Mark

Received on Monday, 13 October 2014 21:43:02 UTC