Some thoughts on queries applied to RDF streams from Tara Athan on 2015-11-06 (public-rsp@w3.org from November 2015)

From: Tara Athan <taraathan@gmail.com>
Date: Fri, 6 Nov 2015 08:36:09 -0500
To: public-rsp@w3.org
Message-ID: <563CACC9.5000708@gmail.com>
I was unable to obtain a copy of the LARS reference, but I looked at the 
RSP-QL Semantics paper. It deals with an important special case - when 
the queries make no use of the information in the timestamp or the 
structure of named graphs - but I think it is too specialized for the 
semantic foundation of querying on RDF streams. I agree with Minh that 
the timestamped graphs obtained from applying the window function should 
not be prematurely merged.

I think we can get a lot of mileage from what's already available in 
SPARQL, which provides syntax for accessing simultaneously the default 
graph and the named graphs of RDF datasets in a variety of ways. 
(http://www.w3.org/TR/2013/REC-sparql11-query-20130321/#rdfDataset, esp. 
13.3.4 
http://www.w3.org/TR/2013/REC-sparql11-query-20130321/#namedAndDefaultGraph)

I suggest that we first investigate what can be accomplished with the 
existing capabilities of SPARQL, and only add to it when it is shown 
there is missing functionality. And once the semantics of queries for 
the general case is defined, then syntax can be developed that addresses 
both general and special cases.

First, we can define a "unified RDF dataset" of an RDF stream, so we have 
something to apply the SPARQL query to. Ideally, the unified RDF dataset 
would contain all the information of the RDF stream, so that the 
original stream could be reconstructed from it.

*****
Option 1.
There is a minor loss of information by the following definition:
1. The union of the default graphs of all timestamped graphs in the RDF 
stream is the default graph of the unified RDF dataset of the stream.
2. The union of the sets of named graphs of all timestamped graphs in 
the RDF stream is the set of named graphs of the unified RDF dataset of 
the stream.
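
A minimal Python sketch of the Option 1 definition may make the loss of 
order concrete. It assumes each timestamped graph is an RDF dataset with 
one timestamp triple in its default graph and one named graph, that graph 
names are distinct across the stream, and that the stream is finite. The 
names Dataset, unify and timestamped_graph, the ex: terms and the integer 
times are illustrative only and are not part of any specification.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dataset:
    default: frozenset  # triples (subject, predicate, object)
    named: frozenset    # pairs (graph name, frozenset of triples)

def unify(stream):
    """Option 1: union the default graphs and the sets of named graphs."""
    elements = list(stream)  # assumption: a finite stream
    default = frozenset().union(*(e.default for e in elements))
    named = frozenset().union(*(e.named for e in elements))
    return Dataset(default, named)

def timestamped_graph(name, predicate, time, triples):
    """Hypothetical helper: one timestamp triple plus one named graph."""
    return Dataset(frozenset({(name, predicate, time)}),
                   frozenset({(name, frozenset(triples))}))

g1 = timestamped_graph("ex:g1", "ex:observedAt", 1, [("ex:a", "ex:p", 10)])
g2 = timestamped_graph("ex:g2", "ex:observedAt", 1, [("ex:a", "ex:p", 20)])

# Equal timestamps: both orderings give the same unified dataset,
# so the original relative order cannot be recovered from it.
assert unify([g1, g2]) == unify([g2, g1])
```

Because both components are sets, the arrival order of g1 and g2 is not 
represented anywhere in the result.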

This unified RDF dataset can be converted back to an RDF stream, but the 
result is not necessarily unique unless we assert that
1. the relative order of timestamped graphs of different predicates is 
not significant in the abstract data model of an RDF stream.
2. the relative order of timestamped graphs with the same predicate and 
equal or incomparable temporal entities is not significant in the 
abstract data model of an RDF stream.

The output of time-based window functions on specific predicates is 
independent of this relative order information. However, the output of 
window functions that are "cross-predicate" and/or count-based is 
dependent on the relative order information.
*****
Option 2.
To also capture the original order of timestamped graphs in the RDF 
stream, additional triples could be added to the unified RDF dataset. 
For example, each timestamp triple could be reified with a blank node, 
and then have a "successor predicate" provide the ordering information 
in the default graph.
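
One possible reading of Option 2 can be sketched in Python as follows. 
The successor predicate URI, the blank node labels and the function names 
are hypothetical, the stream is assumed to be finite, and standard RDF 
reification (rdf:subject, rdf:predicate, rdf:object) is used as one way 
to attach a blank node to each timestamp triple.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SUCCESSOR = "http://example.org/successor"  # hypothetical predicate

def ordering_triples(timestamp_triples):
    """Reify each timestamp triple and chain the reifications in stream order."""
    triples = set()
    for i, (s, p, o) in enumerate(timestamp_triples):
        node = f"_:ts{i}"  # blank node standing for the i-th timestamp triple
        triples |= {(node, RDF + "subject", s),
                    (node, RDF + "predicate", p),
                    (node, RDF + "object", o)}
        if i > 0:
            triples.add((f"_:ts{i - 1}", SUCCESSOR, node))
    return triples

def recover_order(triples):
    """Rebuild the ordered list of timestamp triples from the extra triples."""
    successor = {s: o for s, p, o in triples if p == SUCCESSOR}
    parts = {}
    for s, p, o in triples:
        if p != SUCCESSOR:
            parts.setdefault(s, {})[p] = o
    # The head of the chain is the only node that is nobody's successor.
    heads = set(parts) - set(successor.values())
    node = next(iter(heads), None)
    ordered = []
    while node is not None:
        r = parts[node]
        ordered.append((r[RDF + "subject"], r[RDF + "predicate"], r[RDF + "object"]))
        node = successor.get(node)
    return ordered

stream_order = [("ex:g1", "ex:observedAt", 5),
                ("ex:g2", "ex:observedAt", 5),
                ("ex:g3", "ex:receivedAt", 2)]
assert recover_order(ordering_triples(stream_order)) == stream_order
```

The round trip succeeds even for equal timestamps and for different 
predicates, which are the two cases where Option 1 loses the order.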
*****

I think that for the abstract data model, Option 2 is best. In the case 
of queries that don't make use of the relative order information, it can 
be ignored, but it is there for the cases when it is needed, and it 
enables the full reconstruction of the original RDF stream from the 
unified RDF dataset.

Looking into the RDF 1.1 spec (http://www.w3.org/TR/rdf11-concepts/), I 
see that there is no requirement that RDF graphs be finite, and 
similarly RDF Datasets are not necessarily finite 
(http://www.w3.org/TR/rdf11-datasets/). Thus we are not restricted to 
finite streams when considering the unified RDF Dataset defined by the 
stream. Of course, the practical considerations of applying a query to 
an infinite RDF Dataset are another matter.

To discuss a concrete case, suppose that an RDF stream query consists of 
a window function that is parameterized by a time instant, a (possibly 
infinite) sequence of time instants as inputs for the temporal parameter 
of the window function, a SPARQL query, and an optional entailment 
regime. (I think this does not cover all possible cases of windowing 
operations, but that is a separate question)

Note. This section 
(http://www.w3.org/TR/2014/NOTE-rdf11-datasets-20140225/#sec-sparql) of 
the RDF 1.1 working group note on RDF Datasets compares the SPARQL 
definition of RDF Dataset with the RDF 1.1 definition. SPARQL is more 
restrictive than RDF 1.1 syntactically (names of graphs may not be blank 
nodes). Since our queries will in general use query parameters for the 
names of named graphs, this restriction is not significant. The precise 
semantics of RDF datasets doesn't appear to be necessary for the 
definition of subgraph matching SPARQL query semantics on RDF datasets. 
However, if SPARQL entailment regimes are to be allowed in RDF stream 
queries, then we need to consider the semantics in more detail - 
http://www.w3.org/TR/sparql11-entailment/#DataSets

Applying the window function to the original RDF stream for each member 
of the time sequence results in a sequence of RDF streams, which are all 
substreams of the original RDF stream. Based on the conversion to 
unified RDF dataset, this can also be treated as a sequence of RDF 
datasets. The SPARQL query is then applied to each unified RDF Dataset 
independently, producing a sequence of solution sequences.
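
The evaluation procedure described above can be sketched in Python. This 
is an illustration under stated assumptions, not a proposed 
implementation: stream elements are simplified to tuples (graph name, 
timestamp predicate, time, graph), times are plain integers, the stored 
stream is finite, the window is a fixed-width time window, and an 
ordinary Python function stands in for the SPARQL query. All names are 
hypothetical.

```python
def unify(substream):
    """Unified dataset (Option 1 style): (default graph, named graphs)."""
    default = {(name, pred, time) for name, pred, time, _ in substream}
    named = {name: graph for name, _, _, graph in substream}
    return default, named

def time_window(width):
    """Window function parameterized by a time instant t: (t - width, t]."""
    def window(stream, t):
        return [e for e in stream if t - width < e[2] <= t]
    return window

def evaluate(stream, window, instants, query):
    """Apply the query to the unified dataset of each windowed substream."""
    for t in instants:  # lazy, so the sequence of instants may be unbounded
        yield query(unify(window(stream, t)))

stream = [
    ("ex:g1", "ex:observedAt", 1, frozenset({("ex:a", "ex:p", 10)})),
    ("ex:g2", "ex:observedAt", 2, frozenset({("ex:a", "ex:p", 20)})),
    ("ex:g3", "ex:observedAt", 5, frozenset({("ex:b", "ex:p", 7)})),
]

def count_graphs(dataset):
    """Stand-in for a SPARQL query: number of named graphs in the window."""
    default, named = dataset
    return len(named)

results = list(evaluate(stream, time_window(2), [2, 4, 6], count_graphs))
assert results == [2, 0, 1]
```

Each instant yields its own result, so the output is a sequence with one 
entry per window, mirroring the sequence of solution sequences.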

Example A

Suppose we are given an RDF stream using a set of predicates that 
includes ex:observedAt, with named graphs containing triples using the 
property ex:propertyOfInterest with numerical values and multiple 
subjects. Assume the observations for each feature are equally spaced in 
time, so that a simple average is an appropriate aggregator.
The following query extracts the average value of the property for each 
observed feature.

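PREFIX ex: <http://example.org/>   # assumed namespace for the ex: prefix
SELECT ?feature (AVG(?val) AS ?avg)
WHERE {
   # timestamp triple in the default graph
   ?g ex:observedAt ?time .
   # observations inside the timestamped named graph
   GRAPH ?g { ?feature ex:propertyOfInterest ?val }
}
GROUP BY ?feature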



Example B

Suppose we are given an RDF stream using a set of predicates that 
includes ex:observedAt, with named graphs containing triples using the 
property ex:propertyOfInterest with numerical values and multiple 
subjects. Assume the observations for each feature are equally spaced in 
time, so that a simple average is an appropriate aggregator.
The following query extracts the average value of the property for each 
observed feature, and also notes the time of the first observation 
within the RDF stream.

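PREFIX ex: <http://example.org/>   # assumed namespace for the ex: prefix
SELECT ?feature (AVG(?val) AS ?avg) (MIN(?time) AS ?firstTime)
WHERE {
   # timestamp triple in the default graph
   ?g ex:observedAt ?time .
   # observations inside the timestamped named graph
   GRAPH ?g { ?feature ex:propertyOfInterest ?val }
}
GROUP BY ?feature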
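
To make the intended result of Example B concrete, here is a small Python 
sketch that mimics the query over a unified dataset held in memory. It 
assumes integer times and values, a default graph given as a set of 
triples, and named graphs given as a dictionary from graph name to a set 
of triples. The function name and the sample data are made up for 
illustration. It is not a SPARQL engine, only a hand-written equivalent 
of this one query.

```python
from collections import defaultdict
from statistics import mean

OBSERVED_AT = "ex:observedAt"
PROPERTY = "ex:propertyOfInterest"

def example_b(default, named):
    """Per feature: (average value, time of the first observation)."""
    values = defaultdict(list)
    times = defaultdict(list)
    for g, predicate, time in default:
        # Matches: ?g ex:observedAt ?time .
        if predicate != OBSERVED_AT or g not in named:
            continue
        # Matches: GRAPH ?g { ?feature ex:propertyOfInterest ?val }
        for feature, prop, val in named[g]:
            if prop == PROPERTY:
                values[feature].append(val)
                times[feature].append(time)
    # GROUP BY ?feature with AVG(?val) and MIN(?time)
    return {f: (mean(values[f]), min(times[f])) for f in values}

default = {("ex:g1", OBSERVED_AT, 1),
           ("ex:g2", OBSERVED_AT, 2),
           ("ex:g3", OBSERVED_AT, 3)}
named = {
    "ex:g1": {("ex:a", PROPERTY, 10)},
    "ex:g2": {("ex:a", PROPERTY, 20), ("ex:b", PROPERTY, 5)},
    "ex:g3": {("ex:b", PROPERTY, 7)},
}

result = example_b(default, named)
assert result == {"ex:a": (15, 1), "ex:b": (6, 2)}
```

In this sample, feature ex:a is first observed at time 1 and ex:b at 
time 2, so the first observation time differs between groups.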

Each element in the solution sequence corresponds to the average of the 
property of interest for a different feature, and each average has a 
time entity for when the first observation happened for that pair of 
feature and property, which may be different from the first observation 
time for other feature-property pairs.

If we want this RDF stream query to generate another RDF stream, then we 
may consider the following:

Option A. Use the CONSTRUCT form, or some modification of it, to create 
a timestamped graph from each *solution sequence*. This means the 
CONSTRUCT syntax must specify two graphs - the default graph containing 
the new timestamp triple, and the named graph it is attached to. The 
CONSTRUCT syntax needs to be able to describe how the new temporal 
entity is determined (current time, or derived from the timestamps), 
what predicate is used, the name of the named graph, and the contents of 
the named graph. Consider Example A - the timestamp could use the time 
entity that was used to parameterize the window function. However, this 
would have to be passed to the CONSTRUCT syntax somehow. I don't believe 
SPARQL syntax is able to do this, as it appears the output of CONSTRUCT 
is a single graph, not an RDF dataset.

Option B. Use the CONSTRUCT form, or some modification of it, to create 
two graphs from each *solution*. This would be similar to Option A, 
except that a timestamped graph must be specified based on the 
information in each solution, rather than the whole solution sequence. 
Example B above presents a use case. The resulting RDF streams would be 
merged for the final output. Note: This is one way to deal with the 
possibility that the window function output may not be finite (e.g. a 
lower-bound-only time window function). Again, I don't believe SPARQL 
syntax is able to do this.

Option C. Some generalization of Options A and B that produces a general 
RDF stream based on information from the solution sequence as well as 
passed provenance information, such as the time entity used to 
parameterize the window function, the name of the window function 
itself, the source of the RDF stream, the processing agent, ....

Some other use cases where the timestamp information is needed for 
querying:
1. the times of the timestamped graphs are not uniformly spaced, which 
may arise from sensor or transmission failure even in the case when 
uniformly-spaced observations are expected. Also, event-triggered 
observations will not in general be uniformly spaced in time.
2. queries that combine information from different timestamp predicates.

Tara
Received on Friday, 6 November 2015 13:36:40 UTC