- From: Tara Athan <taraathan@gmail.com>
- Date: Fri, 6 Nov 2015 08:36:09 -0500
- To: public-rsp@w3.org
- Message-ID: <563CACC9.5000708@gmail.com>
I was unable to obtain a copy of the LARS reference, but I looked at the
RSP-QL Semantics paper. It deals with an important special case - when
the queries make no use of the information in the timestamp or the
structure of named graphs - but I think it is too specialized for the
semantic foundation of querying on RDF streams. I agree with Minh that
the timestamped graphs obtained from applying the window function should
not be prematurely merged.
I think we can get a lot of mileage from what's already available in
SPARQL, which provides syntax for accessing simultaneously the default
graph and the named graphs of RDF datasets in a variety of ways.
(http://www.w3.org/TR/2013/REC-sparql11-query-20130321/#rdfDataset, esp.
13.3.4
http://www.w3.org/TR/2013/REC-sparql11-query-20130321/#namedAndDefaultGraph)
I suggest that we first investigate what can be accomplished with the
existing capabilities of SPARQL, and only add to it when it is shown
there is missing functionality. And once the semantics of queries for
the general case is defined, then syntax can be developed that addresses
both general and special cases.
First, we can define a "unified RDF dataset" of an RDF stream, so we have
something to apply the SPARQL query to. Ideally, the unified RDF dataset
would contain all the information of the RDF stream, so that the
original stream could be reconstructed from it.
*****
Option 1.
The following definition loses a small amount of information:
1. The union of the default graphs of all timestamped graphs in the RDF
stream is the default graph of the unified RDF dataset of the stream.
2. The union of the sets of named graphs of all timestamped graphs in
the RDF stream is the set of named graphs of the unified RDF dataset of
the stream.
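A minimal Python sketch of the Option 1 definition, for illustration only. It assumes a timestamped graph can be modelled as a pair of a default graph (a set of triples) and a mapping from graph names to sets of triples; the function name `unify`, the `ex:` identifiers, and the choice to merge graphs that share a name are assumptions of this sketch, not part of the proposal.

```python
def unify(stream):
    """Build the unified dataset (Option 1) from an iterable of timestamped graphs.

    Each timestamped graph is modelled as a pair (default_graph, named_graphs):
    default_graph is a set of triples, named_graphs maps a graph name to a set
    of triples. Triples are plain (subject, predicate, object) tuples.
    """
    default = set()
    named = {}
    for default_graph, named_graphs in stream:
        # 1. Union of the default graphs.
        default |= default_graph
        # 2. Union of the sets of named graphs.
        for name, triples in named_graphs.items():
            # Assumption of this sketch: graphs sharing a name are merged.
            named.setdefault(name, set()).update(triples)
    return default, named


# Hypothetical two-element stream.
stream = [
    ({("ex:g1", "ex:observedAt", "2015-11-06T08:00:00Z")},
     {"ex:g1": {("ex:s1", "ex:propertyOfInterest", 20.5)}}),
    ({("ex:g2", "ex:observedAt", "2015-11-06T08:05:00Z")},
     {"ex:g2": {("ex:s1", "ex:propertyOfInterest", 21.0)}}),
]
default, named = unify(stream)
```

Because both steps are set unions, feeding the same timestamped graphs in a different order gives the same result, which is exactly the information that Option 1 discards.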
This unified RDF dataset can be converted back to an RDF stream, but the
result is not necessarily unique unless we assert that
1. the relative order of timestamped graphs of different predicates is
not significant in the abstract data model of an RDF stream.
2. the relative order of timestamped graphs with the same predicate and
equal or incomparable temporal entities is not significant in the
abstract data model of an RDF stream.
The output of time-based window functions on specific predicates is
independent of this relative order information. However, the output of
window functions that are "cross-predicate" and/or count-based is
dependent on the relative order information.
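The difference can be seen in a small Python sketch, under the assumption that a timestamped graph is reduced to a (graph name, timestamp predicate, time) tuple and that window output is compared as a set. The two window functions and the `ex:reportedAt` predicate are hypothetical and only illustrate the point.

```python
def time_window(stream, predicate, start, end):
    """Time-based window on one predicate: keep elements with start <= time < end."""
    return {tg for tg in stream if tg[1] == predicate and start <= tg[2] < end}


def count_window(stream, n):
    """Count-based, cross-predicate window: keep the last n elements of the stream."""
    return set(stream[-n:])


# Two serializations of the same stream; g1 and g2 use different predicates
# and have equal times, so their relative order is the only difference.
stream_a = [("ex:g1", "ex:observedAt", 1),
            ("ex:g2", "ex:reportedAt", 1),
            ("ex:g3", "ex:observedAt", 2)]
stream_b = [stream_a[1], stream_a[0], stream_a[2]]

same_time_result = (time_window(stream_a, "ex:observedAt", 0, 3)
                    == time_window(stream_b, "ex:observedAt", 0, 3))
same_count_result = count_window(stream_a, 2) == count_window(stream_b, 2)
```

Here `same_time_result` is true while `same_count_result` is false: the last two elements of the stream depend on whether g1 or g2 came first.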
*****
Option 2.
To also capture the original order of timestamped graphs in the RDF
stream, additional triples could be added to the unified RDF dataset.
For example, each timestamp triple could be reified with a blank node,
and then have a "successor predicate" provide the ordering information
in the default graph.
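One possible reading of this, sketched in Python with plain tuples as triples. The predicate name `ex:successor`, the blank node labels, and the use of the standard rdf:subject / rdf:predicate / rdf:object reification vocabulary are assumptions of the sketch; the proposal does not fix any of them.

```python
def add_order(timestamp_triples):
    """Reify each timestamp triple with a blank node and chain the nodes in stream order."""
    triples = set()
    nodes = []
    for i, (s, p, o) in enumerate(timestamp_triples):
        node = f"_:t{i}"  # hypothetical blank node label
        nodes.append(node)
        triples |= {(node, "rdf:subject", s),
                    (node, "rdf:predicate", p),
                    (node, "rdf:object", o)}
    for earlier, later in zip(nodes, nodes[1:]):
        triples.add((earlier, "ex:successor", later))  # hypothetical predicate
    return triples


def recover_order(triples):
    """Rebuild the ordered list of timestamp triples from the reified form."""
    parts = {(s, p): o for s, p, o in triples}
    nodes = {s for s, p, _ in triples if p == "rdf:subject"}
    if not nodes:
        return []
    successor = {s: o for s, p, o in triples if p == "ex:successor"}
    # The first node is the only one that is nobody's successor.
    node = (nodes - set(successor.values())).pop()
    ordered = []
    while node is not None:
        ordered.append((parts[(node, "rdf:subject")],
                        parts[(node, "rdf:predicate")],
                        parts[(node, "rdf:object")]))
        node = successor.get(node)
    return ordered


# g1 and g2 have equal times, so only the successor chain records their order.
timestamps = [("ex:g1", "ex:observedAt", 1),
              ("ex:g2", "ex:reportedAt", 1),
              ("ex:g3", "ex:observedAt", 2)]
round_trip = recover_order(add_order(timestamps))
```

The round trip returns the timestamp triples in their original stream order, including the pair with equal times that Option 1 could not distinguish.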
*****
I think that for the abstract data model, Option 2 is best. In the case
of queries that don't make use of the relative order information, it can
be ignored, but it is there for the cases when it is needed, and it
enables the full reconstruction of the original RDF stream from the
unified RDF dataset.
Looking into the RDF 1.1 spec (http://www.w3.org/TR/rdf11-concepts/), I
see that there is no requirement that RDF graphs be finite, and
similarly RDF Datasets are not necessarily finite
(http://www.w3.org/TR/rdf11-datasets/). Thus we are not restricted to
finite streams when considering the unified RDF Dataset defined by the
stream. Of course, the practical considerations of applying a query to
an infinite RDF Dataset are another matter.
To discuss a concrete case, suppose that an RDF stream query consists of
a window function that is parameterized by a time instant, a (possibly
infinite) sequence of time instants as inputs for the temporal parameter
of the window function, a SPARQL query, and an optional entailment
regime. (I think this does not cover all possible cases of windowing
operations, but that is a separate question)
Note. This section
(http://www.w3.org/TR/2014/NOTE-rdf11-datasets-20140225/#sec-sparql) of
the RDF 1.1 Working Group Note on RDF Datasets compares the SPARQL
definition of RDF Dataset with the RDF 1.1 definition. SPARQL is more
restrictive than RDF 1.1 syntactically (names of graphs may not be blank
nodes). Since our queries will in general use query parameters for the
names of named graphs, this restriction is not significant. The precise
semantics of RDF datasets doesn't appear to be necessary for the
definition of subgraph-matching SPARQL query semantics on RDF datasets.
However, if SPARQL entailment regimes are to be allowed in RDF stream
queries, then we need to consider the semantics in more detail -
http://www.w3.org/TR/sparql11-entailment/#DataSets
Applying the window function to the original RDF stream for each member
of the time sequence results in a sequence of RDF streams, which are all
substreams of the original RDF stream. Based on the conversion to
unified RDF dataset, this can also be treated as a sequence of RDF
datasets. The SPARQL query is then applied to each unified RDF Dataset
independently, producing a sequence of solution sequences.
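The evaluation scheme just described can be sketched in Python as a generator, so that an unbounded sequence of time instants is handled lazily. Everything here is an illustrative assumption: stream elements are reduced to (time, graph name, value) tuples, `sliding_window` is one hypothetical window function, and `values` stands in for the SPARQL query (the conversion of each substream to a unified RDF dataset is skipped for brevity).

```python
from itertools import count, islice


def sliding_window(stream, t, width=2):
    """Hypothetical window function: elements with t - width < time <= t."""
    return [tg for tg in stream if t - width < tg[0] <= t]


def values(substream):
    """Stand-in for the SPARQL query: one solution per observation value."""
    return [value for _time, _name, value in substream]


def run(stream, instants, window, query):
    """Apply the window at each instant, then the query to each substream."""
    for t in instants:
        yield query(window(stream, t))


stream = [(1, "ex:g1", 10.0), (2, "ex:g2", 12.0),
          (3, "ex:g3", 14.0), (4, "ex:g4", 16.0)]

# count(2) is an infinite sequence of instants 2, 3, 4, ...; take three results.
results = list(islice(run(stream, count(2), sliding_window, values), 3))
```

Each entry of `results` is the solution sequence for one instant, so the overall output is a sequence of solution sequences.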
Example A
Consider an RDF stream using a set of predicates that includes
ex:observedAt, with named graphs containing triples using the property
ex:propertyOfInterest with numerical values and multiple subjects.
Assume the observations for each feature are equally spaced in time,
so that a simple average is an appropriate aggregator.
The following query extracts the average value of the property for each
observed feature.
SELECT ?feature (AVG(?val) AS ?avg)
WHERE {
  ?g ex:observedAt ?time .
  GRAPH ?g { ?feature ex:propertyOfInterest ?val }
} GROUP BY ?feature
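To make the intended result concrete, here is a plain Python emulation of the query in Example A over a toy unified dataset. It is only a sketch: the dataset layout (a set of default-graph triples plus a mapping from graph names to sets of triples), the function name, and the sample values are assumptions, and no SPARQL engine is involved.

```python
from collections import defaultdict
from statistics import mean


def average_per_feature(default, named):
    """Emulate Example A: join timestamp triples with their named graphs, group by feature."""
    grouped = defaultdict(list)
    for g, predicate, _time in default:
        if predicate != "ex:observedAt":
            continue
        # GRAPH ?g { ?feature ex:propertyOfInterest ?val }
        for feature, prop, val in named.get(g, ()):
            if prop == "ex:propertyOfInterest":
                grouped[feature].append(val)
    return {feature: mean(vals) for feature, vals in grouped.items()}


# Hypothetical unified dataset with two timestamped graphs.
default = {("ex:g1", "ex:observedAt", 1), ("ex:g2", "ex:observedAt", 2)}
named = {
    "ex:g1": {("ex:f1", "ex:propertyOfInterest", 10.0),
              ("ex:f2", "ex:propertyOfInterest", 3.0)},
    "ex:g2": {("ex:f1", "ex:propertyOfInterest", 14.0)},
}
averages = average_per_feature(default, named)
```

With this data, ex:f1 is observed in both graphs and ex:f2 only in the first, so the result has one average per feature.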
Example B
Consider an RDF stream using a set of predicates that includes
ex:observedAt, with named graphs containing triples using the property
ex:propertyOfInterest with numerical values and multiple subjects.
Assume the observations for each feature are equally spaced in time,
so that a simple average is an appropriate aggregator.
The following query extracts the average value of the property for each
observed feature, and also notes the time of the first observation
within the RDF stream.
SELECT ?feature (AVG(?val) AS ?avg) (MIN(?time) AS ?firstTime)
WHERE {
  ?g ex:observedAt ?time .
  GRAPH ?g { ?feature ex:propertyOfInterest ?val }
} GROUP BY ?feature
Each element in the solution sequence corresponds to the average of the
property of interest for a different feature, and each average has a
time entity for when the first observation happened for that pair of
feature and property, which may be different from the first observation
time for other feature-property pairs. If we want this RDF stream query
to generate another RDF stream, then we may consider the following:
Option A. Use the CONSTRUCT form, or some modification of it, to create
a timestamped graph from each *solution sequence*. This means the
CONSTRUCT syntax must specify two graphs - the default graph containing
the new timestamp triple, and the named graph it is attached to. The
CONSTRUCT syntax needs to be able to describe how the new temporal
entity is determined (current time, or derived from the timestamps),
what predicate is used, the name of the named graph, and the contents of
the named graph. Consider Example A - the timestamp could use the time
entity that was used to parameterize the window function. However, this
would have to be passed to the CONSTRUCT syntax somehow. I don't believe
SPARQL syntax is able to do this, as it appears the output of CONSTRUCT
is a single graph, not an RDF dataset.
Option B. Use the CONSTRUCT form, or some modification of it, to create
two graphs from each *solution*. This would be similar to Option A,
except that a timestamped graph must be specified based on the
information in each solution, rather than the whole solution sequence.
Example B above presents a use case. The RDF streams would be merged for
the final output. Note: This is one way to deal with the possibility
that the window function output may not be finite (e.g. a
lower-bound-only time window function). Again, I don't believe SPARQL
syntax is able to do this.
Option C. Some generalization of Options A and B that produces a general
RDF stream based on information from the solution sequence as well as
passed provenance information, such as the time entity used to
parameterize the window function, the name of the window function
itself, the source of the RDF stream, the processing agent, ....
Some other use cases where the timestamp information is needed for
querying:
1. The times of the timestamped graphs are not uniformly spaced, which
may arise from sensor or transmission failure even in the case when
uniformly-spaced observations are expected. Also, event-triggered
observations will not in general be uniformly spaced in time.
2. Queries that combine information from different timestamp predicates.
Tara
Received on Friday, 6 November 2015 13:36:40 UTC