- From: Tara Athan <taraathan@gmail.com>
- Date: Fri, 6 Nov 2015 08:36:09 -0500
- To: public-rsp@w3.org
- Message-ID: <563CACC9.5000708@gmail.com>
I was unable to obtain a copy of the LARS reference, but I looked at the RSP-QL Semantics paper. It deals with an important special case - when the queries make no use of the information in the timestamp or the structure of named graphs - but I think it is too specialized to serve as the semantic foundation of querying on RDF streams. I agree with Minh that the timestamped graphs obtained from applying the window function should not be prematurely merged.

I think we can get a lot of mileage from what's already available in SPARQL, which provides syntax for accessing simultaneously the default graph and the named graphs of RDF datasets in a variety of ways (http://www.w3.org/TR/2013/REC-sparql11-query-20130321/#rdfDataset, esp. 13.3.4 http://www.w3.org/TR/2013/REC-sparql11-query-20130321/#namedAndDefaultGraph). I suggest that we first investigate what can be accomplished with the existing capabilities of SPARQL, and only add to it when it is shown that functionality is missing. Once the semantics of queries for the general case is defined, syntax can be developed that addresses both general and special cases.

First, we can define a "unified RDF dataset" of an RDF stream, so we have something to apply the SPARQL query to. Ideally, the unified RDF dataset would contain all the information of the RDF stream, so that the original stream could be reconstructed from it.

*****
Option 1. There is a minor loss of information with the following definition:

1. The union of the default graphs of all timestamped graphs in the RDF stream is the default graph of the unified RDF dataset of the stream.
2. The union of the sets of named graphs of all timestamped graphs in the RDF stream is the set of named graphs of the unified RDF dataset of the stream.

This unified RDF dataset can be converted back to an RDF stream, but the result is not necessarily unique unless we assert that

1.
the relative order of timestamped graphs of different predicates is not significant in the abstract data model of an RDF stream.
2. the relative order of timestamped graphs with the same predicate and equal or incomparable temporal entities is not significant in the abstract data model of an RDF stream.

The output of time-based window functions on specific predicates is independent of this relative order information. However, the output of window functions that are "cross-predicate" and/or count-based is dependent on the relative order information.

*****
Option 2. To also capture the original order of timestamped graphs in the RDF stream, additional triples could be added to the unified RDF dataset. For example, each timestamp triple could be reified with a blank node, and a "successor predicate" could then provide the ordering information in the default graph.

*****
I think that for the abstract data model, Option 2 is best. In the case of queries that don't make use of the relative order information, it can be ignored, but it is there for the cases when it is needed, and it enables the full reconstruction of the original RDF stream from the unified RDF dataset.

Looking into the RDF 1.1 spec (http://www.w3.org/TR/rdf11-concepts/), I see that there is no requirement that RDF graphs be finite, and similarly RDF Datasets are not necessarily finite (http://www.w3.org/TR/rdf11-datasets/). Thus we are not restricted to finite streams when considering the unified RDF Dataset defined by the stream. Of course, the practical considerations of applying a query to an infinite RDF Dataset are another matter.

To discuss a concrete case, suppose that an RDF stream query consists of a window function that is parameterized by a time instant, a (possibly infinite) sequence of time instants as inputs for the temporal parameter of the window function, a SPARQL query, and an optional entailment regime.
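
As a rough illustration of Options 1 and 2 and of the windowing step in the query structure just described, here is a minimal Python sketch. It rests on several simplifying assumptions that are not part of any spec: temporal entities are plain numbers, triples are tuples, the window is a hypothetical time-based window covering (instant - width, instant], and ex:successor and the _:tsN blank node labels are made-up names. It is a sketch of the idea, not a proposed implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimestampedGraph:
    name: str            # IRI of the named graph
    predicate: str       # timestamp predicate, e.g. "ex:observedAt"
    time: float          # temporal entity, simplified here to a number
    triples: frozenset   # (subject, predicate, object) triples of the named graph


def unify(stream, keep_order=False):
    """Build (default_graph, named_graphs) from a list of timestamped graphs."""
    default_graph = set()
    named_graphs = {}
    previous = None
    for i, tg in enumerate(stream):
        # Option 1: union of the timestamp triples and union of the named graphs.
        default_graph.add((tg.name, tg.predicate, tg.time))
        named_graphs.setdefault(tg.name, set()).update(tg.triples)
        if keep_order:
            # Option 2: reify the timestamp triple with a blank node and chain
            # the blank nodes with a (hypothetical) successor predicate.
            node = f"_:ts{i}"
            default_graph.update({
                (node, "rdf:subject", tg.name),
                (node, "rdf:predicate", tg.predicate),
                (node, "rdf:object", tg.time),
            })
            if previous is not None:
                default_graph.add((previous, "ex:successor", node))
            previous = node
    return default_graph, named_graphs


def time_window(stream, instant, width):
    """Substream of graphs whose time lies in (instant - width, instant]."""
    return [tg for tg in stream if instant - width < tg.time <= instant]


stream = [
    TimestampedGraph("ex:g1", "ex:observedAt", 1.0,
                     frozenset({("ex:f1", "ex:propertyOfInterest", 10)})),
    TimestampedGraph("ex:g2", "ex:observedAt", 2.0,
                     frozenset({("ex:f1", "ex:propertyOfInterest", 14)})),
    TimestampedGraph("ex:g3", "ex:observedAt", 3.0,
                     frozenset({("ex:f2", "ex:propertyOfInterest", 7)})),
]

# One unified dataset per time instant in the query's input sequence.
datasets = [unify(time_window(stream, t, width=2.0), keep_order=True)
            for t in (2.0, 3.0)]
```

Each entry of `datasets` is what the SPARQL query would then be applied to independently. With `keep_order=False` the successor chain is omitted, which corresponds to Option 1 and its loss of relative order information.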
(I think this form of query does not cover all possible cases of windowing operations, but that is a separate question.)

Note. This section (http://www.w3.org/TR/2014/NOTE-rdf11-datasets-20140225/#sec-sparql) of the RDF 1.1 working group note on RDF Datasets compares the SPARQL definition of RDF Dataset with the RDF 1.1 definition. SPARQL is more restrictive than RDF 1.1 syntactically (names of graphs may not be blank nodes). Since our queries will in general use query parameters for the names of named graphs, this restriction is not significant. The precise semantics of RDF datasets doesn't appear to be necessary for the definition of subgraph-matching SPARQL query semantics on RDF datasets. However, if SPARQL entailment regimes are to be allowed in RDF stream queries, then we need to consider the semantics in more detail - http://www.w3.org/TR/sparql11-entailment/#DataSets

Applying the window function to the original RDF stream for each member of the time sequence results in a sequence of RDF streams, which are all substreams of the original RDF stream. Based on the conversion to unified RDF dataset, this can also be treated as a sequence of RDF datasets. The SPARQL query is then applied to each unified RDF Dataset independently, producing a sequence of solution sequences.

Example A

Consider an RDF stream using a set of predicates that includes ex:observedAt, with named graphs containing triples using the property ex:propertyOfInterest with numerical values and multiple subjects. Assume the observations for each feature are equally spaced in time, so that a simple average is an appropriate aggregator. The following query extracts the average value of the property for each observed feature.

SELECT ?feature (AVG(?val) AS ?avg)
WHERE {
  ?g ex:observedAt ?time .
  GRAPH ?g { ?feature ex:propertyOfInterest ?val }
}
GROUP BY ?feature

Example B

Consider an RDF stream using a set of predicates that includes ex:observedAt, with named graphs containing triples using the property ex:propertyOfInterest with numerical values and multiple subjects. Assume the observations for each feature are equally spaced in time, so that a simple average is an appropriate aggregator. The following query extracts the average value of the property for each observed feature, and also notes the time of the first observation within the RDF stream.

SELECT ?feature (AVG(?val) AS ?avg) (MIN(?time) AS ?firstTime)
WHERE {
  ?g ex:observedAt ?time .
  GRAPH ?g { ?feature ex:propertyOfInterest ?val }
}
GROUP BY ?feature

Each element in the solution sequence corresponds to the average of the property of interest for a different feature, and each average has a time entity for when the first observation happened for that pair of feature and property, which may be different from the first observation time for other feature-property pairs.

If we want this RDF stream query to generate another RDF stream, then we may consider the following:

Option A. Use the CONSTRUCT form, or some modification of it, to create a timestamped graph from each *solution sequence*. This means the CONSTRUCT syntax must specify two graphs - the default graph containing the new timestamp triple, and the named graph it is attached to. The CONSTRUCT syntax needs to be able to describe how the new temporal entity is determined (current time, or derived from the timestamps), what predicate is used, the name of the named graph, and the contents of the named graph. Consider Example A - the timestamp could use the time entity that was used to parameterize the window function. However, this would have to be passed to the CONSTRUCT syntax somehow. I don't believe SPARQL syntax is able to do this, as it appears the output of CONSTRUCT is a single graph, not an RDF dataset.

Option B.
Use the CONSTRUCT form, or some modification of it, to create two graphs from each *solution*. This would be similar to Option A, except that a timestamped graph must be specified based on the information in each solution, rather than the whole solution sequence. Example B above presents a use case. The RDF streams would be merged for the final output. Note: This is one way to deal with the possibility that the window function output may not be finite (e.g. a lower-bound-only time window function). Again, I don't believe SPARQL syntax is able to do this.

Option C. Some generalization of Options A and B that produces a general RDF stream based on information from the solution sequence as well as passed provenance information, such as the time entity used to parameterize the window function, the name of the window function itself, the source of the RDF stream, the processing agent, ....

Some other use cases where the timestamp information is needed for querying:

1. The times of the timestamped graphs are not uniformly spaced, which may arise from sensor or transmission failure even when uniformly spaced observations are expected. Also, event-triggered observations will not in general be uniformly spaced in time.
2. Queries that combine information from different timestamp predicates.

Tara
Received on Friday, 6 November 2015 13:36:40 UTC