SPARQL Query Language from Herman ter Horst on 2005-10-28 (public-rdf-dawg-comments@w3.org from October 2005)

From: Herman ter Horst <herman.ter.horst@philips.com>
Date: Fri, 28 Oct 2005 13:45:37 +0200
To: public-rdf-dawg-comments@w3.org
Message-ID: <OFC99B9CA9.34EA98A2-ONC12570A8.003B7029-C12570A8.0040C484@philips.com>
SPARQL Query Language for RDF: review comments
(apologies that these comments are late)

=

Query facilities play an important role for RDF.
It is good that a standard query language for RDF is being
developed.
The document describes useful query primitives for RDF.
The document represents a lot of work.

=

In my view, it is difficult to build up a clear and
precise picture of SPARQL queries from the document.
A number of formal definitions in the document do not
seem to be sufficiently clear and precise.
For the normative description of RDF queries, this is a
problem, in my view.  Clear and precise definitions can 
form a basis for interoperability.
The document seems to be stronger on examples than on
definitions.  This holds, in particular, for the
definitions relating to graph patterns and matching.
I expand on this in the following comments.

-7: "a graph pattern P, where P is not a dataset graph
pattern, matches a dataset with solution S if P matches
the default graph of the dataset"
It seems that this definition does not reflect correctly
the intention w.r.t. examples 8.3 and 8.4.
The graph pattern of ex.8.3 seems to form a group pattern
consisting of two dataset graph patterns; therefore, this is not
a dataset graph pattern, so the definition prescribes that the
default graph should be matched, which is clearly not what
is intended.
-7: states a matching process but does not state what
a dataset graph pattern is, although that is the title
of the second definition.

-5.4: Optional graph patterns are defined formally as pairs
Opt(A,B) of graph patterns. How should the
graph pattern P1 OPT P2 OPT P3 in the example be interpreted
in terms of pairs: as (P1 OPT P2) OPT P3?
This is not made explicit in the document.
-5.4: The definition of matching of Opt(A,B) seems to miss
a kind of maximality requirement: if it is possible to
match B by extending a solution, then this should be done.
-5.4: the definition of matching Opt(A,B) does not mention
graphs.

-2.7: The text on "blank nodes and query results" is informal,
using only an example.  It seems that the point made affects
the matching process, although this is not included
in the definition of matching.  If I interpret 2.7
correctly, the following reformulation of the definition of
matching basic graph patterns is what is actually intended:
a basic graph pattern P and a solution S match a graph G
if S(P) is an RDF graph that is equivalent to a subset of G
(equivalent, in the sense defined in RDF Concepts).

In a later comment below I expand further on the intention
expressed informally in 2.7.  In particular I suggest to define
a notion of equivalence of solutions.

- The name "group" in "group patterns" does not illustrate
the intended meaning.  How does a group differ from a set?
(There is a standard mathematical meaning of the word group,
which does not apply to this context, which is confusing.)
I am in favor of replacing the term "group pattern" with
"conjunctive pattern", which clearly indicates the intended
meaning.
It would then be natural to replace "alternative patterns"
with "disjunctive patterns".

-2.2. This section, entitled "initial definitions", gives
an 8-line definition of graph patterns.  Since these lines
consist of forward references, this is not a definition
(in itself). (A similar list is provided in 4, but the latter
list is not equal to the first list.)
This definition and the following definition, of queries,
cannot be understood at this point because their parts cannot
be understood.
Section 5.3 states: "graph patterns are defined recursively".
For the complete definition of graph patterns, one needs
to go back and forth through the document: the parts of the
defition are interleaved with examples, information about
blank nodes, information about literals, etc.
Trying to understand the definition of graph patterns and
matching from the document is a time-consuming process which
does not lead completely to the desired understanding.
Quite generally, the forward references in formal definitions
leads to a problem for readibility, in my view.

In my view, the clarity of the SPARQL spec would benefit immensely
by collecting the definitions of graph pattern and matching
in one place, and by doing the same thing with other
formal definitions (although graph patterns and matching seem
to form most of the complexity).
The document can then be read in normal, forward-reading mode.
One can start with the formal definition of queries and query
solutions when the definitions of graph pattern and matching
(and the other ingredients of queries) are clear.
To make a comparison: the semantics specs for RDF and OWL
also have basic, relatively complicated technical ingredients
(i.e. RDFS/OWL interpretations) which are completely defined
before they are used in the corresponding definitions of
entailment.
(For a primer/tutorial document, a different way of
presentation would be preferable than for a formal, normative
specification.)

To make this comment more specific, I include a suggestion
for a possible way to phrase the top-level definitions of
graph pattern and matching.
I am making up some abstract syntax, to complete the
parts of the abstract syntax already included in the
document: OPT(P,Q), Graph(g,P).
Several parts of the two definitions that follow are copied
from the document.

Definition: graph pattern.
Graph patterns are defined recursively.
Basic graph patterns are graph patterns
If P1,...,Pn are graph patterns, then CON(P1,...,Pn)
is a graph pattern: conjunctive graph pattern
If P1,...,Pn are graph patterns, then DIS(P1,...,Pn)
is a graph pattern: disjunctive graph pattern
A value constraint is a graph pattern
If P and Q are graph patterns, then OPT(P,Q) is a
graph pattern: optional graph pattern
If g is a IRI or a variable and P a graph pattern,
then Graph(g,P) is a graph pattern: dataset graph pattern

For matching, it seems that datasets need to be considered
from the start, in view of the above comment on Section 7.
(It would be good to mention explicitly that an RDF graph
is viewed as an RDF dataset with no named graphs.)

Definition: matching of a graph pattern P and a solution S
on an RDF dataset DS = G,(ui,Gi).
basic graph pattern P: S(P) is an RDF graph that is equivalent
to a subset of G
CON(P1,...,Pn): for each j, Pj,S matches DS
DIS(P1,...,Pn): there is a j such that Pj,S matches DS
value constraint C: S(C) is true
Opt(P,Q): P,S matches DS and if possible, Q,S matches DS
(i.e. S is not "weaker/smaller than possible")
Graph(g,P): there is a j such that (g=uj or S(g)=uj)
and P,S matches Gj,(ui,Gi).

=

Definitions such as these would also facilitate the development
of proofs of statements about queries, by structural induction
on graph patterns.

For example, the intention that the identity of blank nodes
is irrelevant, expressed informally in 2.7, could be made
more precise in the following way.

Definition: two datasets G,(ui,Gi) and H,(ui,Hi) are equivalent
if G is equivalent to H and if for each i, Gi is equivalent to Hi.

Definition: two solutions S, T are equivalent if there is a
function U that maps each IRI and literal to itself,
that maps blank nodes to blank nodes, that maps different
blank nodes to different blank nodes, such that S and T
have the same domains, and such that
for each item x in the domain of S or T, S(x)=U(T(x)).

It follows by structural induction on graph patterns that
if the graph pattern P and the solution S match the dataset DS,
S is equivalent to S' and DS is equivalent to DS',
then P and S' match DS'.

=

The global structure of the document does not seem to be as
coherent and systematic as it could be.
This relates to the preceding comments about the way the
definitions are presented, and is reflected in the TOC.
For example, the chapter title "making simple queries"
suggests a tutorial kind of introduction, although the chapter
includes many definitions. The chapter title "working with RDF
literals" does not indicate the place of this chapter in a spec
of RDF queries. The chapter "graph patterns" only expands on
group graph patterns.  Why are there three chapters
on datasets? The chapter "query result forms" does not
only deal with query result forms but also with
solution modifiers.

It seems that a more logical, systematic way of organizing the
document, without the forward references that are now in the
formal definitions, could for example include four chapters
with the following titles, in the following order:
/RDF datasets
/Value constraints
/graph patterns and matching
/queries

=

10./10.2: 10.1 speaks of "the solutions of a graph pattern",
in 10.2 the definition of SELECT speaks of "the set of
solutions found by matching DS with G"
It seems that this could be made more precise, in line with
the intention expressed informally in 2.7, by noting that
these sets of solutions do not need to include solutions
that are equivalent in the sense defined above.

=

10.1: In the first definition, solution sequence, it is
confusing, in view of the standard meaning of
list, that a "list" is an "unordered collection".
Does the following fit with the intention?
A (instead of "The") solution sequence is any ordered collection
formed from the (non-equivalent) solutions of the query pattern.

10.1: definition of distinct solution sequence.
first, set(S) is the set of solutions in S (rather than the
set of solution sequences)
second, the phrase "distinct(S)=(Si|Si!=Sj for all i!=j)" does not
provide a definition of distinct(S); it seems to be enough
to give a definition in words, e.g.: distinct(S) is a
solution sequence that does not contain duplicate solutions
and such that set(distinct(S))=set(S).

=

10.3.2: definition of graph template:
S(tj) should be a set containing just one RDF triple
(if all variables ...), rather than an RDF triple;
otherwise, the union definition that follows does not work.

=


Herman ter Horst
Philips Research
Received on Friday, 28 October 2005 11:48:01 UTC