Notes on ter Horst's comments: SPARQL Query Language from Eric Prud'hommeaux on 2006-03-03 (public-rdf-dawg@w3.org from January to March 2006)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Fri, 3 Mar 2006 07:14:24 -0500
To: public-rdf-dawg@w3.org
Message-ID: <20060303121424.GD17752@w3.org>
pending headers:
To: Herman ter Horst <herman.ter.horst@philips.com>
Cc: public-rdf-dawg-comments@w3.org


On Fri, Oct 28, 2005 at 01:45:37PM +0200, Herman ter Horst wrote:
> SPARQL Query Language for RDF: review comments
> (apologies that these comments are late)
> 
> =
> 
> Query facilities play an important role for RDF.
> It is good that a standard query language for RDF is being
> developed.
> The document describes useful query primitives for RDF.
> The document represents a lot of work.
> 
> =
> 
> In my view, it is difficult to build up a clear and
> precise picture of SPARQL queries from the document.
> A number of formal definitions in the document do not
> seem to be sufficiently clear and precise.
> For the normative description of RDF queries, this is a
> problem, in my view.  Clear and precise definitions can 
> form a basis for interoperability.
> The document seems to be stronger on examples than on
> definitions.  This holds, in particular, for the
> definitions relating to graph patterns and matching.
> I expand on this in the following comments.
> 
> -7: "a graph pattern P, where P is not a dataset graph
> pattern, matches a dataset with solution S if P matches
> the default graph of the dataset"
> It seems that this definition does not reflect correctly
> the intention w.r.t. examples 8.3 and 8.4.
> The graph pattern of ex.8.3 seems to form a group pattern
> consisting of two dataset graph patterns; therefore, this is not
> a dataset graph pattern, so the definition prescribes that the
> default graph should be matched, which is clearly not what
> is intended.
> -7: states a matching process but does not state what
> a dataset graph pattern is, although that is the title
> of the second definition.
> 
> -5.4: Optional graph patterns are defined formally as pairs
> Opt(A,B) of graph patterns. How should the
> graph pattern P1 OPT P2 OPT P3 in the example be interpreted
> in terms of pairs: as (P1 OPT P2) OPT P3?
> This is not made explicit in the document.

The current text has:
[[
The OPTIONAL keyword is left-associative :

pattern OPTIONAL { pattern } OPTIONAL { pattern }

matches the same as:

{ pattern OPTIONAL { pattern } } OPTIONAL { pattern }
]]

> -5.4: The definition of matching of Opt(A,B) seems to miss
> a kind of maximality requirement: if it is possible to
> match B by extending a solution, then this should be done.
> -5.4: the definition of matching Opt(A,B) does not mention
> graphs.
> 
> -2.7: The text on "blank nodes and query results" is informal,
> using only an example.  It seems that the point made affects
> the matching process, although this is not included
> in the definition of matching.  If I interpret 2.7
> correctly, the following reformulation of the definition of
> matching basic graph patterns is what is actually intended:
> a basic graph pattern P and a solution S match a graph G
> if S(P) is an RDF graph that is equivalent to a subset of G
> (equivalent, in the sense defined in RDF Concepts).
> 
> In a later comment below I expand further on the intention
> expressed informally in 2.7.  In particular I suggest to define
> a notion of equivalence of solutions.
> 
> - The name "group" in "group patterns" does not illustrate
> the intended meaning.  How does a group differ from a set?
> (There is a standard mathematical meaning of the word group,
> which does not apply to this context, which is confusing.)
> I am in favor of replacing the term "group pattern" with
> "conjunctive pattern", which clearly indicates the intended
> meaning.
> It would then be natural to replace "alternative patterns"
> with "disjunctive patterns".
> 
> -2.2. This section, entitled "initial definitions", gives
> an 8-line definition of graph patterns.  Since these lines
> consist of forward references, this is not a definition
> (in itself). (A similar list is provided in 4, but the latter
> list is not equal to the first list.)
> This definition and the following definition, of queries,
> cannot be understood at this point because their parts cannot
> be understood.
> Section 5.3 states: "graph patterns are defined recursively".
> For the complete definition of graph patterns, one needs
> to go back and forth through the document: the parts of the
> defition are interleaved with examples, information about
> blank nodes, information about literals, etc.
> Trying to understand the definition of graph patterns and
> matching from the document is a time-consuming process which
> does not lead completely to the desired understanding.
> Quite generally, the forward references in formal definitions
> leads to a problem for readibility, in my view.
> 
> In my view, the clarity of the SPARQL spec would benefit immensely
> by collecting the definitions of graph pattern and matching
> in one place, and by doing the same thing with other
> formal definitions (although graph patterns and matching seem
> to form most of the complexity).
> The document can then be read in normal, forward-reading mode.
> One can start with the formal definition of queries and query
> solutions when the definitions of graph pattern and matching
> (and the other ingredients of queries) are clear.
> To make a comparison: the semantics specs for RDF and OWL
> also have basic, relatively complicated technical ingredients
> (i.e. RDFS/OWL interpretations) which are completely defined
> before they are used in the corresponding definitions of
> entailment.
> (For a primer/tutorial document, a different way of
> presentation would be preferable than for a formal, normative
> specification.)

DanC: we will consider this during CR

> To make this comment more specific, I include a suggestion
> for a possible way to phrase the top-level definitions of
> graph pattern and matching.
> I am making up some abstract syntax, to complete the
> parts of the abstract syntax already included in the
> document: OPT(P,Q), Graph(g,P).
> Several parts of the two definitions that follow are copied
> from the document.
> 
> Definition: graph pattern.
> Graph patterns are defined recursively.
> Basic graph patterns are graph patterns
> If P1,...,Pn are graph patterns, then CON(P1,...,Pn)
> is a graph pattern: conjunctive graph pattern
> If P1,...,Pn are graph patterns, then DIS(P1,...,Pn)
> is a graph pattern: disjunctive graph pattern
> A value constraint is a graph pattern
> If P and Q are graph patterns, then OPT(P,Q) is a
> graph pattern: optional graph pattern
> If g is a IRI or a variable and P a graph pattern,
> then Graph(g,P) is a graph pattern: dataset graph pattern
> 
> For matching, it seems that datasets need to be considered
> from the start, in view of the above comment on Section 7.
> (It would be good to mention explicitly that an RDF graph
> is viewed as an RDF dataset with no named graphs.)
> 
> Definition: matching of a graph pattern P and a solution S
> on an RDF dataset DS = G,(ui,Gi).
> basic graph pattern P: S(P) is an RDF graph that is equivalent
> to a subset of G
> CON(P1,...,Pn): for each j, Pj,S matches DS
> DIS(P1,...,Pn): there is a j such that Pj,S matches DS
> value constraint C: S(C) is true
> Opt(P,Q): P,S matches DS and if possible, Q,S matches DS
> (i.e. S is not "weaker/smaller than possible")
> Graph(g,P): there is a j such that (g=uj or S(g)=uj)
> and P,S matches Gj,(ui,Gi).
> 
> =
> 
> Definitions such as these would also facilitate the development
> of proofs of statements about queries, by structural induction
> on graph patterns.
> 
> For example, the intention that the identity of blank nodes
> is irrelevant, expressed informally in 2.7, could be made
> more precise in the following way.
> 
> Definition: two datasets G,(ui,Gi) and H,(ui,Hi) are equivalent
> if G is equivalent to H and if for each i, Gi is equivalent to Hi.
> 
> Definition: two solutions S, T are equivalent if there is a
> function U that maps each IRI and literal to itself,
> that maps blank nodes to blank nodes, that maps different
> blank nodes to different blank nodes, such that S and T
> have the same domains, and such that
> for each item x in the domain of S or T, S(x)=U(T(x)).
> 
> It follows by structural induction on graph patterns that
> if the graph pattern P and the solution S match the dataset DS,
> S is equivalent to S' and DS is equivalent to DS',
> then P and S' match DS'.
> 
> =
> 
> The global structure of the document does not seem to be as
> coherent and systematic as it could be.
> This relates to the preceding comments about the way the
> definitions are presented, and is reflected in the TOC.
> For example, the chapter title "making simple queries"
> suggests a tutorial kind of introduction, although the chapter
> includes many definitions. The chapter title "working with RDF
> literals" does not indicate the place of this chapter in a spec
> of RDF queries. The chapter "graph patterns" only expands on
> group graph patterns.  Why are there three chapters
> on datasets? The chapter "query result forms" does not
> only deal with query result forms but also with
> solution modifiers.
> 
> It seems that a more logical, systematic way of organizing the
> document, without the forward references that are now in the
> formal definitions, could for example include four chapters
> with the following titles, in the following order:
> /RDF datasets
> /Value constraints
> /graph patterns and matching
> /queries
> 
> =
> 
> 10./10.2: 10.1 speaks of "the solutions of a graph pattern",
> in 10.2 the definition of SELECT speaks of "the set of
> solutions found by matching DS with G"
> It seems that this could be made more precise, in line with
> the intention expressed informally in 2.7, by noting that
> these sets of solutions do not need to include solutions
> that are equivalent in the sense defined above.
> 
> =
> 
> 10.1: In the first definition, solution sequence, it is
> confusing, in view of the standard meaning of
> list, that a "list" is an "unordered collection".
> Does the following fit with the intention?
> A (instead of "The") solution sequence is any ordered collection
> formed from the (non-equivalent) solutions of the query pattern.
> 
> 10.1: definition of distinct solution sequence.
> first, set(S) is the set of solutions in S (rather than the
> set of solution sequences)
> second, the phrase "distinct(S)=(Si|Si!=Sj for all i!=j)" does not
> provide a definition of distinct(S); it seems to be enough
> to give a definition in words, e.g.: distinct(S) is a
> solution sequence that does not contain duplicate solutions
> and such that set(distinct(S))=set(S).
> 
> =
> 
> 10.3.2: definition of graph template:
> S(tj) should be a set containing just one RDF triple
> (if all variables ...), rather than an RDF triple;
> otherwise, the union definition that follows does not work.
> 
> =
> 
> 
> Herman ter Horst
> Philips Research
-- 
-eric

office: +81.466.49.1170 W3C, Keio Research Institute at SFC,
                        Shonan Fujisawa Campus, Keio University,
                        5322 Endo, Fujisawa, Kanagawa 252-8520
                        JAPAN
        +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
cell:   +81.90.6533.3882

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.
Received on Friday, 3 March 2006 12:14:30 UTC