Re: comments on Section 1 and Section 2 of SPARQL Query Language for RDF [OK?] [needstest] from Dan Connolly on 2006-03-22 (public-rdf-dawg-comments@w3.org from March 2006)

From: Dan Connolly <connolly@w3.org>
Date: Wed, 22 Mar 2006 11:46:42 -0600
To: "Peter F. Patel-Schneider" <pfps@research.bell-labs.com>
Cc: public-rdf-dawg-comments@w3.org
Message-Id: <1143049602.12963.360.camel@dirk.w3.org>
On Wed, 2006-02-22 at 18:56 -0500, Peter F. Patel-Schneider wrote:
> 
> 
> Comments on Section 1 and Section 2 of
> 
> 	SPARQL Query Language for RDF
> 	W3C Working Draft 20 February 2006
> 	http://www.w3.org/TR/2006/WD-rdf-sparql-query-20060220/
> 
> 
> These are personal comments, from me, an interested expert.  They may not
> reflect the views of any institution to which I am associated.

Thank you very much for your detailed review...


> In general I found the first two sections of the document *very* hard to
> understand.  The mixing of definitions, explanation, information, etc. confused
> me over and over again.  I strongly suggest an organization something like:
> 
>   Introduction (informative)
>   Formal development (normative)
>     Underlying notions (normative)
>     Patterns and matching (normative)
>   SPARQL syntax (normative)
>   Informal narrative (informative)
>   Examples (informative)
> 
> I also found that things that didn't need to be explained were explained, and
> things that did need to be explained were not explained.  A major example of
> the latter is the role of the scoping graph.  Examples showing why E-matching
> is defined the way it is would be particularly useful.
> 
> 
> Because of the problems I see in Section 2, I do not feel that I can adequately
> understand the remainder of the document.  
> 
> Because of these problems I do not feel that this document should be advanced
> to the next stage in the W3C recommendation process without going through
> another last-call stage.  (This could however be performed by terminating the
> current last call, quickly fixing the document, and starting another last
> call.)

After perhaps overly brief consideration of your comments, we are
somewhat sympathetic to your concerns about organization and
clarity; however, we also have to weigh our schedule and the
investment other reviewers have already made. Re-organizing the document
at this stage would delay things considerably; it's not even clear
that we could get a sufficient number of reviewers to take another
look before CR.

The specific examples you give below are very valuable; I
am marking this thread [needstest], which allows us to find
it more easily during CR and integrate the examples you give
into our test suite. We have also discussed the possibility
of significant organizational changes after CR, such as
moving the formal definitions to the back of the document.

As far as I can tell, all of the examples you give are useful
clarification questions, but they do not demonstrate design errors.
If they do, in fact, demonstrate design errors, I'm reasonably
confident we will discover that as we integrate them into
our test suite during CR.

Are you, by chance, satisfied by this response, which does
not involve making the changes you request at this time,
but includes an offer to give them due consideration after
we request CR? If not, there's no need to reply; I'm marking
this comment down as outstanding dissent unless I hear otherwise.


> Specific comments follow:
> 
> Section 1.
> 
> 	An RDF graph is a set of triples; each triple consists of a
> 	<em>subject</em>, a <em>predicate</em> and an <em>object</em>. This is
> 	defined in RDF Concepts and Abstract Syntax.
> 
> C1.1: An unqualified "this" cannot be used at the beginning of the second sentence.
> 
> 	The RDF graph may be virtual, in that it is not fully materialized,
> 
> C1.2: Defining virtual in terms of another term that is not itself defined is not
> very useful.
> 
> 	only doing the work needed for each query to execute.
> 
> C1.3: Who is doing what work here?
> 
> 	SPARQL is a query language for getting information from such RDF
> 	graphs. 
> 
> C1.4: Surely a more formal tone is called for here.
> 
> 	It provides facilities to:
> 	- extract information in the form of URIs, blank nodes, plain and typed
> 	literals.
> 	- extract RDF subgraphs.
> 	- construct new RDF graphs based on information in the queried graphs.
> 
> C1.5: I don't recognize the intent of SPARQL in any of these options.
> 
> 	As a data access language, it is suitable for both local and remote
> 	use. 
> 
> C1.6: The "it" is rather too far from its referent.
> 
> 	The companion SPARQL Protocol for RDF document describes the remote
> 	access protocol.
> 
> C1.7: What about the "local" access protocol?  Is there one?  If so, where is it?  If
> not, why is there not one?
> 
> 	<!-- Commented Document Outline -->
> 
> C1.8: There appear to be significant commented-out portions of the document.  Do
> such parts of the document have any import?  If so, then they probably should
> not be commented-out.  If not, then the commented-out portions should be
> removed.
> 
> 
> Section 2.
> 
> C2.15: In general, Section 2 switches modes much too much.  Which parts of
> Section 2 are tutorial?  Which are definitional?  Which are explanatory?
> 
> 	The SPARQL query language is based on matching graph patterns.
> 
> C2.1: What is a "matching graph pattern"?  I do not believe that it is defined
> in the remainder of the document.  (Yes, yes, I know that the problem is
> actually that the sentence itself is poorly constructed.)
> 
> 	The simplest graph pattern is the triple pattern, which is like an RDF
> 	triple, but with the possibility of a variable instead of an RDF term
> 	in the subject, predicate or object positions.
> 
> C2.4: This should probably be stated more precisely, using, at least "and/or".
> 
> 	Combining triple gives a basic graph pattern, where an exact match to a
> 	graph is needed to fulfill a pattern.
> 
> C2.2: Probably "triple" should be "triples".
> 
> C2.3: I do not believe that this matches the intent of SPARQL queries.
> 
> 	The example below shows a SPARQL query to find the title of a book from
> 	the information in the given RDF graph.
> 
> C2.5: The use of "the given" here is not helpful.  I feel that it would be better
> to use an indefinite article instead.
> 
> 
> 	The terms delimited by "<>" are IRI references [...].  They stand for
> 	IRIs, either directly, or relative to a base IRI.
> 
> C2.6: What is a term?  Which terms?  What does "stand for" mean here?  What
> role does the base IRI play in this "stand for" relationship?
> 
> C2.7: The rules for IRIs are not adequately specified in Section 2.1.1.  Are
> the two abbreviated mechanisms enclosed in "<>"?  Can a prefix expand to a
> relative IRI?
> 
> 	optional datatype IRI or prefixed name (introduced by ^^)
> 
> C2.8: Can this be a relative IRI?  Is it expanded using the rules of
> Section 2.1.1?
> 
> 	Variables in SPARQL queries have global scope; it is the same variable
> 	everywhere in the query that the same name is used
> 
> C2.9:  Wrong number agreement.
> 
> 	Blank nodes are indicated by either the form _:a or use of [ ].
> 
> C2.10: Is _:a the *only* blank node allowed?  If not, which parts of these bits
> of syntax can vary, and how?
> 
> 	Triple Patterns are written as a list of subject, predicate, object; 
> 
> C2.11: The examples of triple patterns don't seem to be written this way.
> 
> 	The following examples express the same query: 
> 	[several examples]
> 	Prefixes are syntactic: the prefix name does not affect the query, nor
> 	do prefix names in queries need to be the same prefixes as used in a
> 	serialization of the data. The following query is equivalent to the
> 	previous examples and will give the same results when applied to the
> 	same data:
> 	[one example]
> 
> C2.12: The first group of examples appears to exhibit more internal variability
> than the single example adds.  Why, then, is the single example broken out?  Is
> there something that I am missing here?
> 
> 
> 	The data format used in this document is
> 
> C2.13: What is the "data"?
> 
> C2.16: Section 2.1 claims to be about "Writing a Simple Query", but doesn't
> seem to provide any guidance on this topic.
> 
> 	2.2 Initial Definitions
> 
> C2.14: There appears to have been quite a number of definitions already?  How,
> then, can this be an "initial" set of definitions?
> 
> 	A query variable is a member of the set V where V is infinite and
> 	disjoint.
> 
> C2.20:  What is V?  Perhaps you mean V to be some arbitrary, but fixed set.
> 
> 	Definition: Graph Pattern
> 	A Graph Pattern is one of:
> 	Basic Graph Pattern
> 	Group Graph Pattern
> 	Value Constraints
> 	Optional Graph Pattern
> 	Union Graph Pattern
> 	RDF Dataset Graph Pattern
> 
> C2.15: Are these all part of simple queries?  If not, what is this doing in
> Section 2?  Ditto for the definition for SPARQL Query.
> 
> 	Definition: SPARQL Query
> 	A SPARQL query is a tuple (GP, DS, SM, R) where:
> 
> C2.16: What, then, are the things in Section 2.1 that contain the SELECT
> keyword?
> 
> 	The following triple pattern has a subject variable (the variable
> 	book), a predicate dc:title and an object variable (the variable
> 	title).
> 
> 	 ?book dc:title ?title .
> 
> C2.17: dc:title does not appear to be valid as any second element of a triple
> pattern.
> 
> 	Definition: Triple Pattern
> 	A triple pattern is member of the set:
> 	(RDF-T union V) x (I union V) x (RDF-T union V)
> 
> C2.18:  How is the syntax above (?book dc:title ?title .) mapped into this set?
> 
> 	This definition of Triple Pattern includes literal subjects.
> 	[...]
> 	This definition also allows blank nodes in the predicate position.
> 
> C2.19:  The referent is too far away for this construction.
> 
> 	Definition: Pattern Solution
> 	A variable solution is a substitution function from a subset of V, the
> 	set of variables, to the set of RDF terms, RDF-T.  
> 	A pattern solution, S, is a variable substitution whose domain includes
> 	all the variables in V and whose range is a subset of the set of RDF
> 	terms.  
> 	The result of replacing every member v of V in a graph pattern P by
> 	S(v) is written S(P).  
> 	If v is not in the domain of S then S(v) is defined to be v.
> 
> C2.21: I thought that V was the set of variables.  Why then write "all the
> variables in V"?
> 
> C2.22: Given that the domain of S is all the variables in V, i.e., all the
> variables, then what use is the last sentence of the above definition?
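
An illustrative sketch of the quoted Pattern Solution definition, not taken
from the draft: it assumes a hypothetical encoding in which terms are plain
strings and variables are the strings starting with "?". It shows that the
clause "if v is not in the domain of S then S(v) is defined to be v" only
has an effect when the domain of S is a proper subset of V, which is the
tension C2.21 and C2.22 point at.

```python
from typing import Dict, Tuple

# Hypothetical encoding: a term is a string; variables start with "?".
Term = str
TriplePattern = Tuple[Term, Term, Term]


def is_variable(term: Term) -> bool:
    """Return True for terms written in the ?name form."""
    return term.startswith("?")


def substitute(solution: Dict[Term, Term], pattern: TriplePattern) -> TriplePattern:
    """Compute S(P) for a single triple pattern P.

    A variable in the domain of S is replaced by its value; a variable
    outside the domain is left as it is, i.e. S(v) = v.
    """
    return tuple(
        solution.get(term, term) if is_variable(term) else term
        for term in pattern
    )


# A substitution whose domain is only {?book}, a proper subset of V.
partial = {"?book": "<http://example.org/book/book1>"}
result = substitute(partial, ("?book", "dc:title", "?title"))
```

Here ?title stays unbound in the result because it is outside the domain of
the substitution; if the domain really were all of V, that case could never
arise.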
> 
> 	has a single triple pattern as the query pattern
> 
> C2.23:  What is the "query pattern" of a query?  Perhaps you mean the graph
> pattern of the query?
> 
> 	An E-entailment regime is a binary relation between subsets of RDF
> 	graphs.
> 
> C2.24: Perhaps you mean "between sets of RDF graphs"?
> 
> 	Definition: Scoping Graph
> 	The Scoping Graph G' for RDF graph G, is an RDF Graph that is
> 	graph-equivalent to G
> 
> C2.25: FATAL: There can be many RDF graphs that are graph-equivalent to a
> particular RDF graph.  Therefore the Scoping Graph is not adequately defined.
> 
> 	The scoping graph makes the graph to be matched independent of the
> 	chosen blank node names.
> 
> C2.25a: Which chosen blank node names?  Why should this matter at all?  Aren't
> the blank node names simply a notational convenience?
> 
> C2.25b: This needs to be proven.
> 
> 	Definition: Basic Graph Pattern E-matching
> 	Given an entailment regime E, a basic graph pattern BGP, and RDF graph
> 	G, with scoping graph G', then BGP E-matches with pattern solution S on
> 	graph G with respect to scoping set B if:
>         - BGP' is a basic graph pattern that is graph-equivalent to BGP
>         - G' and BGP' do not share any blank node labels.
>         - (G' union S(BGP')) is a well-formed RDF graph for E-entailment
>         - G E-entails (G' union S(BGP'))
>         - The RDF terms introduced by S all occur in B.
> 
> C2.26: Some of the elements of the point list are missing punctuation.
> 
> C2.27: FATAL: The status of B is not adequately provided.  Is B a parameter of
> E-matching or is it somehow determined by the other parameters?  
> 
> 	These definitions allow for future extensions to SPARQL.
> 
> C2.28:  Which definitions?
> 
> 	This document defines SPARQL for simple entailment and the scoping set
> 	B is the set of all RDF terms in G'.
> 
> C2.29:  SPARQL for simple entailment?  Probably you mean something like "This
> document only defines the simple entailment version of SPARQL".
> 
> C2.30:  The second half of this sentence does not make any sense.  Perhaps you
> mean something like "The simple entailment version of SPARQL (hereafter
> SPARQL) is based on BGP E-matching where the entailment regime (E) is always
> simple entailment and the scoping set (B) is always the set of RDF terms in
> G'."
> 
> C2.31: FATAL: This still leaves SPARQL matching with the following parameters:
>   1/ the graph pattern BGP
>   2/ the RDF graph G
>   3/ the scoping graph G' (which is not adequately defined)
>   The problem with G' needs to be addressed.
> 
> 	A pattern solution can then be defined as follows: to match a basic
> 	graph pattern under simple entailment, it is possible to proceed by
> 	finding a mapping from blank nodes and variables in the basic graph
> 	pattern to terms in the graph being matched; a pattern solution is then
> 	a mapping restricted to just the variables, possibly with blank nodes
> 	renamed. Moreover, a uniqueness property guarantees the
> 	interoperability between SPARQL systems: given a graph and a basic
> 	graph pattern, the set of all the pattern solutions is unique up to
> 	blank node renaming.
> 
> C2.32: Where is G' in this operation?
> 
> C2.33: It seems to me that SPARQL simple matching is entirely deterministic.
> Given BGP, G, and G', the set of pattern solutions that make BGP match G with
> scope G' is fixed.  I then don't understand the "unique up to blank node
> renaming" above.
> 
> C2.34: If I am missing something here, and there indeed is something to be
> shown, then it has to be proven.
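
A minimal sketch of the informal procedure in the quoted paragraph (find a
mapping from the variables and blank nodes of the basic graph pattern to
terms of the graph, then restrict it to the variables). It is an
illustration under stated assumptions, not the draft's formal E-matching
definition: terms are plain strings, "?x" is a variable, "_:x" is a blank
node, and the matcher works directly on G without any scoping graph G' or
scoping set B, which is exactly the gap C2.32 asks about.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical encoding: "?x" is a variable, "_:x" a blank node label.
Triple = Tuple[str, str, str]


def is_open(term: str) -> bool:
    """Variables and pattern blank nodes may both be mapped to graph terms."""
    return term.startswith("?") or term.startswith("_:")


def match(pattern: List[Triple], graph: Set[Triple]) -> List[Dict[str, str]]:
    """Return the pattern solutions of a basic graph pattern against a graph."""
    mappings: List[Dict[str, str]] = [{}]
    for triple_pattern in pattern:
        extended: List[Dict[str, str]] = []
        for mapping in mappings:
            for triple in sorted(graph):  # sorted only for a stable order
                candidate = dict(mapping)
                consistent = True
                for p_term, g_term in zip(triple_pattern, triple):
                    if is_open(p_term):
                        # Bind on first use; later uses must agree.
                        if candidate.setdefault(p_term, g_term) != g_term:
                            consistent = False
                            break
                    elif p_term != g_term:
                        consistent = False
                        break
                if consistent:
                    extended.append(candidate)
        mappings = extended

    # Restrict each mapping to the variables and drop duplicates.
    solutions: List[Dict[str, str]] = []
    for mapping in mappings:
        solution = {k: v for k, v in mapping.items() if k.startswith("?")}
        if solution not in solutions:
            solutions.append(solution)
    return solutions


graph = {
    ("_:a", "foaf:name", '"Alice"'),
    ("_:b", "foaf:name", '"Bob"'),
}
by_variable = match([("?x", "foaf:name", "?name")], graph)
by_blank_node = match([("_:p", "foaf:name", "?name")], graph)
```

In this sketch the blank node labels in the solutions are whatever labels
the stored graph happens to use, so two stores holding equivalent graphs
with different labels would return solutions that differ only by a
renaming. That is one possible reading of "unique up to blank node
renaming"; whether it is the reading the draft intends is the question
C2.33 raises.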
> 
> 	There is a blank node [..] in this dataset, identified by _:a. 
> 
> C2.34:  What is "dataset"?
> 
> C2.35:  Are there not two blank nodes in this dataset?
> 
> 	In the SPARQL syntax, Basic Graph Patterns are sequences of triple
> 	patterns mixed with value constraints.
> 
> C2.36:  Why not say something like "value constraints can be mixed in sequences
> of triple patterns.  The triple patterns form a BGP."?
> 
> 	The results of a query is
> 
> C2.37: Why not "The result"?
> 
> C2.39: I believe that it would be very useful to show the four matches
> generated by the basic query pattern in Section 2.6 (as well as the two matches
> for the BGP in Section 2.5.3).
> 
> 	Blank nodes in the results of a query are identical to those occurring
> 	in the dataset graphs
> 
> C2.38: This is very misleading.  SPARQL matching does indeed restrict the bnode
> in query results to be bnodes from the RDF graph, but not in a useful way.  For
> example,
>   ?x ex:a ex:b .
> matches against
>   _:a ex:a _:b .
> with two results for ?x, at least as far as I can determine.
> 
> C2.39: I believe that there are four matches for the BGP in Section 2.7.  Why
> are only two results given?
-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/
D3C2 887B 0F92 6005 C541  0875 0F91 96DE 6E52 C29E
Received on Wednesday, 22 March 2006 17:46:52 UTC