Re: comments on Section 1 and Section 2 of SPARQL Query Language for RDF from Eric Prud'hommeaux on 2007-05-18 (public-rdf-dawg-comments@w3.org from May 2007)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Thu, 17 May 2007 17:43:34 -0700
To: "Peter F. Patel-Schneider" <pfps@research.bell-labs.com>
Cc: public-rdf-dawg-comments@w3.org
Message-ID: <20070518004333.GC4091@w3.org>
The Data Access Working Group is ready to bring SPARQL Query to
Candidate Recommendation. The objections posted by Peter F.
Patel-Schneider pertain to parts of the language that have changed
since the last CR transition. We hope PFPS will agree to the language
changes, withdraw his objection, and help us with editorial updates
during the Candidate Recommendation phase.

Dear Peter,

It has been 15 months since your comments, and we have reorganized the
document substantially, hopefully in ways that address your comments.
(Please see section 12 to see the aggregated definitions and note that
section 2 is now informative.) I have responded to many of your
comments with "[gone]". Others are marked with "[definitions
replaced]". These annotations are sprinkled throughout this reply with
the goal of responding to each comment.

I have drafted text to address your editorial comments and will
propose it to the working group after the transition to CR. None of
these changes affect the semantics of the query language as understood
by the working group.

There have been some changes to the entailment regime in the past
year. Your technical comments (both numbered C2.39) should be
addressed by the new semantics. If you wish to pursue either the
editorial or technical comments, we should split out the thread as
the distinction is important to the W3C publication process.

* Peter F. Patel-Schneider <pfps@research.bell-labs.com> [2006-02-22 18:56-0500]
> Comments on Section 1 and Section 2 of
> 
> 	SPARQL Query Language for RDF
> 	W3C Working Draft 20 February 2006
> 	http://www.w3.org/TR/2006/WD-rdf-sparql-query-20060220/
> 
> 
> These are personal comments, from me, an interested expert.  They may not
> reflect the views of any institution to which I am associated.
> 
> 
> In general I found the first two sections of the document *very* hard to
> understand.  The mixing of definitions, explanation, information, etc. confused
> me over and over again.  I strongly suggest an organization something like:
> 
>   Introduction (informative)
>   Formal development (normative)
>     Underlying notions (normative)
>     Patterns and matching (normative)
>   SPARQL syntax (normative)
>   Informal narrative (informative)
>   Examples (informative)

We have re-organized the document substantially, including gathering
the formal semantics, but continue to have the descriptive text before
formal semantics.

> I also found that things that didn't need to be explained were explained, and
> things that did need to be explained were not explained.  A major example of
> the latter is the role of the scoping graph.  Examples showing why E-matching
> is defined the way it is would be particularly useful.

E-matching has been removed from the SPARQL semantics.

> Because of the problems I see in Section 2, I do not feel that I can adequately
> understand the remainder of the document.  
> 
> Because of these problems I do not feel that this document should be advanced
> to the next stage in the W3C recommendation process without going through
> another last-call stage.  (This could however be performed by terminating the
> current last call, quickly fixing the document, and starting another last
> call.)
> 
> 
> 
> Specific comments follow:
> 
> Section 1.
> 
> 	An RDF graph is a set of triples; each triple consists of a
> 	<em>subject</em>, a <em>predicate</em> and an <em>object</em>. This is
> 	defined in RDF Concepts and Abstract Syntax.
> 
> C1.1: An unqualified "this" cannot be used at the beginning of the second sentence.

[gone]

> 	The RDF graph may be virtual, in that it is not fully materialized,
> 
> C1.2: Defining virtual in terms of another term that is not itself defined is not
> very useful.

[gone]

> 	only doing the work needed for each query to execute.
> 
> C1.3: Who is doing what work here?

[gone]

> 	SPARQL is a query language for getting information from such RDF
> 	graphs. 
> 
> C1.4: Surely a more formal tone is called for here.

[gone]

> 	It provides facilities to:
> 	- extract information in the form of URIs, blank nodes, plain and typed
> 	literals.
> 	- extract RDF subgraphs.
> 	- construct new RDF graphs based on information in the queried graphs.
> 
> C1.5: I don't recognize the intent of SPARQL in any of these options.

[gone]

> 	As a data access language, it is suitable for both local and remote
> 	use. 
> 
> C1.6: The "it" is rather too far from its referent.

[gone]

> 	The companion SPARQL Protocol for RDF document describes the remote
> 	access protocol.
> 
> C1.7: What about the "local" access protocol?  Is there one?  If so, where is it?  If
> not, why is there not one?

It is common practice to use remote protocols like HTTP, SMTP or SSH
with both the client and the server on the same machine, sometimes not
even on a network. Defining a remote access protocol enables local
access at the same time.

The DAWG was not chartered to produce an alternative local-only
protocol or API.

> 	<!-- Commented Document Outline -->
> 
> C1.8: There appears to be significant commented-out portions of the document.  Do
> such parts of the document have any import?  If so, then they probably should
> not be commented-out.  If not, then the commented-out portions should be
> removed.

[gone]

> 
> Section 2.
> 
> C2.15: In general, Section 2 switches modes much too much.  Which parts of
> Section 2 are tutorial?  Which are definitional?  Which are explanatory?

Section 2 has been entirely re-written. Please revisit it.

> 	The SPARQL query language is based on matching graph patterns.
> 
> C2.1: What is a "matching graph pattern"?  I do not believe that it is defined
> in the remainder of the document.  (Yes, yes, I know that the problem is
> actually that the sentence itself is poorly constructed.)

Would you be content with striking "matching" after the CR is published?
  [[
  The SPARQL query language is based on graph patterns.
  ]]

> 	The simplest graph pattern is the triple pattern, which is like an RDF
> 	triple, but with the possibility of a variable instead of an RDF term
> 	in the subject, predicate or object positions.

[gone] now:
  [[
  Triple patterns are like RDF triples, but with the option of a query
  variable in place of RDF terms in the subject, predicate or object
  positions.
  ]]

> C2.4: This should probably be stated more precisely, using, at least "and/or".
> 
> 	Combining triple gives a basic graph pattern, where an exact match to a
> 	graph is needed to fulfill a pattern.
> 
> C2.2: Probably "triple" should be "triples".
> 
> C2.3: I do not believe that this matches the intent of SPARQL queries.

(for C2.2-4) I propose
  [[
  The SPARQL query language matches RDF data against graph
  patterns. The simplest type of graph pattern is a basic graph
  pattern, which consists of triple patterns. Triple patterns are like
  RDF triples, but may contain query variables in place of RDF Terms
  in the subject, predicate, and/or object positions.
  ]]

> 	The example below shows a SPARQL query to find the title of a book from
> 	the information in the given RDF graph.
> 
> C2.5: The use of "the given" here is not helpful.  I feel that it would be better
> to use an indefinite article instead.

The input RDF graph for this example is identified by this text:
  [[
  Data:

  <http://example.org/book/book1> <http://purl.org/dc/elements/1.1/title> "SPARQL Tutorial" .
  ]]
An indefinite article could imply that applying the example query to
*any* RDF graph would produce the example result.
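
To make that point concrete, here is a minimal Python sketch of matching
one triple pattern against the example data and against a different
(empty) graph. It is a toy matcher, not a SPARQL implementation; the
function name match_triple_pattern and the string encoding of terms are
assumptions made for illustration only.

```python
# Sketch only: a toy matcher for a single triple pattern, not a SPARQL engine.
# Terms are plain strings; strings starting with "?" are treated as variables.

DATA = [
    ("<http://example.org/book/book1>",
     "<http://purl.org/dc/elements/1.1/title>",
     '"SPARQL Tutorial"'),
]

PATTERN = ("<http://example.org/book/book1>",
           "<http://purl.org/dc/elements/1.1/title>",
           "?title")


def match_triple_pattern(pattern, graph):
    """Return one {variable: term} solution per triple that matches pattern."""
    solutions = []
    for triple in graph:
        binding = {}
        for p_term, g_term in zip(pattern, triple):
            if p_term.startswith("?"):
                # A variable used twice in the pattern must bind to one term.
                if binding.setdefault(p_term, g_term) != g_term:
                    break
            elif p_term != g_term:
                break
        else:
            # No position failed, so this triple yields a solution.
            solutions.append(binding)
    return solutions


print(match_triple_pattern(PATTERN, DATA))  # one solution binding ?title
print(match_triple_pattern(PATTERN, []))    # a different graph: no solutions
```

The same pattern gives different results on different graphs, which is
why the example names one specific input graph.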

> 	The terms delimited by "<>" are IRI references [...].  They stand for
> 	IRIs, either directly, or relative to a base IRI.
> 
> C2.6: What is a term?  Which terms?  What does "stand for" mean here?  What
> role does the base IRI play in this "stand for" relationship?

I propose this introduction to 4.1.1:
  [[
  Production [67] IRIref designates the set of IRIs [RFC3987]; IRIs
  are a generalization of URIs [RFC3986] and are fully compatible with
  URIs and URLs.  Production [68] PrefixedName designates a prefixed
  name. The mapping from a prefixed name to an IRI is described below.
  IRI references (relative or absolute IRIs) are designated by
  production [70] IRI_REF, where the '<' and '>' delimiters do not
  form part of the IRI reference.  Relative IRIs match the
  irelative-ref reference in section 2.2 ABNF for IRI References and
  IRIs in [RFC3987] and are resolved to IRIs as described below.
  ]]
It has a couple of in-chapter forward references, but I think it's pretty exact
and preserves the flow.

> C2.7: The rules for IRIs are not adequately specified in Section 2.1.1.  Are
> the two abbreviated mechanisms enclosed in "<>"?

Looking at http://www.w3.org/2001/sw/DataAccess/rq23/rq25#QSynIRI , I
think this has been addressed by adding the grammar, Prefixed Names
and Relative IRIs sections.

>                                                   Can a prefix expand to a
> relative IRI?

Looking at the mechanism, I don't see how that is possible. Prefixed
names concatenate the local name to the namespace associated with the
prefix. The namespace is an IRI reference, which is resolved against
the base IRI.
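
A minimal Python sketch of that mechanism follows. The function name
expand_prefixed_name, the prefix table and the base IRI are all
hypothetical, and urllib.parse.urljoin (RFC 3986-style reference
resolution) is used here only as a stand-in for the RFC 3987 resolution
the grammar refers to.

```python
from urllib.parse import urljoin


def expand_prefixed_name(prefixed_name, prefixes, base_iri):
    """Expand 'prefix:local' to an IRI (illustrative sketch only)."""
    prefix, _, local = prefixed_name.partition(":")
    # The namespace is an IRI reference, so resolve it against the base first.
    namespace = urljoin(base_iri, prefixes[prefix])
    # Then concatenate the local name onto the resolved namespace.
    return namespace + local


# Hypothetical declarations: one absolute namespace, one relative namespace.
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "book": "book/",
}
BASE = "http://example.org/"

print(expand_prefixed_name("dc:title", PREFIXES, BASE))
print(expand_prefixed_name("book:book1", PREFIXES, BASE))
```

Because the namespace is resolved before the local name is appended,
even a namespace declared with a relative IRI reference expands to an
absolute IRI in this sketch.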

> 	optional datatype IRI or prefixed name (introduced by ^^)
> 
> C2.8: Can this be a relative IRI?  Is it expanded using the rules of
> Section 2.1.1?

No, yes. The grammar identifies this as an IRI_REF, to which RFC 3986
applies.

> 	Variables in SPARQL queries have global scope; it is the same variable
> 	everywhere in the query that the same name is used
> 
> C2.9:  Wrong number agreement.

[gone] now:
  [[
  Query variables in SPARQL queries have global scope; use of a given
  variable name anywhere in a query identifies the same variable.
  ]]
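
One consequence of global scope can be sketched in Python: two partial
solutions can only be combined when they agree on every variable name
they share. The helper merge_if_compatible and the price data are
hypothetical and are not taken from the specification.

```python
# Sketch only: solutions are dicts from variable name to RDF term (as strings).
# The data values below are hypothetical, not taken from the specification.

def merge_if_compatible(left, right):
    """Merge two solutions, or return None if a shared variable disagrees."""
    for name in left.keys() & right.keys():
        if left[name] != right[name]:
            return None
    return {**left, **right}


titles = [
    {"?book": "<http://example.org/book/book1>", "?title": '"SPARQL Tutorial"'},
]
prices = [
    {"?book": "<http://example.org/book/book1>", "?price": "42"},
    {"?book": "<http://example.org/book/book2>", "?price": "23"},
]

joined = []
for t in titles:
    for p in prices:
        merged = merge_if_compatible(t, p)
        if merged is not None:
            joined.append(merged)

print(joined)  # only the book1 rows combine, because ?book is one variable
```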

> 	Blank nodes are indicated by either the form _:a or use of [ ].
> 
> C2.10: Is _:a the *only* blank node allowed?  If not, which parts of these bits
> of syntax can vary, and how?

The grammar excerpt at the bottom of 4.1.4 grounds the illustrative
text above.

> 	Triple Patterns are written as a list of subject, predicate, object; 
> 
> C2.11: The examples of triple patterns don't seem to be written this way.

This text is now in section 4.2. How about
  [[
  Triple Patterns are written as a white-space separated list of a
  subject, predicate and object;
  ]]
?

> 	The following examples express the same query: 
> 	[several examples]
> 	Prefixes are syntactic: the prefix name does not affect the query, nor
> 	do prefix names in queries need to be the same prefixes as used in a
> 	serialization of the data. The following query is equivalent to the
> 	previous examples and will give the same results when applied to the
> 	same data:
> 	[one example]
> 
> C2.12: The first group of examples appears to exhibit more internal variability
> than the single example adds.  Why, then, is the single example broken out?  Is
> there something that I am missing here?

[gone] that text and the other example have been removed.

> 	The data format used in this document is
> 
> C2.13: What is the "data"?

The data provided in each example.

> C2.16: Section 2.1 claims to be about "Writing a Simple Query", but doesn't
> seem to provide any guidance on this topic.

What would you expect that you don't find here?

> 	2.2 Initial Definitions
> 
> C2.14: There appears to have been quite a number of definitions already?  How,
> then, can this be an "initial" set of definitions?

These definitions are now the first definitions in section 12.

> 	A query variable is a member of the set V where V is infinite and
> 	disjoint.
> 
> C2.20:  What is V?  Perhaps you mean V to be some arbitrary, but fixed set.

[definitions replaced]

> 	Definition: Graph Pattern
> 	A Graph Pattern is one of:
> 	Basic Graph Pattern
> 	Group Graph Pattern
> 	Value Constraints
> 	Optional Graph Pattern
> 	Union Graph Pattern
> 	RDF Dataset Graph Pattern
> 
> C2.15: Are these all part of simple queries?  If not, what is this doing in
> Section 2?  Ditto for the definition for SPARQL Query.

[definitions replaced]

> 	Definition: SPARQL Query
> 	A SPARQL query is a tuple (GP, DS, SM, R) where:
> 
> C2.16: What, then, are the things in Section 2.1 that contain the SELECT
> keyword?

SPARQL query strings

> 	The following triple pattern has a subject variable (the variable
> 	book), a predicate dc:title and an object variable (the variable
> 	title).
> 
> 	 ?book dc:title ?title .
> 
> C2.17: dc:title does not appear to be valid as any second element of a triple
> pattern.

[gone] though there are many examples with a predicate of dc:title.
What leads you to believe that dc:title is not a valid 2nd element of
a triple pattern?

> 	Definition: Triple Pattern
> 	A triple pattern is member of the set:
> 	(RDF-T union V) x (I union V) x (RDF-T union V)
> 
> C2.18:  How is the syntax above (?book dc:title ?title .) mapped into this set?

[definitions replaced]

> 	This definition of Triple Pattern includes literal subjects.
> 	[...]
> 	This definition also allows blank nodes in the predicate position.
> 
> C2.19:  The referent is too far away for this construction.

[definitions replaced]

> 	Definition: Pattern Solution
> 	A variable solution is a substitution function from a subset of V, the
> 	set of variables, to the set of RDF terms, RDF-T.  
> 	A pattern solution, S, is a variable substitution whose domain includes
> 	all the variables in V and whose range is a subset of the set of RDF
> 	terms.  
> 	The result of replacing every member v of V in a graph pattern P by
> 	S(v) is written S(P).  
> 	If v is not in the domain of S then S(v) is defined to be v.
> 
> C2.21: I thought that V was the set of variables.  Why then write "all the
> variables in V"?

[definitions replaced]

> C2.22: Given that the domain of S is all the variables in V, i.e., all the
> variables, then what use is the last sentence of the above definition?

[definitions replaced]

> 	has a single triple pattern as the query pattern
> 
> C2.23:  What is the "query pattern" of a query?  Perhaps you mean the graph
> pattern of the query?

@@ a query pattern is (appears to be?) the pattern in a WHERE
clause. 
@@ needs more explanation/excuse

  propose to add:
  [[
  The outer-most graph pattern in a query is called the query
  pattern. It is grammatically identified by GroupGraphPattern in
    [13] WhereClause ::= 'WHERE'? GroupGraphPattern
  ]]
  just above 5.1

> 	An E-entailment regime is a binary relation between subsets of RDF
> 	graphs.
> 
> C2.24: Perhaps you mean "between sets of RDF graphs"?

[definitions replaced]

> 	Definition: Scoping Graph
> 	The Scoping Graph G' for RDF graph G, is an RDF Graph that is
> 	graph-equivalent to G
> 
> C2.25: FATAL: There can be many RDF graphs that are graph-equivalent to a
> particular RDF graph.  Therefore the Scoping Graph is not adequately defined.

[definitions replaced]

> 	The scoping graph makes the graph to be matched independent of the
> 	chosen blank node names.
> 
> C2.25a: Which chosen blank node names?  Why should this matter at all?  Aren't
> the blank node names simply a notational convenience?

[definitions replaced]

> C2.25b: This needs to be proven.

[definitions replaced]

> 	Definition: Basic Graph Pattern E-matching
> 	Given an entailment regime E, a basic graph pattern BGP, and RDF graph
> 	G, with scoping graph G', then BGP E-matches with pattern solution S on
> 	graph G with respect to scoping set B if:
>         - BGP' is a basic graph pattern that is graph-equivalent to BGP
>         - G' and BGP' do not share any blank node labels.
>         - (G' union S(BGP')) is a well-formed RDF graph for E-entailment
>         - G E-entails (G' union S(BGP'))
>         - The RDF terms introduced by S all occur in B.
> 
> C2.26: Some of the elements of the point list are missing punctuation.

[definitions replaced]

> C2.27: FATAL: The status of B is not adequately provided.  Is B a parameter of
> E-matching or is it somehow determined by the other parameters?

[definitions replaced]

> 	These definitions allow for future extensions to SPARQL.
> 
> C2.28:  Which definitions?

[definitions replaced] (Note, we still allow for extensions as
described in 12.6 .)

> 	This document defines SPARQL for simple entailment and the scoping set
> 	B is the set of all RDF terms in G'.
> 
> C2.29:  SPARQL for simple entailment?  Probably you mean something like "This
> document only defines the simple entailment version of SPARQL".

[gone]

> C2.30:  The second half of this sentence does not make any sense.  Perhaps you
> mean something like "The simple entailment version of SPARQL (hereafter
> SPARQL) is based on BGP E-matching where the entailment regime (E) is always
> simple entailment and the scoping set (B) is always the set of RDF terms in
> G'.

[gone]

> C2.31: FATAL: This still leaves SPARQL matching with the following parameters:
>   1/ the graph pattern BGP
>   2/ the RDF graph G
>   3/ the scoping graph G' (which is not adequately defined)
>   The problem with G' needs to be addressed.
> 
> 	A pattern solution can then be defined as follows: to match a basic
> 	graph pattern under simple entailment, it is possible to proceed by
> 	finding a mapping from blank nodes and variables in the basic graph
> 	pattern to terms in the graph being matched; a pattern solution is then
> 	a mapping restricted to just the variables, possibly with blank nodes
> 	renamed. Moreover, a uniqueness property guarantees the
> 	interoperability between SPARQL systems: given a graph and a basic
> 	graph pattern, the set of all the pattern solutions is unique up to
> 	blank node renaming.

[definitions replaced]

> C2.32: Where is G' in this operation?

[definitions replaced]

> C2.33: It seems to me that SPARQL simple matching is entirely deterministic.
> Given BGP, G, and G', the set of pattern solutions that make BGP match G with
> scope G' is fixed.  I then don't understand the "unique up to blank node
> renaming" above.

[definitions replaced]

> C2.34: If I am missing something here, and there indeed is something to be
> shown, then it has to be proven.

[definitions replaced]

> 	There is a blank node [..] in this dataset, identified by _:a.
> 
> C2.34:  What is "dataset"?

The term "RDF Dataset" is now introduced in chapter 8; no forward
references are necessary.

> C2.35:  Are there not two blank nodes in this dataset?

[gone]

> 	In the SPARQL syntax, Basic Graph Patterns are sequences of triple
> 	patterns mixed with value constraints.
> 
> C2.36:  Why not say something like "value constraints can be mixed in sequences
> of triples patterns.  The triple patterns form a BGP."?

I will propose that to the group after CR.

> 	The results of a query is
> 
> C2.37: Why not "The result"?

[gone]

> C2.39: I believe that it would be very useful to show the four matches
> generated by the basic query pattern in Section 2.6 (as well as the two matches
> for the BGP in Section 2.5.3).

The working group feels that the current semantics applied to the query in
  http://www.w3.org/TR/2007/WD-rdf-sparql-query-20070326/#MultipleMatches
produce the two matches shown. Do you disagree?

> 	Blank nodes in the results of a query are identical to those occurring
> 	in the dataset graphs
> 
> C2.38: This is very misleading.  SPARQL matching does indeed restrict the bnode
> in query results to be bnodes from the RDF graph, but not in a useful way.  For
> example,
>   ?x ex:a ex:b .
> matches against
>   _:a ex:a _:b .
> with two results for ?x, at least as far as I can determine.

[gone] now:
  [[
  Blank node labels are scoped to a result set (as defined in "SPARQL
  Query Results XML Format") or, for the CONSTRUCT query form, the
  result graph. Use of the same label within a result set indicates
  the same blank node.
  ]]
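
The scoping rule can be sketched in Python as follows. This is an
illustration under stated assumptions, not how any particular
implementation serializes results: the BlankNode class, the function
label_result_set and the "_:bN" label scheme are all hypothetical.

```python
# Sketch only: blank nodes are opaque objects; labels are assigned per result set.

class BlankNode:
    """An opaque blank node; identity is object identity, not a label."""


def label_result_set(solutions):
    """Replace blank nodes with labels that are local to this result set."""
    labels = {}  # blank node -> label, valid for this result set only
    rows = []
    for solution in solutions:
        row = {}
        for var, term in solution.items():
            if isinstance(term, BlankNode):
                # Reuse the label if this node was already seen in this set.
                term = labels.setdefault(term, "_:b%d" % len(labels))
            row[var] = term
        rows.append(row)
    return rows


x, y = BlankNode(), BlankNode()
first = label_result_set([{"?who": x}, {"?who": y}, {"?who": x}])
second = label_result_set([{"?who": y}])

print(first)   # rows 1 and 3 share a label because they share a blank node
print(second)  # this label is unrelated to the labels used in `first`
```

Within one result set a repeated label means the same blank node; a
label that appears in two different result sets says nothing about
whether the nodes are the same.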

> C2.39: I believe that there are four matches for the BGP in Section 2.7.  Why
> are only two results given?

The working group feels that the current semantics applied to the query in
  http://www.w3.org/2001/sw/DataAccess/rq23/rq25.html#BlankNodesInResults
produce the two matches shown.
-- 
-eric

office: +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
mobile: +1.617.599.3509

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.
Received on Friday, 18 May 2007 00:44:26 UTC