editorial comments on SPARQL Query Language for RDF from Fred Zemke on 2006-01-12 (public-rdf-dawg-comments@w3.org from January 2006)

From: Fred Zemke <fred.zemke@oracle.com>
Date: Thu, 12 Jan 2006 09:47:40 -0800
To: public-rdf-dawg-comments@w3.org
Message-ID: <43C6963C.6050105@oracle.com>

2.1.1 Syntax of IRI terms
The word "term" appears here without introduction or definition.
It might help to say that the primary lexical constituent of
triple patterns is terms, of which there are n varieties (fill in
the correct number.  At a cursory glance, I see IRI terms, literal
terms, and variables.  Much later in the specification, blank nodes appear
as another kind of term).


2.1.5, Examples of query syntax
It says "nor do prefix names in queries need to be the same prefixes
as used for data".  But this specification does  not provide a language
for describing or entering data.  Compare with section 2.1.6 "Data
descriptions used in this document" which says that this specification
uses Turtle to portray RDF data.  Perhaps there are other
representations for RDF, some of which also provide prefixes.
Perhaps the statement could be fixed by changing it to "nor do prefix
names in queries need to be the same as prefixes used in some
language for portraying RDF data, such as Turtle."


2.1.7 Result descriptions used in this document
This introduces the phrase "RDF term", which is not defined until
section 2.2 "Initial definitions".  A hot link to the definition
might be useful.

In addition, section 2.1.1 used the word "term", evidently to mean
a lexical token of SPARQL.  This is potentially confusing.  Possibly
one of the following ideas would be useful:
-- change "term" to "token" in section 2.1.1
-- change "term" to "SPARQL term" in section 2.1.1, to create a
contrast with "RDF term".


2.2 Initial definitions
The definition of "query variable" does not mention the lexical
requirements of VARNAME in Appendix A.


2.4 Pattern solutions
It says "The result of replacing every member v of W in a graph pattern
P by S(v) is written S(P)".  But what is S?  I think you mean,
"If S is a pattern solution, then...".  Or you could reword the
sentence "A pattern solution is a substitution function..." to define
S in that sentence.
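
As an illustration only, the notation S(P) can be sketched in Python once
S is explicitly introduced as a pattern solution.  The encoding below
(terms as plain strings, S as a dict, the helper name apply_solution, and
the example IRIs) is an assumption made for this sketch, not something the
specification defines.

```python
# Illustrative sketch: terms are plain strings and a pattern solution S
# is a dict. This encoding is an assumption, not the specification's.

def apply_solution(S, P):
    """Return S(P): every term of P that S maps is replaced by its image."""
    return [tuple(S.get(term, term) for term in triple) for triple in P]

# A graph pattern P made of two triple patterns.
P = [("?x", "foaf:name", "?name"), ("?x", "foaf:mbox", "?mbox")]

# A pattern solution S for the variables of P.
S = {
    "?x": "<http://example.org/alice>",
    "?name": '"Alice"',
    "?mbox": "<mailto:alice@example.org>",
}

for triple in apply_solution(S, P):
    print(" ".join(triple), ".")
```

Terms that S does not map (here the two predicates) are left unchanged,
which is why the sentence needs to say what S is before using S(P).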


2.5 Basic graph patterns
It says "The SPARQL syntax uses the keyword WHERE to introduce the
Query pattern."  But the grammar in Appendix A shows that WHERE is
optional.  The reality appears to be "the SPARQL grammar uses curly
braces to enclose the query pattern.  Optionally the keyword WHERE
may be used immediately prior to the opening curly brace."  An alternative
solution would be to make WHERE mandatory in the grammar.  However,
it would still be good to state that graph patterns are enclosed in
curly braces.


2.6 Multiple matches
It says "The results of a query are all the ways a query can match
the graph being queried".  But you have introduced formal terminology;
why are you not using it?  In your formal terminology, what you mean is
"The result of a query is the set of all pattern solutions that match
the dataset of the query."  It is probably okay to include the informal
translation too.


2.7.1 Blank nodes and queries
There are no examples in this section.  The reader is left with the
impression that section 2.7.2 "Blank nodes and query results" is
intended to provide the examples for section 2.7.1.  But I think
that the two sections are actually orthogonal topics.  Section 2.7.1
talks about blank nodes in the query, such as
SELECT ?x WHERE { _:a foaf:name ?x .}, whereas section 2.7.2 talks
about blank nodes in the result. 

Perhaps the solution is to delete section 2.7.1, which appears to
be completely redundant with section 2.8.3 "Blank nodes".


2.7.1 Blank nodes and queries
It says "A blank node in a query may match any RDF term". 
I think this wording is too loose.  One might think that this means
that a blank node is a wildcard that may match different RDF terms
in different triple patterns.  Example: 
SELECT ?x ?y ?z WHERE { ?x _:a ?y . ?x _:a ?z . }
One might think that one can bind the blank node _:a to one RDF term
in one triple and a different RDF term in a different triple
of the dataset
(as if the example were SELECT ?x ?y ?z WHERE { ?x * ?y . ?x * ?z . }
using * to indicate a wildcard for the verb in each triple).
However, the definition of pattern solution in section 2.4 seems to
indicate that the same mapping of a blank node to an RDF term is
required for each triple pattern.  This should be reiterated here.
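
The difference between the two readings can be sketched with a small
matcher in Python.  This is only an illustration under assumed conventions
(terms as strings, "?" for variables, "_:" for blank node labels, and the
hypothetical helper names bindable and match); it is not the
specification's algorithm.

```python
def bindable(term):
    """Variables (?x) and blank node labels (_:a) can both be bound."""
    return term.startswith("?") or term.startswith("_:")

def match(patterns, data, binding=None):
    """Yield each binding under which every triple pattern occurs in data.

    A label keeps a single binding across all triple patterns, so _:a
    cannot stand for one term in one pattern and another term elsewhere.
    """
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in data:
        trial = dict(binding)
        if all(trial.setdefault(p, d) == d if bindable(p) else p == d
               for p, d in zip(first, triple)):
            yield from match(rest, data, trial)

data = [("ex:s", "ex:p1", "ex:a"), ("ex:s", "ex:p2", "ex:b")]
patterns = [("?x", "_:a", "?y"), ("?x", "_:a", "?z")]
pairs = sorted((b["?y"], b["?z"]) for b in match(patterns, data))
print(pairs)
```

With a consistent mapping for _:a only two (?y, ?z) pairs survive; under
the wildcard reading the mixed pairs (ex:a, ex:b) and (ex:b, ex:a) would
also be solutions.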


2.8.1 Object lists
It would be helpful to show an example that uses both predicate-object
lists and object lists, for example
?x v:erb1 ?z, ?w ; v:erb2 ?r, ?s .
is equivalent to
?x v:erb1 ?z .
?x v:erb1 ?w .
?x v:erb2 ?r .
?x v:erb2 ?s .
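
The equivalence can also be stated mechanically.  The following Python
sketch expands a subject with a predicate-object list into individual
triples; the nested-list representation and the helper name expand are
assumptions made for illustration.

```python
def expand(subject, predicate_object_list):
    """Expand [(verb, [objects...]), ...] into one triple per object."""
    return [(subject, verb, obj)
            for verb, objects in predicate_object_list
            for obj in objects]

# ?x v:erb1 ?z, ?w ; v:erb2 ?r, ?s .
triples = expand("?x", [("v:erb1", ["?z", "?w"]), ("v:erb2", ["?r", "?s"])])
for s, p, o in triples:
    print(s, p, o, ".")
```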


2.8.3 Blank nodes
What is the relationship between this section and section 2.7.1 "Blank
nodes and queries"?
Perhaps they can be combined or one of them can be deleted (probably
section 2.7.1, which has no examples and is completely redundant with
section 2.8.3).


2.8.4 RDF collections
This section is too terse.  Because the example shows an RDF collection
with exactly three items, the reader might infer that
an RDF collection is a triple constructor.  However, the syntax in
Appendix A indicates that much more than a single triple can be written
within an RDF collection.  It would be good to discuss the available
syntactic options, with examples.
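
To make the point concrete, here is a minimal Python sketch of how a flat
collection is conventionally expanded into rdf:first / rdf:rest triples.
The blank node labels _:b0, _:b1, ... and the helper name
expand_collection are hypothetical, and nested collections are not
handled.

```python
def expand_collection(items):
    """Return (head, triples) for the collection ( item1 item2 ... ).

    Each item contributes two triples, so a collection of n items stands
    for 2n triples; the empty collection is just rdf:nil.
    """
    if not items:
        return "rdf:nil", []
    nodes = [f"_:b{i}" for i in range(len(items))]
    triples = []
    for i, item in enumerate(items):
        rest = nodes[i + 1] if i + 1 < len(items) else "rdf:nil"
        triples.append((nodes[i], "rdf:first", item))
        triples.append((nodes[i], "rdf:rest", rest))
    return nodes[0], triples

head, triples = expand_collection(["1", "?x", "3"])
for s, p, o in triples:
    print(s, p, o, ".")
```

A three-item collection therefore denotes six triples linked by blank
nodes, not a single triple.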


2.9 Querying reification vocabulary
typo in second sentence: "...can be queried be..." should read
"...can be queried by...".


3.1.4 Matching with RDF D-entailment
An example would be helpful.  For example, knowing that
"42"^^xsd:integer and "042"^^xsd:integer are equivalent literals, the
query SELECT ?x WHERE { ?x a 42 } will match a triple in a dataset
x a "042"^^xsd:integer.
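
The idea that two lexical forms denote the same value can be sketched in
Python as follows.  The (lexical form, datatype) tuple encoding and the
helper names value_of and same_value are assumptions for illustration,
and only xsd:integer is mapped to a value here.

```python
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"

def value_of(literal):
    """Map a (lexical form, datatype) pair to a comparable value."""
    lexical, datatype = literal
    if datatype == XSD_INTEGER:
        return (datatype, int(lexical))   # "042" and "42" both give 42
    return (datatype, lexical)            # other literals: compare as written

def same_value(a, b):
    """True when the two literals denote the same value in this sketch."""
    return value_of(a) == value_of(b)

print(same_value(("42", XSD_INTEGER), ("042", XSD_INTEGER)))
```

A plain literal "42" has no datatype, so in this sketch it is not treated
as the same value as "42"^^xsd:integer.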


3.2 Value constraints
This topic does not seem to be subordinate to the overall topic of
section 3, "Working with RDF literals".  Perhaps sections 3.2 and 3.3
should be transferred to section 4, "Graph patterns".  Note that section
4 begins with a list of ways to build complex graph patterns, among
which is value constraints, yet value constraints are not described in
any subsection of section 4. 


3.4 Matching values and RDF D-entailment
This section is redundant with section 3.1.4, "Matching
with RDF D-entailment".


4.1 Group graph patterns
The defined term appears to be "group graph pattern".  Consequently
occurrences of "group pattern" should be replaced by "group graph pattern".
(For example, last sentence of first paragraph following the box.)


4.1 Group graph patterns
It would be helpful to move the last sentence, ("In a SPARQL query string,
a group graph pattern is delimited by braces") earlier in this
section.  Before I reached that sentence, I had a very hard time
deciphering the following sentence: "this query has a group pattern
(sic, 'group graph pattern' is meant)
of one basic graph pattern as the query pattern".  It just seemed like
you were running around in circles.  


4.1 Group graph patterns
It would be helpful to show an example with two consecutive group graph
patterns.  The example already in this section is equivalent to
PREFIX foaf: etc
SELECT ?name ?mbox
WHERE { { ?x foaf:name ?name } { ?x foaf:mbox ?mbox } }


5. Including optional values
It says "RDF is semi-structured".  Actually, RDF is highly structured,
especially compared to XML (which is routinely called semi-structured)
since RDF consists entirely of triples.
This makes RDF even more structured than relational databases
(though RDF is weakly typed compared to most relational databases).
This sentence is not necessary and can be deleted.  The
remainder of the paragraph is still true (for example, in relational
database terms, you are talking about outer joins, which are a
highly useful feature that was absent from the earliest formulations
of relational database technology.)  Being structured is irrelevant to the
utility of optional matching.


7. RDF dataset
typo, First sentence: "comprising of" -> "comprising" or
"consisting of".


7. RDF dataset
The last definition, of "RDF dataset graph pattern", items 1 and 2
refer to "dataset {Gi, (<u1>, G1), ... }". This is confusing because
the preceding definition refers to dataset {G, (<u1>, G1), ... }.
The reader is left wondering whether this is a typo, but if so, what
is the role of the <ui>'s and Gi's?  I think what you are trying to say
in items 1 and 2 is that in the second definition, the Gi gets treated
like the default dataset does in the first definition.  But in that
case, why not unravel the logic for the reader?  The first definition
translates matching a pattern P (other than an RDF dataset graph pattern)
down to matching the default graph.  Why not just use that language
in the second definition as well?  The two items would read:
"1. g is an IRI where g = <ui> for some i, and P matches Gi with solution S.
2. g is a variable, S maps the variable g to <ui>, and P matches Gi
with solution S."
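
The two reworded items can be sketched in Python.  Everything in this
sketch is an assumption made for illustration: the dict of named graphs,
the single-triple-pattern matcher match_triple standing in for "P matches
Gi with solution S", and the helper name match_graph.

```python
def match_triple(pattern, graph):
    """Yield one solution (a dict) for each triple of graph fitting pattern."""
    for triple in graph:
        S = {}
        if all(S.setdefault(p, d) == d if p.startswith("?") else p == d
               for p, d in zip(pattern, triple)):
            yield S

def match_graph(g, pattern, named_graphs):
    """Match GRAPH g { pattern } against the named graphs (<ui>, Gi)."""
    for ui, Gi in named_graphs.items():
        if not g.startswith("?") and g != ui:
            continue                  # item 1 requires g = <ui>
        for S in match_triple(pattern, Gi):
            if g.startswith("?") and S.setdefault(g, ui) != ui:
                continue              # item 2 requires S to map g to <ui>
            yield S

named_graphs = {
    "<http://example.org/g1>": [("ex:a", "ex:p", "ex:b")],
    "<http://example.org/g2>": [("ex:c", "ex:p", "ex:d")],
}
pattern = ("?s", "ex:p", "?o")
by_iri = list(match_graph("<http://example.org/g1>", pattern, named_graphs))
by_var = list(match_graph("?g", pattern, named_graphs))
print(by_iri)
print(by_var)
```

In both cases the pattern is matched against Gi exactly as a pattern
would be matched against the default graph in the first definition.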


10.1 Solution sequences and result forms
First sentence: "each solution being a function from variables to
RDF terms".  Actually, you mean, "a function from variables and
blank nodes to RDF terms."  See section 2.4 "Pattern solutions".
(However, my preferred resolution is to remove blank nodes from the language,
as noted in a separate comment.)


10.1.1 Projection
You should also note that blank nodes are always projected out of the
solution sequence.  In terms of section 10.1.2 "DISTINCT", this means
that it is possible to get duplicates in the result even if all
variables are retained.  Example:
SELECT ?x WHERE { [] v:loves ?x }
finds all RDF terms that are the object of
the verb v:loves.  If the dataset consists of
"Bob" v:loves "Alice" .
"Carl" v:loves "Alice" .
Then I think the solution sequence is
{ ([] = "Bob", ?x = "Alice"), ([] = "Carl", ?x = "Alice") }.
After projecting away the blank node, the sequence is { "Alice", "Alice" }.
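
Taking the conjectured solution sequence above as given, the effect of
projection can be sketched in Python.  The dict encoding of solutions,
the key "[]" for the blank node, and the helper name project are
assumptions for illustration.

```python
# The conjectured solution sequence, one dict per solution.
solutions = [
    {"[]": '"Bob"', "?x": '"Alice"'},
    {"[]": '"Carl"', "?x": '"Alice"'},
]

def project(solutions, variables):
    """Keep only the named variables; blank node bindings are dropped."""
    return [tuple(s[v] for v in variables) for s in solutions]

projected = project(solutions, ["?x"])
print(projected)                 # two rows, both for "Alice"
print(len(set(projected)))       # only one distinct row remains
```

The two solutions differ only in the blank node binding, so they become
duplicates once that binding is projected away.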


10.1.3 ORDER BY
Issues in the five-point arbitrary ordering:
1. what is meant by a "plain literal
before an RDF literal with type xsd:string of the same lexical
form"?  The inscrutable terms here are "plain literal", "before" (points
1 through 5 prescribe an ordering, so "before" presumably does not
indicate the ordering, it must mean something else), and "same lexical
form". 
2. is there any order to RDF literals?
Note that there is a paragraph following
the five-point ordering which explains that "IRIs are ordered by
comparing the character strings making up each IRI".
3. Does language tag have any influence on ordering of RDF literals?
4. What is the relative ordering of two literals that have types
of incomparable categories (for example, comparing a numeric and a
dateTime, a numeric and an xsd:string, or a dateTime and an xsd:string)?

My conjectured resolution is that point 5 "A plain literal..."
should be eliminated, and point 4 should be amplified with a follow-on
paragraph to clarify the ordering of all RDF literals.  Such a
follow-on paragraph might say, for example,
"Two RDF literals L1 and L2 are ordered as follows:
1. If L1 and L2 are both numeric, both xsd:dateTime, or both
xsd:string, then they are ordered according to the operator '<' in the
Operator mapping table.
2. Otherwise, let LF1 and LF2 be the lexical forms of L1 and L2
(ie, the portion of the literal enclosed in single or double quotes,
after replacing any escape characters by their equivalents).  LF1 and
LF2 are compared using Unicode code point order, applied lexicographically,
to determine the order of L1 and L2.  The language tags of L1 and L2,
if any, are ignored.  If LF1 = LF2, L1 has no datatype and L2 has
type xsd:string, then L1 < L2."
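
The two proposed rules can be sketched as a comparison function in
Python.  This is a sketch under stated assumptions, not a definitive
implementation: literals are encoded as (lexical form, datatype, language
tag) tuples, only xsd:integer and xsd:decimal are treated as numeric,
dateTime values are parsed without time zones, and the helper names
category, value and compare are hypothetical.

```python
from datetime import datetime
from decimal import Decimal

XSD = "http://www.w3.org/2001/XMLSchema#"
NUMERIC = {XSD + "integer", XSD + "decimal"}   # simplified for the sketch

def category(literal):
    """Return 'numeric', 'dateTime', 'string', or None for anything else."""
    datatype = literal[1]
    if datatype in NUMERIC:
        return "numeric"
    if datatype == XSD + "dateTime":
        return "dateTime"
    if datatype == XSD + "string":
        return "string"
    return None

def value(literal):
    """Map a literal of a known category to a comparable Python value."""
    kind = category(literal)
    if kind == "numeric":
        return Decimal(literal[0])
    if kind == "dateTime":
        return datetime.fromisoformat(literal[0])
    return literal[0]

def compare(l1, l2):
    """Return -1, 0 or 1; 0 means the proposed rules do not decide."""
    k1, k2 = category(l1), category(l2)
    if k1 is not None and k1 == k2:            # rule 1: same category
        v1, v2 = value(l1), value(l2)
        return (v1 > v2) - (v1 < v2)
    lf1, lf2 = l1[0], l2[0]                    # rule 2: lexical forms,
    if lf1 != lf2:                             # code point order, language
        return -1 if lf1 < lf2 else 1          # tags ignored
    if l1[1] is None and l2[1] == XSD + "string":
        return -1                              # plain before xsd:string
    if l2[1] is None and l1[1] == XSD + "string":
        return 1
    return 0

nine = ("9", XSD + "integer", None)
ten = ("10", XSD + "integer", None)
ten_string = ("10", XSD + "string", None)
print(compare(nine, ten), compare(nine, ten_string))
```

Two integers are compared by value under rule 1, while an integer and an
xsd:string fall through to rule 2 and are compared by lexical form, so
"9" sorts before 10 as a number but after "10" as a string.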

This still leaves unanswered what is the relative ordering of the following
pairs:
"12"^^xsd:integer and "12"
"12"^^xsd:integer and "12"^^xsd:string


10.1.3 ORDER BY
The specification uses the ordering of types numeric, dateTime and
xsd:string, but not xsd:boolean.  Maybe this is not a problem, since
"false" precedes "true" in an alphabetic ordering of the xsd:boolean
type anyway.  Still, it raised my eyebrows that the ordering of
xsd:boolean was not used.



10.1.3 ORDER BY
It says "IRIs are ordered by comparing the character strings making up
each IRI".  Fine, but how does this ordering work?  Perhaps it is
Unicode code point order applied lexicographically to the IRI?
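
If code point order is what is intended, the behaviour would be as in the
following Python sketch.  Python compares strings by Unicode code point,
applied lexicographically, so the built-in sort already shows this
ordering; the example IRIs are made up for illustration.

```python
iris = [
    "http://example.org/b",
    "http://example.org/B",
    "http://example.org/a",
]

# Built-in string comparison: lexicographic by Unicode code point.
ordered = sorted(iris)

# The same ordering spelled out explicitly as lists of code points.
ordered_explicit = sorted(iris, key=lambda iri: [ord(c) for c in iri])

print(ordered)
```

Under this ordering upper-case "B" (U+0042) sorts before lower-case "a"
(U+0061), which may surprise readers expecting an alphabetic collation.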


10.1.3 ORDER BY
Do language tags have any role in ordering?  For example, what is the
relative ordering of "the"@en and "the"@fr?


10.2 Selecting variables
It says that "The syntax SELECT * is an abbreviation that selects all
of the named variables".  What is a named variable?  This term is not
defined.  I think all variables have names.
Probably you can just delete "named".


10.3.2 Accessing graphs in the RDF dataset
It might also be interesting to the reader to note that CONSTRUCT
can be used to construct a graph with IRIs that are different from the
IRIs in the input graph.  The technique is to create an xsd:string
corresponding to the desired IRI and cast to IRI type. 


10.3.2 Accessing graphs in the RDF dataset
issues with the definition of graph template:
1. The term "triple pattern" is a poor choice of terminology,
because in most of the specification,
the word "pattern" refers to a pattern to be matched.  A better term
would be "triple template".
2. There is no definition of what S(tj) is.  Perhaps the reader is
supposed to recall the definition
in section 3.3 "Value constraints - definition",
which defines S(C) where C is a constraint.  However, tj is not a
constraint.
It might be a good idea to define (or repeat the definition) of S(tj).


10.4.3 Description of resources
typo, first sentence: "...is the determined by...": delete "the"
typo after second box: "as well information which as name and other...":
possibly you mean "as well as information such as name..." or
possibly "as well as information which has name...".
typo, formal definition, last sentence: "does not proscribe".
"proscribe" means "prohibit"; you want "prescribe", which means
"specify".


11. Testing values
It says "the operands of these functions and operators are the subset
of XML Schema datatypes...".  But the operands are values of these
types, not the types themselves.


11.2.3.1 bound
The text around the examples contains misstatements.  The text
preceding the first sample query says "This query finds the people
without a dc:date property" whereas in fact it finds the people who
do have a dc:date.  The sentence following the second sample query
is also wrong.  It says "Because Alice's mbox was known, "Alice"
was not a solution to the query" but "Alice" does not have a mailbox
and "Alice" is a solution to the query.


11.2.3.3 isBlank
The text before the sample query is a cut-and-paste error, a
duplicate of the text in 11.2.3.2 "isIRI". 


A.2 White space
last sentence "As a hint, rule names below in capitals indicate a
possible choice of terminals".  Who has this possible choice?
There are two consumers of this document: implementers and users.
The entire grammar defines a space of choices for the language user,
so I don't think this sentence is pitched at the user.  I think you
mean "possible choice of terminals for those who are constructing
a SPARQL parser". 


Appendix A.7, Grammar
Rules [43] "Expression" through [51] "UnaryExpression" follow a top-down
pattern in the order of
presentation.  Then rule [51] "UnaryExpression" requires
the definition of PrimaryExpression, which one expects to be
the next BNF.  Actually, PrimaryExpression occurs as rule [58],
and many (though not all) of its constituents appear in rules [52]
"BuiltinCall" through [57] "BracketedExpression".  It would be better
to rearrange the rules in the following order:
[58] "PrimaryExpression"
[57] "BracketedExpression"
[52] "BuiltinCall"
[53] "RegexExpression"
[55] "IRIrefOrFunction"
[56] "ArgList"

As for [54] "FunctionCall", this rule is used in [16] OrderCondition,
but, in a separate comment, I think that arbitrary expressions should
be permitted in OrderCondition, not just function calls.


Appendix D, Collected formal definitions
This appendix is listed in the table of contents but is not present,
not even as a to-be-done item.  I would appreciate having this
appendix.  I suspect that the current formal definitions have
omissions, inconsistencies, etc., but it is very hard to check currently.


No particular location
The terms "solution" and "pattern solution" are both in use in the
document.  For consistency it would be better to pick one of these terms
and use it exclusively.

Fred Zemke
Received on Thursday, 12 January 2006 17:48:02 UTC