Query review, part 2 (ACTION-546) from Gregory Williams on 2011-12-06 (public-rdf-dawg@w3.org from October to December 2011)

From: Gregory Williams <greg@evilfunhouse.com>
Date: Tue, 6 Dec 2011 07:40:53 -0500
To: SPARQL Working Group <public-rdf-dawg@w3.org>
Message-Id: <70AB0E03-8AF8-4371-8E10-7D2A395AE5D8@evilfunhouse.com>
Here's the second part of my review of the query document. I had a few questions about aggregate handling that aren't dealt with here because I see things are still being changed in that section. As soon as it stabilizes, hopefully my questions will simply be resolved.

thanks,
.greg



=== 18.1.3 (Definition: RDF Dataset Merge)

s/equal to <uk> N2/equal to <uk> in N2/

=== 18.1.7

"We call the object of tn the end of the path."
This depends on n = length(ST)-1, but I don't see that being defined anywhere.

=== 18.1.10

"A SPARQL Abstract Query is a tuple (E, DS, QF)"
Is QF meant to be simply one of { SELECT, CONSTRUCT, ASK, DESCRIBE }? If so, does the abstract query contain either the CONSTRUCT pattern or the DESCRIBE list (where relevant)?

=== 18.2

"Property path expressions are written to produce triple patterns and introduce four forms, ZeroLengthPath, ZeroOrMorePath, OneOrMorePath, and NegatedPropertySet."
I don't really understand this. "to produce triple patterns" sounds like a discussion of only fixed-length property paths, but the "forms" discussed are for property paths that aren't simply equivalent to triple patterns. Also, it's not clear what "form" is meant to mean here as the subsequent text calls them "symbols in the SPARQL algebra."

I see you added Group and AggregateJoin to the list of algebra symbols, but I think the table is still missing Aggregation.

=== 18.2.1

In the in-scope rules table, the rule for "Group { P1 P2 ... }" is formatted in a way that makes "Group" seem like SPARQL syntax, but I believe it's meant to just convey the syntax form for a GGP, right?

The rule for "SERVICE term {P}" seems to allow "term" to be a variable, but that's not going to be part of the federation spec, right?

Some of the in-scope table entries seem to describe the *condition* for when the variable is in-scope (such as when "v occurs in the BGP"), but others seem to simply describe the in-scope rule:
- "v is in-scope" for the "(expr AS v)" form
- "v is in-scope if v is mentioned as a project variable" for the "SELECT ..v.. { P }" form
- "v is in-scope if v is in varlist" for the "BINDINGS varlist (values)" form

=== 18.2.2

"Applying the simpification step after all the translation of graph patterns is the preferred reading."
Should this sentence be inside the red-outlined note box?

=== 18.2.2.5

"We introduce the following symbols: Join(Pattern, Pattern), LeftJoin(Pattern, Pattern, expression), Filter(expression, Pattern)"
Is this necessary? The already-used symbols Union, Graph, and exists aren't explicitly introduced in this way, but all of these are mentioned in the table in 18.2.

"Let G := the empty pattern, Z, a basic graph pattern which is the empty set."
I can't parse this sentence. Z doesn't seem to be used in this definition.

=== 18.2.3 Examples of Mapped Graph Patterns

Should this section be marked as informative as it's just examples?

"The second form of a rewrite example is the first with empty group joins removed by the simplification step."
I'm not sure I understand this sentence.

"BGP( ?s :p1 ?v1 .?s :p2 ?v2 )"
The whitespace is odd in this syntax, but I'm more curious about the choice of '.' as a separator for triples in the serialization of the BGP algebra.

"""
Union( 
    Union( BGP(?s :p1 ?v1) ,
           BGP(?s :p2 ?v2),
    BGP(?s :p3 ?v3))
"""
The parens don't balance here.

"""
LeftJoin(
    Join(Z, BGP(?s :p1 ?v1)),
    Join(Z, BGP(?s :p2 ?v2)) ),
    true)
"""
The parens don't balance here.


"{ ?s :p1 ?v1 FILTER (?v1 < 3 ) OPTIONAL {?s :p2 ?v2} } }"
The braces don't balance here.
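
The three balance problems above can be confirmed mechanically. The following is only a rough sketch (the helper name `first_imbalance` is made up for illustration), and it checks nothing beyond '(' / ')' and '{' / '}' nesting:

```python
def first_imbalance(text):
    """Return None if brackets balance, else a description of the first problem."""
    pairs = {")": "(", "}": "{"}
    stack = []  # (opening character, offset) for every bracket still open
    for pos, ch in enumerate(text):
        if ch in "({":
            stack.append((ch, pos))
        elif ch in pairs:
            if not stack or stack[-1][0] != pairs[ch]:
                return f"unexpected {ch!r} at offset {pos}"
            stack.pop()
    if stack:
        ch, pos = stack[-1]
        return f"unclosed {ch!r} opened at offset {pos}"
    return None


# The three snippets quoted from 18.2.3, copied as they appear in the draft.
union_example = """Union(
    Union( BGP(?s :p1 ?v1) ,
           BGP(?s :p2 ?v2),
    BGP(?s :p3 ?v3))"""
leftjoin_example = """LeftJoin(
    Join(Z, BGP(?s :p1 ?v1)),
    Join(Z, BGP(?s :p2 ?v2)) ),
    true)"""
group_example = "{ ?s :p1 ?v1 FILTER (?v1 < 3 ) OPTIONAL {?s :p2 ?v2} } }"

for name, snippet in [("Union", union_example),
                      ("LeftJoin", leftjoin_example),
                      ("group", group_example)]:
    print(name, "->", first_imbalance(snippet))
```

The Union snippet is missing a closing paren, while the LeftJoin and group snippets each have one closer too many.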

=== 18.2.4.1

"If the GROUP BY keyword is used, or there is implicit grouping due to the use of aggregates in the projection..."
Is it possible to have an implicit grouping based on the use of aggregates in only the HAVING clause, and not the projection?

"It divides the solution into groups of one or more solutions..."
s/the solution/the solutions/ or /the solution set/

=== 18.2.4.2

Is the algorithm given in this section redundant with the end of the algorithm given in 18.2.4.1?

=== 18.2.4.3

What is 'M' in this section? I think I can figure it out from context, but I think it should be made explicit.

=== 18.2.4.4

In the algorithm in this section, 'union' is spelled out, but earlier (e.g. in 18.2.2.5) the union character (U+222A) is used.

"variable must not appear in VS; if it does then generate a syntax error and stop"
I think this should also prevent the variable from appearing in P (the list of already projected variables).

=== 18.2.5

"The solution modifiers are applied to a query in the following order: ... Offset, Limit."
I'm not sure what applying "to a query" means. Maybe 'applied to a solution sequence'?
Also, if we're talking about applying the modifiers to a solution set/sequence, then Offset/Limit should instead be the single Slice operation as OFFSET/LIMIT are just syntactic expressions of the same modifier.
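
To illustrate the Slice point, here is a minimal sketch that assumes a solution sequence is just a Python list. The function name `slice_solutions` and its parameters are mine, not the document's:

```python
def slice_solutions(solutions, start=0, length=None):
    """Apply OFFSET (start) and LIMIT (length) as one operation on a sequence.

    length=None stands for "no LIMIT given".
    """
    end = None if length is None else start + length
    return solutions[start:end]


rows = [{"n": i} for i in range(10)]

# OFFSET 2 LIMIT 3
print(slice_solutions(rows, start=2, length=3))
# OFFSET 8 with no LIMIT
print(slice_solutions(rows, start=8))
```

Seen this way, OFFSET and LIMIT are two parameters of a single modifier rather than two separately ordered steps.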

=== 18.2.5.2

"The set of projection variables was calculated in the processing of SELECT expressions."
Can a link be added to section 18.2.4.4?

"where vars is the set of variables mentioned in the SELECT clause or all named variables that are in-scope in the query if SELECT * used."
I think the wording here should include the variable P that is constructed in 18.2.4.4. Otherwise, I think "mentioned in the SELECT clause" might be ambiguous.

=== 18.3

Definition: Compatible Mappings
"μ1(v) = μ2(v)"
Has this syntax been introduced before to mean the term mapped to variable v in μ?
Also, do we need to be explicit about what the equality operation is doing here (i.e. is it sameTerm, entailment-based, etc.)?
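
A small sketch of why the choice of equality matters, assuming solution mappings are modelled as Python dicts from variable names to term strings. The names `compatible` and `integer_value_equal`, and the string encoding of literals, are made up for illustration:

```python
def compatible(mu1, mu2, same_term=lambda a, b: a == b):
    """True if the two mappings agree on every variable they share."""
    shared = mu1.keys() & mu2.keys()
    return all(same_term(mu1[v], mu2[v]) for v in shared)


def integer_value_equal(a, b):
    """Hypothetical value-based comparison for the xsd:integer strings used here."""
    suffix = "^^xsd:integer"
    if a.endswith(suffix) and b.endswith(suffix):
        return int(a[:-len(suffix)].strip('"')) == int(b[:-len(suffix)].strip('"'))
    return a == b


mu_a = {"x": ":alice", "n": '"1"^^xsd:integer'}
mu_b = {"x": ":alice", "m": ":bob"}
mu_c = {"n": '"01"^^xsd:integer'}

print(compatible(mu_a, mu_b))                       # shared variable x agrees
print(compatible(mu_a, mu_c))                       # term identity: "1" vs "01"
print(compatible(mu_a, mu_c, integer_value_equal))  # value equality: 1 == 1
```

The last two calls give different answers for the same pair of mappings, which is why I think the definition should say which comparison is intended.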

=== 18.3.2

"Since SPARQL treats blank node identifiers in a SPARQL Query Results XML Format document..."
This should be generalized to include the other result formats.

=== 18.4 (Definition: Diff)

"Let Ω1 and Ω2 be multisets of solution mappings."
The definition for LeftJoin also includes "and expr be an expression". Should it be included here?

=== 18.4 (Definition: Evaluation of NegatedPropertySet)

Is there a reason the title of this section uses "Evaluation of" when the other path operators don't?

"... and write μ' as the extension of a solution mapping such that μ'(μ,x) = μ(x) if x is a variable and μ'(μ,t) = t if t is an RDF term;"
I don't understand this as written, and don't see anything that would indicate the actual definition of a negated property set. Also, I wonder why the sentence ends in a semicolon.

=== 18.4 (Definition: Extend)

"expr be an expression"
This links to section 17 for 'expression', but other uses of this phrase don't include the link.

=== 18.4

"Write [x | C] for a sequence of elements where C(x) is true."
I take it this is trying to introduce the list equivalent to the established use of {x | C} for sets? If so, I'm not sure "C(x)" makes sense when this syntax is used, e.g. in "OrderBy(Ψ, condition) = [ μ | μ in Ψ and the sequence satisfies the ordering condition]".
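
As a loose analogy only (Python list comprehensions are not the document's notation), a per-element condition fits the "[x | C]" shape directly, whereas an ordering condition is a property of the whole sequence:

```python
solutions = [{"n": 3}, {"n": 1}, {"n": 2}]

# A per-element condition C(mu) fits the comprehension shape directly.
kept = [mu for mu in solutions if mu["n"] > 1]

# "The sequence satisfies the ordering condition" is not a test on a single
# mu, so it cannot be written as the filter part of the comprehension; it
# needs an operation over the whole sequence.
ordered = sorted(solutions, key=lambda mu: mu["n"])

print(kept)
print(ordered)
```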

=== 18.4 (Definition: ToMultiSet)

"ToMultiSet(Ψ) = { μ | μ ∈ Ψ }"
This uses the element of character (U+2208), but other definitions simply use "in" (e.g. in "Reduced(Ψ) = [ μ | μ in Ψ ]").

=== 18.4 (Definition: ListEval)

"ListEval((expr1, ..., exprn), μ) returns a list (e1, ..., en), where ei = μ(exprlisti) or error."
Is "exprlisti" meant to be "expri"?
This definition for evaluating a list of expressions seems like it's missing a subordinate way to indicate evaluating a single expression.

"Group, a function which groups a solution sequence into multiple solutions, based on some attribute of the solutions."
Is this meant to be in section 18.4.1? And is part of it missing to turn it from a noun phrase into a sentence?

=== 18.4.1

"Aggregation, a function which calculates a scalar value as an output of the aggregate expression in the SELECT clause, and in the HAVING evaluation process."
Another noun phrase.

"returns the multiset { l | L in M and l in L }"

At least in my browser, the lowercase L and the vertical line look almost identical here. Could the lowercase L be changed for some other character?

=== 18.4.1.2

How is COUNT(DISTINCT) handled? (I suspect the answer to this question also affects section 18.2.4.1 Grouping and Aggregation.) I see that DISTINCT pops back up in the evaluation semantics for Aggregation, but it's not clear to me how that information is available at that point.

=== 18.4.1.3

"Sum(S) = 0 when card[S] = 0"
Does 0 here need an explicit datatype? Or is integer implied by the lexical form used? Similarly for Avg.

=== 18.4.1.4

"Min and Max are SPARQL set functions that return the minimum and maximum value from a group respectively."
It seems strange for this to appear before the sections for Min and Max, but inside the section for Avg.

=== 18.4.1.5

"literal Min(multiset M)"
Why does the aggregate signature indicate it returns a literal? Can't Min and Max be used over IRIs?

"Min(M) = Min(ToList(Flatten(M)))"
Is there a way to express this including the syntactic constructs for ordering, instead of having to note that this definition relies on ordering in the subsequent text? Similarly for Max.

=== 18.4.1.7

"Sample is a set function which returns an arbitrary value from the multiset passed to it."
This should be in section 18.4.1.8 (Sample).

=== 18.4.1.8

"literal Sample(multiset M)"
Like Min and Max, Sample needn't be restricted to literals (and is probably more general as it can return any term).

=== 18.4 (Definition: ToMultiset)

The formatting here makes it look like the definitions for ToMultiset and Exists are within section 18.4.1.8 (Sample).

"We define the expression function "exists" using 'substitute':"
I found this confusing as 'substitute' isn't used until section 18.5.

=== 18.5

Definitions are given at the top for D, D(G), D[i], P, P1, P2, and L, but the second evaluation definition (Filter) uses F which isn't defined similarly.

"Two filter functions in support of the evaluation of EXISTS and NOT EXISTS forms which were translated to exists are defined:"
I only see the definition of 'substitute' defined here. What is the second function?

=== 18.5 (Definition: Evaluation of Aggregation)

"Aggregation applies a set function “func”..."
This is the first place I noticed it, but there are several places in the document that use smart quotes. Was that intentional?

=== 18.5 (Definition: Evaluation of AggregateJoin)

"Write A = (A1, A2, ...) where Ai = Aggregation(exprListi, funci, scalarvarsi, P)"
The "i" in "Ai" isn't properly subscripted.

"Note that if eval(D(G), Ai) is an error, it is ignored."
The "i" in "Ai" isn't properly subscripted.


=== 18.5 (Definition: Evaluation of ZeroLengthPath)

Is there a reason that ZeroLengthPath(X, path, Y) needs to include 'path'?

Unlike the definitions for ZeroOrMorePath and OneOrMorePath, this one doesn't start with "eval(D(G), ZeroLengthPath(X, path, Y))". I'm not sure if that means it's missing here, or needless in the other definitions.

"eval(D(G), ZeroLengthPath(vx:var, path, vy:var))) =  { {(vx, term), (vy, term)} | term in nodes(G) }"
'nodes(G)' should be capitalized as Nodes(G) as introduced in "Definition: Node set of a graph". Similarly in "Definition: Evaluation of ZeroOrMorePath".

=== 18.5 (Definition: Evaluation of ZeroOrMorePath)

As mentioned above, this definition starts with "eval(D(G), ZeroOrMorePath(X, path, Y)". I'm not sure if it's needed, but if it is, the parens are unbalanced.

The 'term path term' form is defined as:
"""
eval(D(G), ZeroOrMore(x:term, path, y:term)) = 
    { { } } if (x,vy:var) in eval(D(G), ZeroOrMore(x, path, vy); card[{ }] = 1 
"""
I don't understand this formulation, as I understand eval(D(G), ZeroOrMore(...)) as returning multisets of (var, term) pairs, but this seems to be looking for a (term, var) pair. Why isn't this as simple as "{ {} } if y in ALP(x, path), card[] = 1" (the opposite of the negative case which returns the empty multiset)?
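
A rough sketch of the simpler formulation suggested above. It assumes that ALP(x, path) returns the set of nodes reachable from x in zero or more steps, and it simplifies 'path' to a single predicate IRI. The function names and the example graph are illustrative only:

```python
def alp(graph, x, predicate):
    """Nodes reachable from x by following `predicate` zero or more times."""
    seen = {x}  # zero steps: x itself
    frontier = [x]
    while frontier:
        node = frontier.pop()
        for s, p, o in graph:
            if s == node and p == predicate and o not in seen:
                seen.add(o)
                frontier.append(o)
    return seen


def zero_or_more_term_term(graph, x, predicate, y):
    """[{}] (one empty solution) if y is reachable from x, else no solutions."""
    return [{}] if y in alp(graph, x, predicate) else []


g = {(":a", ":p", ":b"), (":b", ":p", ":c"), (":c", ":p", ":a"), (":c", ":q", ":d")}

print(zero_or_more_term_term(g, ":a", ":p", ":c"))  # reachable via :b
print(zero_or_more_term_term(g, ":a", ":p", ":d"))  # :d only reachable via :q
print(zero_or_more_term_term(g, ":d", ":p", ":d"))  # zero-length match
```

The `seen` set keeps the cycle :a, :b, :c from causing the traversal to loop forever.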

=== 18.5 (Definition: Evaluation of OneOrMorePath)

There's useless whitespace above this definition, but not above the previous one.

As above, this definition starts with 'eval(D(G), OneOrMorePath(X, path, Y))'. Not sure if it's necessary.

As above, the definition for the 'term path term' form seems to be looking for a (term, var) pair in the return from eval(D(G), OneOrMore(...)).

=== 18.5 (Definition: Evaluation of NegatedPropertySet)

As above in 18.4, I'm not sure how to interpret the syntactic form "μ'(μ,x)", nor what exactly μ should contain in this definition (if anything) beyond mappings for x and y.

"X" and "Y" are used in this definition in both upper- and lower-case forms.

"eval(D(G), NPS(X, S, Y)) = { μ | μ'(μ,x) is a subject of active G, μ'(μ,y) is a object of active G, and triple(μ'(μ,x), p, μ'(μ,y)) does not occur in G, for all p in S }"
This seems to suggest that μ'(μ,x) and μ'(μ,y) don't need to be the subject and object of the same triple. For example, if G contains:
:x :p1 :y
:y :p2 :z
it seems this definition would suggest NPS(:x, {:p1}, :z) = { {} }, which I don't think is right. I think instead of the "is a (object|subject)" conditions, the definition needs the condition: "there exists some q not in S s.t. triple(μ'(μ,x), q, μ'(μ,y)) in G".
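
The counterexample can be spelled out in a few lines. This is a sketch of my reading of the draft text versus the suggested condition, restricted to the case where both ends are RDF terms; the function names are made up:

```python
G = {(":x", ":p1", ":y"), (":y", ":p2", ":z")}


def nps_as_written(graph, x, S, y):
    """My reading of the draft: x is some subject, y is some object, and no
    triple (x, p, y) with p in S occurs in the graph."""
    subjects = {s for s, _, _ in graph}
    objects = {o for _, _, o in graph}
    ok = (x in subjects and y in objects
          and all((x, p, y) not in graph for p in S))
    return [{}] if ok else []


def nps_proposed(graph, x, S, y):
    """Suggested condition: some triple (x, q, y) exists with q not in S."""
    ok = any(s == x and o == y and p not in S for s, p, o in graph)
    return [{}] if ok else []


print(nps_as_written(G, ":x", {":p1"}, ":z"))  # matches, with no x-to-z triple
print(nps_proposed(G, ":x", {":p1"}, ":z"))    # no match
print(nps_proposed(G, ":y", {":p1"}, ":z"))    # matches via :p2
```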

=== 18.6.1

"SG will often be graph equivalent to AG, but restricting this to E-equivalence allows some forms of normalization, for example elimination of semantic redundancies, to be applied to the source documents before querying."
I'm not sure what "source documents" means here. What I think I understand from this is an indication that the entailment might eliminate redundancies in the underlying RDF, but while that's true, I think it's also true of any SPARQL system insofar as SPARQL Query only discusses query evaluation *after* data is somehow populated in the working dataset. In fact, it may be the case that there never is a "source document," as the RDF may be input (and redundancies eliminated) directly via an API.

"This allows query protocols in which blank node identifiers retain their meaning between the query and the source document, or across multiple queries."
Again regarding "source document."
Received on Tuesday, 6 December 2011 12:41:46 UTC