Re: Query review, part 2 (ACTION-546) from Andy Seaborne on 2011-12-08 (public-rdf-dawg@w3.org from October to December 2011)

From: Andy Seaborne <andy.seaborne@epimorphics.com>
Date: Thu, 08 Dec 2011 15:45:23 +0000
To: public-rdf-dawg@w3.org
Message-ID: <4EE0DB93.6080701@epimorphics.com>
Greg - thanks for the review.

On 06/12/11 12:40, Gregory Williams wrote:
> Here's the second part of my review of the query document. I had a few questions about aggregate handling that aren't dealt with here because I see things are still being changed in that section. As soon as it stabilizes, hopefully my questions will simply be resolved.

As of yesterday, I believe the aggregate section is now finished.

Steve - various "@@Steve".  I've made the editorial changes where it 
looks like an obvious change to be made but not addressed non-editorial 
comments.

 Andy

>
> thanks,
> .greg
>
>
>
> === 18.1.3 (Definition: RDF Dataset Merge)
>
> s/equal to<uk>  N2/equal to<uk>  in N2/

Done

>
> === 18.1.7
>
> "We call the object of tn the end of the path."
> This depends on n = length(ST)-1, but I don't see that being defined anywhere.

Fixed

>
> === 18.1.10
>
> "A SPARQL Abstract Query is a tuple (E, DS, QF)"
> Is QF meant to be simply one of { SELECT, CONSTRUCT, ASK, DESCRIBE }? If so, does the abstract query contain either the CONSTRUCT pattern or the DESCRIBE list (where relevant)?

It would if we had formalized CONSTRUCT, ASK, DESCRIBE.

>
> === 18.2
>
> "Property path expressions are written to produce triple patterns and introduce four forms, ZeroLengthPath, ZeroOrMorePath, OneOrMorePath, and NegatedPropertySet."
> I don't really understand this. "to produce triple patterns" sounds like a discussion of only fixed-length property paths, but the "forms" discussed are for property paths that aren't simply equivalent to triple patterns. Also, it's not clear what "form" is meant to mean here as the subsequent text calls them "symbols in the SPARQL algebra."

Is this better?

"Property path expressions are written to produce triple patterns
  and algebra forms, ZeroLengthPath, ZeroOrMorePath, OneOrMorePath,
and NegatedPropertySet as necessary."

Other suggestions?

"form" meaning "symbols in the SPARQL algebra." is fine.  At this stage, 
it's a rewrite of (abstract) syntax to (abstract) algebra as a data 
structure.

> I see you added Group and AggregateJoin to the list of algebra symbols, but I think the table is still missing Aggregation.

"Aggregation" added.

>
> === 18.2.1
>
> In the in-scope rules table, the rule for "Group { P1 P2 ... }" is formatted in a way that makes "Group" seem like SPARQL syntax, but I believe it's meant to just convey the syntax form for a GGP, right?

Correct - but removing the word risk confusing "{ P1 P2 ...}" and a form 
taking a keyword and a {} part , e.g. OPTIONAL { P1 P2 } or {}UNION{}

>
> The rule for "SERVICE term {P}" seems to allow "term" to be a variable, but that's not going to be part of the federation spec, right?

It's going in non-normatively and so the syntax and handling should be 
done where it does not create problems.

> Some of the in-scope table entries seem to describe the *condition* for when the variable is in-scope (such as when "v occurs in the BGP"), but others seem to simply describe the in-scope rule:
> - "v is in-scope" for the "(expr AS v)" form
> - "v is in-scope if v is mentioned as a project variable" for the "SELECT ..v.. { P }" form
> - "v is in-scope if v is in varlist" for the "BINDINGS varlist (values)" form

Do you see this as a problem?

It would be nice to write in rule form but writing that for a BGP is 
going to be verbose.

To convert would merely to move the v to the other column and have

"(expr AS X) for .."  => "v is in-scope if v = X"

>
> === 18.2.2
>
> "Applying the simpification step after all the translation of graph patterns is the preferred reading."
> Should this sentence be inside the red-outlined note box?

The box is as-is from SPARQL 1.0 and the sentence added for SPARQL 1.1. 
  We decided IIRC to leave the box alone.  (I'd prefer to remove it all 
together.)

> === 18.2.2.5
>
> "We introduce the following symbols: Join(Pattern, Pattern), LeftJoin(Pattern, Pattern, expression), Filter(expression, Pattern)"
> Is this necessary? The already-used symbols Union, Graph, and exists aren't explicitly introduced in this way, but all of these are mentioned in the table in 18.2.

I see your point.  The simpler translations don't mention the one symbol 
they introduce.

This is unchanged from SPARQL 1.0 and I'm inclined to leave things as 
they are because people may look through and see a change-in-text which 
in fact isn't changing SPARQL.

> "Let G := the empty pattern, Z, a basic graph pattern which is the empty set."
> I can't parse this sentence. Z doesn't seem to be used in this definition.

Z is used in the simpification step which used to be directly after this 
block.

Removed from this point in the doc.

Expanded the simplification section text.

> === 18.2.3 Examples of Mapped Graph Patterns
>
> Should this section be marked as informative as it's just examples?

We have examples in many normative sections elsewhere.

A different style would have been a completely formal normative part of 
the doc with introductory and example material elsewhere.

Having the examples near the material they show is also helpful.

> "The second form of a rewrite example is the first with empty group joins removed by the simplification step."
> I'm not sure I understand this sentence.

It shows the before-and-after of simpification.

Do you have better wording to suggest?

>
> "BGP( ?s :p1 ?v1 .?s :p2 ?v2 )"
> The whitespace is odd in this syntax, but I'm more curious about the choice of '.' as a separator for triples in the serialization of the BGP algebra.

Whitespace fixed.

DOT separated triples in SPARQL syntax so I left it in to show the 
division of triples.

> """
> Union(
>      Union( BGP(?s :p1 ?v1) ,
>             BGP(?s :p2 ?v2),
>      BGP(?s :p3 ?v3))
> """
> The parens don't balance here.
>
> """
> LeftJoin(
>      Join(Z, BGP(?s :p1 ?v1)),
>      Join(Z, BGP(?s :p2 ?v2)) ),
>      true)
> """
> The parens don't balance here.

Fixed.

> "{ ?s :p1 ?v1 FILTER (?v1<  3 ) OPTIONAL {?s :p2 ?v2} } }"
> The braces don't balance here.

Fixed.

>
> === 18.2.4.1
>
> "If the GROUP BY keyword is used, or there is implicit grouping due to the use of aggregates in the projection..."
> Is it possible to have an implicit grouping based on the use of aggregates in only the HAVING clause, and not the projection?

Yes and no.

Observation: if you have just "HAVING aggregate", then there is no 
possible legal SELECT clause.

Only GROUP BY variables and aggregates would be legal
There aren't any GROUP BY variables and you are askign about none in 
projection.

SELECT * is illegal if there is an aggregate (implicit group).

So I hope the answer is "yes" and it just falls out there are no legal 
queries.

@@Steve?

> "It divides the solution into groups of one or more solutions..."
> s/the solution/the solutions/ or /the solution set/

@@Steve

I'd like to see a box here because Group(...) is used in the next step.

> === 18.2.4.2
>
> Is the algorithm given in this section redundant with the end of the algorithm given in 18.2.4.1?

@Steve

/me "yes" but I'd remove it from 18.2.4.1  It's not in the example of 
18.2.4.1.

>
> === 18.2.4.3
>
> What is 'M' in this section? I think can figure it out by context, but I think it should be made explicit.

Fixed.

And reformatting applied.

> === 18.2.4.4
>
> In the altorithm in this section, 'union' is spelled out, but earlier (e.g. in 18.2.2.5) the union character (U+222A) is used.

Changed.

>
> "variable must not appear in VS; if it does then generate a syntax error and stop"
> I think this should also prevent the variable from appearing in P (the list of already projected variables).

VS is defined as the variables in the { pattern } from above.

> === 18.2.5
>
> "The solution modifiers are applied to a query in the following order: ... Offset, Limit."
> I'm not sure what applying "to a query" means. Maybe 'applied to a solution sequence'?

I think the sentence before:
"Solutions modifiers apply to the processing of a SPARQL query after 
pattern matching."

puts the use of query here in context.  It is the query begin 
transformed and we're producing an algebra form.  Taking about "applied" 
seems more execution time to me.

> Also, if we're talking about applying the modifiers to a solution set/sequence, then Offset/Limit should instead be the single Slice operation as OFFSET/LIMIT are just syntactic expressions of the same modifier.

We're talking about syntax elements of the query syntax.

> === 18.2.5.2
>
> "The set of projection variables was calculated in the processing of SELECT expressions."
> Can a link be added to section 18.2.4.4?

Do you mean from 18.2.5.2 to 18.2.4.4 (and not "add a link to the 
contents of 18.2.4.4")?

Done.

> "where vars is the set of variables mentioned in the SELECT clause or all named variables that are in-scope in the query if SELECT * used."
> I think the wording here should include the variable P that is constructed in 18.2.4.4. Otherwise, I think "mentioned in the SELECT clause" might be ambiguous.

"The set of projection variables, PV, was calculated in the
processing of SELECT expressions." (with link)

(and changed P to PV earlier.)

>
> === 18.3
>
> Definition: Compatible Mappings
> "μ1(v) = μ2(v)"
> has this syntax been introduced before to mean the term mapped to variable v in μ?

Sec 18.1.8 says it's a partial function and μ(v) is function application.

I have rewritten the -> forms just above this point as the -> form isn't 
used for this anymore.

> Also, do we need to be explicit about what the equality operation is doing here (i.e. is it sameTerm, entailment-based, etc.)?

It's sameTerm (which follows because μ(v) is a term).

Entailment happens inside BGP matching.

Added:

"Here, μ1(v) = μ2(v) means that μ1(v) and μ2(v) are the same RDF term."

>
> === 18.3.2
>
> "Since SPARQL treats blank node identifiers in a SPARQL Query Results XML Format document..."
> This should be generalized to include the other result formats.

Text/Links to the 3 docs added.

> === 18.4 (Definition: Diff)
>
> "Let Ω1 and Ω2 be multisets of solution mappings."
> The definition for LeftJoin also includes "and expr be an expression". Should it be included here?

Done

>
> === 18.4 (Definition: Evaluation of NegatedPropertySet)
>
> Is there a reason the title of this section uses "Evaluation of" when the other path operators don't?

Fixed

>
> "... and write μ' as the extension of a solution mapping such that μ'(μ,x) = μ(x) if x is a variable and μ'(μ,t) = t if t is an RDF term;"
> I don't understand this as written, and don't see anything that would indicate the actual definition of a negated property set. Also, I wonder why the sentence ends in a semicolon.

The definition has been rewritten as:


"""
A NegatedPropertySet NPS(X, S, Y), where X and Y are
variables or RDF terms, and S is a set of IRIs,
describes a match where X and Y are the subject and object respectively
of a triple but the property IRI of the triple is not one of the IRIs in S.
"""
and the details are in the evaluation definition.

(see later comment)

> === 18.4 (Definition: Extend)
>
> "expr be an expression"
> This links to section 17 for 'expression', but other uses of this phrase don't include the link.

True.  Extend is the form that uses the value of an expression, not just 
it boolean evaluation.

I'm not inclined to reformat all the other sections at this stage so 
left as-is.

> === 18.4
>
> "Write [x | C] for a sequence of elements where C(x) is true."
> I take it this is trying to introduce the list equivalent to the established use of {x | C} for sets? If so, I'm not sure "C(x)" makes sense when this syntax is used, e.g. in "OrderBy(Ψ, condition) = [ μ | μ in Ψ and the sequence satisfies the ordering condition]".

Yes - it's sequence notation.

Is this better?

"Write [ x | C ] for a sequence of elements where C is a condition on x."

> === 18.4 (Definition: ToMultiSet)
>
> "ToMultiSet(Ψ) = { μ | μ ∈ Ψ }"
> This uses the element of character (U+2208), but other definitions simply use "in" (e.g. in "Reduced(Ψ) = [ μ | μ in Ψ ]").

It's not really right either. ∈ is usually set membership, and this 
isn't a set.  Changed in "in"

> === 18.4 (Definition: ListEval)
>
> "ListEval((expr1, ..., exprn), μ) returns a list (e1, ..., en), where ei = μ(exprlisti) or error."
> Is "exprlisti" meant to be "expri"?

Looks like it.  Changed.

@@Steve

> This definition for evaluating a list of expressions seems like it's missing a subordinate way to indicate evaluating a single expression.

@@Steve

> "Group, a function which groups a solution sequence into multiple solutions, based on some attribute of the solutions."
> Is this meant to be in section 18.4.1? And is part of it missing to turn it from a noun phrase into a sentence?

@@Steve

> === 18.4.1
>
> "Aggregation, a function which calculates a scalar value as an output of the aggregate expression in the SELECT clause, and in the HAVING evaluation process."
> Another noun phrase.

@@Steve

> "returns the multiset { l | L in M and l in L }"
>
> At least in my browser, the lowercase L and the vertical line look almost identical here. Could the lowercase L be changed for some other character?

Same here.  Changed to x

> === 18.4.1.2
>
> How is COUNT(DISTINCT) handled? (I suspect the answer to this question also affects section 18.2.4.1 Grouping and Aggregation.) I see that DISTINCT pops back up in the evaluation semantics for Aggregation, but it's not clear to me how that information is available at that point.

@@Steve

>
> === 18.4.1.3
>
> "Sum(S) = 0 when card[S] = 0"
> Does 0 here need an explicit datatype? Or is integer implied by the lexical form used? Similarly for Avg.

@Steve

(it is "0"^^xsd:integer)

Should we put in "0"^^xsd:integer explicitly?

> === 18.4.1.4
>
> "Min and Max are SPARQL set functions that return the minimum and maximum value from a group respectively."
> It seems strange for this to appear before the sections for Min and Max, but inside the section for Avg.

@@Steve

I'm assuming it's misplaced text and I've moved to the start of "min" 
but, Steve, that's just so it's better then currently.  Maybe something 
else was meant.

> === 18.4.1.5
>
> "literal Min(multiset M)"
> Why does the aggregate signature indicate it returns a literal? Can't Min and Max be used over IRIs?

@@Steve

Good question.  I tried this in ARQ and it does use the ordering 
condition and that's compatible with the definition, except the literal 
bit.  (I was not sure what ARQ was going to do here!)

> "Min(M) = Min(ToList(Flatten(M)))"
> Is there a way to express this including the syntactic constructs for ordering, instead of having to note that this definition relies on ordering in the subsequent text? Similarly for Max.

@@Steve

> === 18.4.1.7
>
> "Sample is a set function which returns an arbitrary value from the multiset passed to it."
> This should be in section 18.4.1.8 (Sample).

Moved.

>
> === 18.4.1.8
>
> "literal Sample(multiset M)"
> Like Min and Max, Sample needn't be restricted to literals (and is probably more general as it can return any terms).

@@Steve

> === 18.4 (Definition: ToMultiset)
>
> The formatting here makes it look like the definitions for ToMultiset and Exists are within section 18.4.1.8 (Sample).

Moved, and I moved ListEval into the aggregate section just after Group.

@@Steve

>
> "We define the expression function "exists" using 'substitute':"
> I found this confusing as 'substitute' isn't used until section 18.5.

Text removed.

> === 18.5
>
> Definitions are given at the top for D, D(G), D[i], P, P1, P2, and L, but the second evaluation definition (Filter) uses F which isn't defined similarly.

Added

>
> "Two filter functions in support of the evaluation of EXISTS and NOT EXISTS forms which were translated to exists are defined:"
> I only see the definition of 'substitute' defined here. What is the second function?

Fixed (there is only one).

>
> === 18.5 (Definition: Evaluation of Aggregation)
>
> "Aggregation applies a set function “func”..."
> This is the first place I noticed it, but there are several places in the document that use smart quotes. Was that intentional?

@@Steve

It's probably easier to use fixed quotes.

But else where thing are italicised to give a name to a concept.

> === 18.5 (Definition: Evaluation of AggregateJoin)
>
> "Write A = (A1, A2, ...) where Ai = Aggregation(exprListi, funci, scalarvarsi, P)"
> The "i" in "Ai" isn't properly subscripted.

There are now.  Must have got fixed recently.

> "Note that if eval(D(G), Ai) is an error, it is ignored."
> The "i" in "Ai" isn't properly subscripted.

Ditto.

> === 18.5 (Definition: Evaluation of ZeroLengthPath)
>
> Is there a reason that ZeroLengthPath(X, path, Y) needs to include 'path'?

It is for consistency with the other path operators.

> Unlike the definitions for ZeroOrMorePath and OneOrMorePath, this one doesn't start with "eval(D(G), ZeroLengthPath(X, path, Y))". I'm not sure if that means it's missing here, or needless in the other definitions.

Added

>
> "eval(D(G), ZeroLengthPath(vx:var, path, vy:var))) =  { {(vx, term), (vy, term)} | term in nodes(G) }"
> 'nodes(G)' should be capitalized as Nodes(G) as introduced in "Definition: Node set of a graph". Similarly in "Definition: Evaluation of ZeroOrMorePath".

Changed the definition to "nodes".

> === 18.5 (Definition: Evaluation of ZeroOrMorePath)
>
> As mentioned above, this definition starts with "eval(D(G), ZeroOrMorePath(X, path, Y)". I'm not sure if it's needed, but if it is, the parens are unbalanced.

Fixed

>
> The 'term path term' form is defined as:
> """
> eval(D(G), ZeroOrMore(x:term, path, y:term)) =
>      { { } } if (x,vy:var) in eval(D(G), ZeroOrMore(x, path, vy); card[{ }] = 1
> """
> I don't understand this formulation, as I understand eval(D(G), ZeroOrMore(...)) as returning multisets of (var, term) pairs, but this seems to be looking for a (term, var) pair. Why isn't this as simple as "{ {} } if y in ALP(x, path), card[] = 1" (the opposite of the negative case which returns the empty multiset)?

I think "in" is clear to mean a pair in the multiset of sets but I've 
added {} round.  Does that help?

> === 18.5 (Definition: Evaluation of OneOrMorePath)
>
> There's useless whitespace above this definition, but not above the previous one.

Removed

> As above, this definition starts with 'eval(D(G), OneOrMorePath(X, path, Y))'. Not sure if it's necessary.

Left as is - it picks out the overall form being described.

>
> As above, the definition for the 'term path term' form seems to be looking for a (term, var) pair in the return from eval(D(G), OneOrMore(...)).

As above.

>
> === 18.5 (Definition: Evaluation of NegatedPropertySet)
>
> As above in 18.4, I'm not sure how to interpret the syntactic form "μ'(μ,x)", nor what exactly μ should contain in this definition (if anything) beyond mappings for x and y.

I've added an explicit explanation of μ'

μ'(μ,x) = μ(x) if x is a variable
μ'(μ,t) = t if t is a term

> The use of "X" and "Y" are used in this definition in both upper- and lower-case forms.

Lower-cased.

>
> "eval(D(G), NPS(X, S, Y)) = { μ | μ'(μ,x) is a subject of active G, μ'(μ,y) is a object of active G, and triple(μ'(μ,x), p, μ'(μ,y)) does not occur in G, for all p in S }"
> This seems to suggest that μ'(μ,x) and μ'(μ,y) don't need to be the subject and object of the same triple. For example, if G contains:
> :x :p1 :y
> :y :p2 :z
> it seems this definition would suggest NPS(:x, {:p1}, :z) = { {} }, which I don't think is right. I think instead of the "is a (object|subject)" conditions, the definition needs the condition: "there exists some q not in S s.t. triple(μ'(μ,x), q, μ'(μ,y)) in G".

yes - the eval defn is wrong.

Changed to:

"""
Write μ' as the extension of a solution mapping:
     μ'(μ,x) = μ(x)   if x is a variable
     μ'(μ,t) = t   if t is a RDF term

Let x and y be variables or RDF terms, and S a set of IRIs:

eval(D(G), NPS(x, S, y)) =
  { μ | ∃ triple(μ'(μ,x), p, μ'(μ,y)) in G, such that the IRI of p ∉ S }
"""

>
> === 18.6.1
>
> "SG will often be graph equivalent to AG, but restricting this to E-equivalence allows some forms of normalization, for example elimination of semantic redundancies, to be applied to the source documents before querying."
> I'm not sure what "source documents" means here. What I think I understand from this is an indication that the entailment might eliminate redundancies in the underlying RDF, but while that's true, I think it's also true of any SPARQL system insofar as SPARQL Query only discusses query evaluation *after* data is somehow populated in the working dataset. In fact, it may be the case that there never is a "source document," as the RDF may be input (and redundancies eliminated) directly via an API.
>
> "This allows query protocols in which blank node identifiers retain their meaning between the query and the source document, or across multiple queries."
> Again regarding "source document."

In RDF, there is only syntax in documents.  It does nto really consider 
API construction of data.  I think tha's why it talks about "source 
documents"; it's trying to not use the word "graph".

I'm not sure how the wording here could be changed. Any suggestions?

 Andy
Received on Thursday, 8 December 2011 15:46:05 UTC