Re: Review SPARQL Query 1.1, Section 18 (algebra) from Steve Harris on 2011-03-15 (public-rdf-dawg@w3.org from January to March 2011)

From: Steve Harris <steve.harris@garlik.com>
Date: Tue, 15 Mar 2011 23:39:02 +0000
To: Birte Glimm <birte.glimm@comlab.ox.ac.uk>
Cc: SPARQL Working Group <public-rdf-dawg@w3.org>, Andy Seaborne <andy.seaborne@epimorphics.com>
Message-Id: <3AAC72FA-AF36-40BC-9DD1-ED0812463897@garlik.com>
On 2011-03-15, at 00:34, Birte Glimm wrote:

> Hi Andy, Steve, all,
> I finally managed to complete the review for the algebra section,
> which took me quite a long time and I have quite a number of problems
> with it and/or suggestions for improvement, see below.
> Best regards,
> Birte
> 
> First some comments about non-algebra sections that I noticed while
> going back and forth between the algebra and the examples in the
> previous sections:
> 
> 11.2 The example has an unaliased aggregate, but in 11.1 it says that
> "It should be noted that as per functions, aggregate expressions must
> be aliased in order to project them from queries or subqueries." In
> that sentence the must should be marked as RFC 2119 term:
> <rfc2119>MUST</rfc2119>.
> Furthermore, the variable ?a suddenly disappears in the provided
> solution sequence, so probably ?a was meant to be :a (it is not used
> anywhere anyway).

Ah, thanks.

The ?a has to be there, otherwise you'd get permutations of :x and :y, not pairs. The algebra only illustrates the Aggregate part of the query though.

[snip]

> 18.2.4.1 It divides the solution into *groups of* one or more
> solutions, with the same overall cardinality.
> 
> Algorithm:
> Let P := the *algebra translation* of the group graph pattern of the
> query level
> According to the grammar, the GROUP BY clause can alo contain
> BuiltInCalls and FunctionCalls, which is not handled here and probably
> shouldn't be allowed.

I'm not sure I agree that BuiltInCalls and FunctionCalls are an issue, I think they should come out in the Group(exprlist, P) eval phase.

> The A_i are never combined into A, so A is consistently the empty set.
> Instead of constructing A_i, add the Aggregation(...) to A, or set
> A:=A_1, ..., A_i-1 before building P.

OK, I took that as read, but it can't hurt to spell it out.

> I don't understand whether the outer for loop is indeed iterating over
> all expressions in SElECT and each HAVING, each time executing the
> inner IF and two FOR loops as the indentation suggests or whether the
> second and third for loop run after the first one is done That's what
> I would expect). 

Yes, I guess it could be the other way round - I found it clearer as an operation that loops over the expressions though.

>                   The first and second FOR loop should probably be
> extended to handle SELECT * and the SAMPLE replacement is only
> required if the variable/variable in the expression is also not
> grouped in GROUP BY.

I think * has already been expanded by this point in the execution.

I believe SAMPLE replacement is always needed as you can't eval() a variable over a group.

> It took me a loooong time to understand agg_i. Since aggregates anyway
> have to be aliased, wouldn't it be simpler to use the alias straight
> away and give that even into the Aggregation(...) objects, so that
> proper mappings can be constructed straight away? Even you you don't

That was what I tried to do at first, but I think there's a scope issue that prevents it.

> want to do that, I suggest to give agg_i as the first parameter into
> Aggregation(...) since otherwise it kind of falls from the sky later
> on. Also worth noting that each agg_i must be a fresh variable
> distinct from all other agg_j and variable occurring in Q.
> 
> I don't see why G is needed in AggregateJoin(A, G) since each A_i
> already has G as parameter and G is not needed as a parameter to
> AggregateJoin.

Yes, agreed.

> The example mises a brace before COUNT and the ?x should be ?a. I
> suggest to extend the example so that also the changed Q with agg_i
> instead f the aggregates is shown and A := (A_1, A_2) would help
> understanding (in particular since it is now missing totally I
> guess).

I did try and write out the changed Q, but it was if anything more confusing as it mixes syntax and algebra.

> Definition: ListEval
> I suggest to rephrase
> ListEval(exprlist, μ) returns a list E, where Ei =
> μ(exprlisti) or error.
> into
> ListEval((expr_1, ..., expr_n), μ) returns a list (e_1, ..., e_n), where e_i =
> μ(expr_i) or error.
> It took me a long time to understand that E is the list of E_i, which
> are to be constructed from the given expressions in the list. Making
> the list elements more obvious might help.

OK, this is similar to the A / A_i confusion, your version seems clearer.

> ListEval((unbound)) = (error), as the evaluation of an unbound
> expression is an error.
> Here ListEval is missing an argument (μ):
> ListEval((unbound), μ) = (error)
> 
> Definition: Group
> Uses a "nice" arrow to define mappings for a function. Previously the
> spec used -> for that purpose. The proper symbol would actually be
> what Latex does with \mapsto, which has a small vertical bar at the
> left end of the arrow. I suggest to stick to one notations. e.g., keep
> using ascii ->.

OK, done.

> For the example following the Group Definition, I suggest to first
> give the query and it might be more helpful to use a unary aggregate
> as all the standard ones defined by SPARQL are unary.

Well, without that n-ary aggregate there's no example anywhere that covers why it's a list of lists.

> The grouping actually maps to multisets of existing solution mappings,
> but here the ?x mapping got lost. It might make sense to label the
> rows in the solution sequence table with μ_1 to μ_3, then you can say:
> We produce G = Group((?x), Ω) = { (1) -> { μ_1, μ_2 }, (2) -> { μ_3 } }
> Note that in the spec the example just uses , whereas the definition
> introduces the nice arrow, which is -> elsewhere. Also, the example
> looses the ?x mappings, but according to the Group definition, we
> should map from the key to the solution complete mappings for that
> key.

The version in CVS uses (a, b), rather than (a -> b), though it does make the expression somewhat harder to read.

Labelling the solution does make it easier to follow though.

> I also would like to see the algebra translation since the
> algebra transformation for aggregates rewrites the query. As I
> understand it, we get
> Q = SELECT (?agg_1 AS ?agg) WHERE { ?x ?y ?z } GROUP BY ?x

SELECT (agg_1 AS ?agg) WHERE { ?x ?y ?z } GROUP BY ?x

but close enough.

> and the algebra object:
> AggregateJoin(Aggregration((?y, ?z), ex:agg, {}, G), G)
> from the algebra translation up to the group by and aggregate steps.
> From the select expression translation, I would then get
> Extend(AO, ?agg, ?agg_1)
> with AO being the so far computed algebra object
> AggregateJoin(...). Finally, I get Project(AO, {?agg})
> with AO being Extend(...)
> 
> At this point, it has not yet been defined how Aggregate(...) or
> AggregateJoin(...) is evaluated, so I suggest to say that you come
> back to the example later instead of evaluating Aggregate (but not
> AggregateJoin) although the reader doesn't yet know how that is
> defined.

Yes, I ran into that problem a few times.

> Definition: Flatten
> I suggest to use lower case characters for the actual elements of a
> (multi)set and upper case characters for (multi)sets.
> The Flatten(M) function takes a multiset M = { L_1, ..., L_n } of
> lists and returns { l | L in M and l in L }.

OK, yes, that's clearer.

> or
> The Flatten(M) function takes a multiset M = { L_1, ..., L_n } of
> lists and returns the multiset union of all multisets M_i = { l | l in
> L_i } for all i with 1 <= i <= n.
> 
> Count is a SPARQL set function which counts the number of times a
> given expression has a bound, and non-error value with*in* the
> aggregate group.
> 
> Why is COUNT the only function getting also an xsd:integer err as
> input? The function is unary everywhere else and err is not used in
> the definition.

Ah, that's left over form a previous version.

> All the later aggregate definitions overload the function by defining
> it first for a multiset, which internally defines the same function
> for a list. It can only be assumed that S ind Sum(S) is then a
> list. It might be clearer to say:
> Sum(multiset M) = SumL(ToList(Flatten(M)))
> SumL(list L) = op:numeric-add(...)

S is a sequence, for compatibility with XPath F&O. But I'm not sure we exploit that anymore.

I'm not sure about typing the arguments, but I'll try it tomorrow, and see if it's clearer.

> Also here |S| is used to denote the cardinality, but that hasn't been

Fixed the || v's card[].

> defined. For (multi)sets card[M] is used. Also S_0 and S_1...n is used
> to refer to the first (index 0) element of the list and the sublist
> containing the second to last element, which should then n-1 if you
> start from 0. This should be clarified or you just define

I've changed it to 1-based.

> SumL( (s_1, ..., s_n) ) = ... n > 1
> SumL( (s_1) ) = ... and
> SumL( () ) = 0
> I suggest to put the example outside of the definition and it seems to
> be wrong. It should be:
> Sum({1,2,3})=op:numeric-add(1, op:numeric-add(2, op:numeric-add(3,
> 0))).
> 
> For Min(M) I also suggest to use MinL internally, e.g.,
> Min(M) = MinL( ToList(M) )
> MinL( (l_1, ..., l_n) ) orders (l_1, ..., l_n) according to ORDER BY
> ASC and returns the first element of the ordered sequence.
> Same for Max.

Yes, I think that will be clearer, but I've not made the change yet.

> Definition: Sample
> I suggest to say already in the definition that v is *an arbitrary*
> element from Flatten(M).

I think this is sufficiently clear, but if others agree I'll add another para about it.

> Definition: ToMultiSet
> ... The order and any duplicates in the sequence have no effect on the
> resulting multiset.
> It's not clear what it means for duplicates to have no effect. Do they
> occur only once in the resulting multiset, i.e., only the  first
> element has an effect in that it contributes a value whereas the second
> or more occurrences don't contribute? Or does it means duplicates get
> no special treatment and each one contributes an element? How about
> saying that duplicates in the list are (not) preserved in the
> resulting multiset. An example might also help.

OK.

> The definition for the evaluation of Group(...) (from aggregats) is missing.

... I ran out of time at this point, but I'll come back to it on Thursday ...

> Definition: Evaluation of Aggregation
> Aggregation applies a set function “func” to a multiset of lists of
> expressions and a grouped solution sequence, G as produced by the Group
> function. It produces a single value for each key and partition for that key
> (key, X).
> should be
> Aggregation applies a set function “func” to a multiset of lists of
> expressions and a grouped solution sequence, ***P*** (G is used in
> D(G) for the active graph) as produced by the Group function. It
> produces a single value for each key and partition for that key (key
> ****->***** X).
> M should be defined as:
> M = { L | L = ListEval(exprlist, μ) for μ in ran(g) }
> or
> M = { ListEval(exprlist, μ) | μ in range(g) }
> since ran(g) is a set of solution mappings, whereas ListEval is
> defined for an expression list and one solution mapping. Also
> range(function) has not been defined before although it is pretty
> obvious, but for the domain of a solution there is a special
> definition.
> 
> Special Case: ... will be *the* cardinality of ...
> 
> Can I actually use COUNT(DISTINCT *)? That doesn't seem to make much
> sense to me.
> 
> Note that the purpose of the expression card[range(g)] - card[M] is to
> indicate to the set function the number of expressions that evaluated
> to an error.
> card[range(g)] is only used in COUNT(*), how does that relate to error
> indication in general?
> 
> For example the aggregate expression GROUP_BY(?x ; separator="|")
> should be GROUP_CONCAT
> 
> 
> Definition: Evaluation of AggregateJoin
> ...eval(D(G), AggregateJoin(A, P) ***)***
> missing brace
> 
> I would much prefer (as said above) if the temporary variables agg_i
> were given as parameter to each A_i = Aggregation(...) since otherwise
> the algebra object doesn't contain all the information required to
> evaluate it. I have to know which variable name was used for each A_i,
> since I can hardly assume it is really agg_i given that agg_i is not a
> forbidden variable name and an evil user might use it within a
> query.
> 
> I also assume the definition is wrong, I want
> eval(D(G), AggregateJoin(A, P))={ (agg_1, v_1), ..., (agg_n, v_n) |
> v_i such that ( k, v_i ) in eval(D(G), A_i) for some k and each 1 <= i
> <= n }
> 
> Definition: Evaluation of Extend
> eval(D(G), extend(var, expr, P)) = extend(var, expr , eval(D(G), P))
> should be
> eval(D(G), Extend(P, var, expr)) = Extend(eval(D(G), P), var, expr)
> Extend gets the algebra object as first parameter, which is then
> evaluated into a solution sequence and again first char of Extend
> should be upper case.
> 
> I suggest at this point to go back to the aggregation example from
> 18.4:
> I got
> Project(Extend(AJ, ?agg, ?agg_1), {?agg})
> with
> AJ=AggregateJoin(Aggregration((?y, ?z), ex:agg, {}, G), G)
> and
> G = { (1) -> { μ_1, μ_2 }, (2) -> { μ_3 } }
> Aggregration((?y, ?z), ex:agg, {}, G) gives
> g_1 :  (1) -> { μ_1, μ_2 }
> g_2 : (2) -> { μ_3 }
> so we get
> {
> ( (1), ex:agg( { (2, 3) , (3, 4) }, {} ) ),
> ( (2), ex:agg( { (5, 6) }, {}) )
> }
> Lets for simplicity assume ex:agg is actually SUM and the query had
> SUM(?y), then I would get
> {
> ( (1), Sum( { (2) , (3) } ) ),
> ( (2), Sum( { (5) } ) )
> }
> = { ( (1), 5 ), ( (2), 5 ) }
> From the AggregateJoin we then get
> { { (agg_1, 5) }, { (agg_1, 5) } }
> where each set in the multiset is one solution mapping.
> 
> Extend(...) then gives
> { { (agg_1, 5), (agg, 5) }, { (agg_1, 5), (agg, 5) } }
> although I am not sure where it says that eval(agg_1) gives 5. The
> link for expression links to Section 17 and I can't see that just
> using a variable is fine...
> 
> Finally, Project(...) gives two solutions both mapping agg to 5.
> 
> 
> Definition: Evaluation of ZeroLengthPath
> I suggest to define Nodes(G) for a graph G above the ZeroLengthPath
> definition since it is also used elsewhere, e.g., in the evaluation of
> OneOrMorePath.
> The notion xxx:type is used here for the first time and it might b
> worth saying something like: For a term (or variable) x, x:t
> denotes that x is of type t.
> term in nodes(G) should be term in nodes(D(G))
> card[ ] is not clear
> is the last card[ ] not 0 assuming it denotes the cardinality for each
> mapping in the solution?
> 
> I can't fully understand the function ALP. I guess eval(x, path) needs
> D(G) and then does something like Bgp(x path ?y) evaluated over
> D(G). The path here is a normal predicate I think and it might help to
> make that clear.
> 
> Definition: Evaluation of ZeroOrMorePath
> Should produce multisets of solution mappings, i.e.,
> { { (vy, n) } | n in ALP(x, path) }
> (note extra curly brackets) and not
> { (vy, n) | n in ALP(x, path) }
> same for all eval defs in the definition
> 
> Definition: Evaluation of OneOrMorePath
> eval(D(G), OneOrMorePath(X, path, Y)
> is missing the final closing brace
> 
> still has
> @@Change to algorithmic form??
> 
> Definition: Evaluation of NegatedPropertySet
> where X and Y *are* (not for) variables or RDF terms
> In this definition and the remaining ones, D(G) is required, but only
> D is written
> 
> The second
> Definition: Evaluation of ToList
> should be
> Definition: Evaluation of Subquery
> 
> 18.6
> The overall SPARQL design can be used for queries which assume a more
> elaborate form of entailment than simple entailment, by re-writing the
> matching conditions for basic graph patterns.
> 
> This is no longer true due to PPEs. I am not happy about this at all
> and I assumed that PPEs are optional features. If they are not, it is
> quite unfortunate that the so far existing extension point no longer
> really is one and something has to be done at least to clarify this!
> 
> 
> -- 
> Dr. Birte Glimm, Room 309
> Computing Laboratory
> Parks Road
> Oxford
> OX1 3QD
> United Kingdom
> +44 (0)1865 283520

-- 
Steve Harris, CTO, Garlik Limited
1-3 Halford Road, Richmond, TW10 6AW, UK
+44 20 8439 8203  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
Received on Tuesday, 15 March 2011 23:39:40 UTC