- From: Steve Harris <steve.harris@garlik.com>
- Date: Tue, 15 Mar 2011 23:39:02 +0000
- To: Birte Glimm <birte.glimm@comlab.ox.ac.uk>
- Cc: SPARQL Working Group <public-rdf-dawg@w3.org>, Andy Seaborne <andy.seaborne@epimorphics.com>
On 2011-03-15, at 00:34, Birte Glimm wrote: > Hi Andy, Steve, all, > I finally managed to complete the review for the algebra section, > which took me quite a long time and I have quite a number of problems > with it and/or suggestions for improvement, see below. > Best regards, > Birte > > First some comments about non-algebra sections that I noticed while > going back and forth between the algebra and the examples in the > previous sections: > > 11.2 The example has an unaliased aggregate, but in 11.1 it says that > "It should be noted that as per functions, aggregate expressions must > be aliased in order to project them from queries or subqueries." In > that sentence the must should be marked as RFC 2119 term: > <rfc2119>MUST</rfc2119>. > Furthermore, the variable ?a suddenly disappears in the provided > solution sequence, so probably ?a was meant to be :a (it is not used > anywhere anyway). Ah, thanks. The ?a has to be there, otherwise you'd get permutations of :x and :y, not pairs. The algebra only illustrates the Aggregate part of the query though. [snip] > 18.2.4.1 It divides the solution into *groups of* one or more > solutions, with the same overall cardinality. > > Algorithm: > Let P := the *algebra translation* of the group graph pattern of the > query level > According to the grammar, the GROUP BY clause can alo contain > BuiltInCalls and FunctionCalls, which is not handled here and probably > shouldn't be allowed. I'm not sure I agree that BuiltInCalls and FunctionCalls are an issue, I think they should come out in the Group(exprlist, P) eval phase. > The A_i are never combined into A, so A is consistently the empty set. > Instead of constructing A_i, add the Aggregation(...) to A, or set > A:=A_1, ..., A_i-1 before building P. OK, I took that as read, but it can't hurt to spell it out. > I don't understand whether the outer for loop is indeed iterating over > all expressions in SElECT and each HAVING, each time executing the > inner IF and two FOR loops as the indentation suggests or whether the > second and third for loop run after the first one is done That's what > I would expect). Yes, I guess it could be the other way round - I found it clearer as an operation that loops over the expressions though. > The first and second FOR loop should probably be > extended to handle SELECT * and the SAMPLE replacement is only > required if the variable/variable in the expression is also not > grouped in GROUP BY. I think * has already been expanded by this point in the execution. I believe SAMPLE replacement is always needed as you can't eval() a variable over a group. > It took me a loooong time to understand agg_i. Since aggregates anyway > have to be aliased, wouldn't it be simpler to use the alias straight > away and give that even into the Aggregation(...) objects, so that > proper mappings can be constructed straight away? Even you you don't That was what I tried to do at first, but I think there's a scope issue that prevents it. > want to do that, I suggest to give agg_i as the first parameter into > Aggregation(...) since otherwise it kind of falls from the sky later > on. Also worth noting that each agg_i must be a fresh variable > distinct from all other agg_j and variable occurring in Q. > > I don't see why G is needed in AggregateJoin(A, G) since each A_i > already has G as parameter and G is not needed as a parameter to > AggregateJoin. Yes, agreed. > The example mises a brace before COUNT and the ?x should be ?a. I > suggest to extend the example so that also the changed Q with agg_i > instead f the aggregates is shown and A := (A_1, A_2) would help > understanding (in particular since it is now missing totally I > guess). I did try and write out the changed Q, but it was if anything more confusing as it mixes syntax and algebra. > Definition: ListEval > I suggest to rephrase > ListEval(exprlist, μ) returns a list E, where Ei = > μ(exprlisti) or error. > into > ListEval((expr_1, ..., expr_n), μ) returns a list (e_1, ..., e_n), where e_i = > μ(expr_i) or error. > It took me a long time to understand that E is the list of E_i, which > are to be constructed from the given expressions in the list. Making > the list elements more obvious might help. OK, this is similar to the A / A_i confusion, your version seems clearer. > ListEval((unbound)) = (error), as the evaluation of an unbound > expression is an error. > Here ListEval is missing an argument (μ): > ListEval((unbound), μ) = (error) > > Definition: Group > Uses a "nice" arrow to define mappings for a function. Previously the > spec used -> for that purpose. The proper symbol would actually be > what Latex does with \mapsto, which has a small vertical bar at the > left end of the arrow. I suggest to stick to one notations. e.g., keep > using ascii ->. OK, done. > For the example following the Group Definition, I suggest to first > give the query and it might be more helpful to use a unary aggregate > as all the standard ones defined by SPARQL are unary. Well, without that n-ary aggregate there's no example anywhere that covers why it's a list of lists. > The grouping actually maps to multisets of existing solution mappings, > but here the ?x mapping got lost. It might make sense to label the > rows in the solution sequence table with μ_1 to μ_3, then you can say: > We produce G = Group((?x), Ω) = { (1) -> { μ_1, μ_2 }, (2) -> { μ_3 } } > Note that in the spec the example just uses , whereas the definition > introduces the nice arrow, which is -> elsewhere. Also, the example > looses the ?x mappings, but according to the Group definition, we > should map from the key to the solution complete mappings for that > key. The version in CVS uses (a, b), rather than (a -> b), though it does make the expression somewhat harder to read. Labelling the solution does make it easier to follow though. > I also would like to see the algebra translation since the > algebra transformation for aggregates rewrites the query. As I > understand it, we get > Q = SELECT (?agg_1 AS ?agg) WHERE { ?x ?y ?z } GROUP BY ?x SELECT (agg_1 AS ?agg) WHERE { ?x ?y ?z } GROUP BY ?x but close enough. > and the algebra object: > AggregateJoin(Aggregration((?y, ?z), ex:agg, {}, G), G) > from the algebra translation up to the group by and aggregate steps. > From the select expression translation, I would then get > Extend(AO, ?agg, ?agg_1) > with AO being the so far computed algebra object > AggregateJoin(...). Finally, I get Project(AO, {?agg}) > with AO being Extend(...) > > At this point, it has not yet been defined how Aggregate(...) or > AggregateJoin(...) is evaluated, so I suggest to say that you come > back to the example later instead of evaluating Aggregate (but not > AggregateJoin) although the reader doesn't yet know how that is > defined. Yes, I ran into that problem a few times. > Definition: Flatten > I suggest to use lower case characters for the actual elements of a > (multi)set and upper case characters for (multi)sets. > The Flatten(M) function takes a multiset M = { L_1, ..., L_n } of > lists and returns { l | L in M and l in L }. OK, yes, that's clearer. > or > The Flatten(M) function takes a multiset M = { L_1, ..., L_n } of > lists and returns the multiset union of all multisets M_i = { l | l in > L_i } for all i with 1 <= i <= n. > > Count is a SPARQL set function which counts the number of times a > given expression has a bound, and non-error value with*in* the > aggregate group. > > Why is COUNT the only function getting also an xsd:integer err as > input? The function is unary everywhere else and err is not used in > the definition. Ah, that's left over form a previous version. > All the later aggregate definitions overload the function by defining > it first for a multiset, which internally defines the same function > for a list. It can only be assumed that S ind Sum(S) is then a > list. It might be clearer to say: > Sum(multiset M) = SumL(ToList(Flatten(M))) > SumL(list L) = op:numeric-add(...) S is a sequence, for compatibility with XPath F&O. But I'm not sure we exploit that anymore. I'm not sure about typing the arguments, but I'll try it tomorrow, and see if it's clearer. > Also here |S| is used to denote the cardinality, but that hasn't been Fixed the || v's card[]. > defined. For (multi)sets card[M] is used. Also S_0 and S_1...n is used > to refer to the first (index 0) element of the list and the sublist > containing the second to last element, which should then n-1 if you > start from 0. This should be clarified or you just define I've changed it to 1-based. > SumL( (s_1, ..., s_n) ) = ... n > 1 > SumL( (s_1) ) = ... and > SumL( () ) = 0 > I suggest to put the example outside of the definition and it seems to > be wrong. It should be: > Sum({1,2,3})=op:numeric-add(1, op:numeric-add(2, op:numeric-add(3, > 0))). > > For Min(M) I also suggest to use MinL internally, e.g., > Min(M) = MinL( ToList(M) ) > MinL( (l_1, ..., l_n) ) orders (l_1, ..., l_n) according to ORDER BY > ASC and returns the first element of the ordered sequence. > Same for Max. Yes, I think that will be clearer, but I've not made the change yet. > Definition: Sample > I suggest to say already in the definition that v is *an arbitrary* > element from Flatten(M). I think this is sufficiently clear, but if others agree I'll add another para about it. > Definition: ToMultiSet > ... The order and any duplicates in the sequence have no effect on the > resulting multiset. > It's not clear what it means for duplicates to have no effect. Do they > occur only once in the resulting multiset, i.e., only the first > element has an effect in that it contributes a value whereas the second > or more occurrences don't contribute? Or does it means duplicates get > no special treatment and each one contributes an element? How about > saying that duplicates in the list are (not) preserved in the > resulting multiset. An example might also help. OK. > The definition for the evaluation of Group(...) (from aggregats) is missing. ... I ran out of time at this point, but I'll come back to it on Thursday ... > Definition: Evaluation of Aggregation > Aggregation applies a set function “func” to a multiset of lists of > expressions and a grouped solution sequence, G as produced by the Group > function. It produces a single value for each key and partition for that key > (key, X). > should be > Aggregation applies a set function “func” to a multiset of lists of > expressions and a grouped solution sequence, ***P*** (G is used in > D(G) for the active graph) as produced by the Group function. It > produces a single value for each key and partition for that key (key > ****->***** X). > M should be defined as: > M = { L | L = ListEval(exprlist, μ) for μ in ran(g) } > or > M = { ListEval(exprlist, μ) | μ in range(g) } > since ran(g) is a set of solution mappings, whereas ListEval is > defined for an expression list and one solution mapping. Also > range(function) has not been defined before although it is pretty > obvious, but for the domain of a solution there is a special > definition. > > Special Case: ... will be *the* cardinality of ... > > Can I actually use COUNT(DISTINCT *)? That doesn't seem to make much > sense to me. > > Note that the purpose of the expression card[range(g)] - card[M] is to > indicate to the set function the number of expressions that evaluated > to an error. > card[range(g)] is only used in COUNT(*), how does that relate to error > indication in general? > > For example the aggregate expression GROUP_BY(?x ; separator="|") > should be GROUP_CONCAT > > > Definition: Evaluation of AggregateJoin > ...eval(D(G), AggregateJoin(A, P) ***)*** > missing brace > > I would much prefer (as said above) if the temporary variables agg_i > were given as parameter to each A_i = Aggregation(...) since otherwise > the algebra object doesn't contain all the information required to > evaluate it. I have to know which variable name was used for each A_i, > since I can hardly assume it is really agg_i given that agg_i is not a > forbidden variable name and an evil user might use it within a > query. > > I also assume the definition is wrong, I want > eval(D(G), AggregateJoin(A, P))={ (agg_1, v_1), ..., (agg_n, v_n) | > v_i such that ( k, v_i ) in eval(D(G), A_i) for some k and each 1 <= i > <= n } > > Definition: Evaluation of Extend > eval(D(G), extend(var, expr, P)) = extend(var, expr , eval(D(G), P)) > should be > eval(D(G), Extend(P, var, expr)) = Extend(eval(D(G), P), var, expr) > Extend gets the algebra object as first parameter, which is then > evaluated into a solution sequence and again first char of Extend > should be upper case. > > I suggest at this point to go back to the aggregation example from > 18.4: > I got > Project(Extend(AJ, ?agg, ?agg_1), {?agg}) > with > AJ=AggregateJoin(Aggregration((?y, ?z), ex:agg, {}, G), G) > and > G = { (1) -> { μ_1, μ_2 }, (2) -> { μ_3 } } > Aggregration((?y, ?z), ex:agg, {}, G) gives > g_1 : (1) -> { μ_1, μ_2 } > g_2 : (2) -> { μ_3 } > so we get > { > ( (1), ex:agg( { (2, 3) , (3, 4) }, {} ) ), > ( (2), ex:agg( { (5, 6) }, {}) ) > } > Lets for simplicity assume ex:agg is actually SUM and the query had > SUM(?y), then I would get > { > ( (1), Sum( { (2) , (3) } ) ), > ( (2), Sum( { (5) } ) ) > } > = { ( (1), 5 ), ( (2), 5 ) } > From the AggregateJoin we then get > { { (agg_1, 5) }, { (agg_1, 5) } } > where each set in the multiset is one solution mapping. > > Extend(...) then gives > { { (agg_1, 5), (agg, 5) }, { (agg_1, 5), (agg, 5) } } > although I am not sure where it says that eval(agg_1) gives 5. The > link for expression links to Section 17 and I can't see that just > using a variable is fine... > > Finally, Project(...) gives two solutions both mapping agg to 5. > > > Definition: Evaluation of ZeroLengthPath > I suggest to define Nodes(G) for a graph G above the ZeroLengthPath > definition since it is also used elsewhere, e.g., in the evaluation of > OneOrMorePath. > The notion xxx:type is used here for the first time and it might b > worth saying something like: For a term (or variable) x, x:t > denotes that x is of type t. > term in nodes(G) should be term in nodes(D(G)) > card[ ] is not clear > is the last card[ ] not 0 assuming it denotes the cardinality for each > mapping in the solution? > > I can't fully understand the function ALP. I guess eval(x, path) needs > D(G) and then does something like Bgp(x path ?y) evaluated over > D(G). The path here is a normal predicate I think and it might help to > make that clear. > > Definition: Evaluation of ZeroOrMorePath > Should produce multisets of solution mappings, i.e., > { { (vy, n) } | n in ALP(x, path) } > (note extra curly brackets) and not > { (vy, n) | n in ALP(x, path) } > same for all eval defs in the definition > > Definition: Evaluation of OneOrMorePath > eval(D(G), OneOrMorePath(X, path, Y) > is missing the final closing brace > > still has > @@Change to algorithmic form?? > > Definition: Evaluation of NegatedPropertySet > where X and Y *are* (not for) variables or RDF terms > In this definition and the remaining ones, D(G) is required, but only > D is written > > The second > Definition: Evaluation of ToList > should be > Definition: Evaluation of Subquery > > 18.6 > The overall SPARQL design can be used for queries which assume a more > elaborate form of entailment than simple entailment, by re-writing the > matching conditions for basic graph patterns. > > This is no longer true due to PPEs. I am not happy about this at all > and I assumed that PPEs are optional features. If they are not, it is > quite unfortunate that the so far existing extension point no longer > really is one and something has to be done at least to clarify this! > > > -- > Dr. Birte Glimm, Room 309 > Computing Laboratory > Parks Road > Oxford > OX1 3QD > United Kingdom > +44 (0)1865 283520 -- Steve Harris, CTO, Garlik Limited 1-3 Halford Road, Richmond, TW10 6AW, UK +44 20 8439 8203 http://www.garlik.com/ Registered in England and Wales 535 7233 VAT # 849 0517 11 Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
Received on Tuesday, 15 March 2011 23:39:40 UTC