Re: Review SPARQL Query 1.1, Section 18 (algebra) from Birte Glimm on 2011-03-16 (public-rdf-dawg@w3.org from January to March 2011)

From: Birte Glimm <birte.glimm@comlab.ox.ac.uk>
Date: Wed, 16 Mar 2011 15:24:42 +0000
To: Steve Harris <steve.harris@garlik.com>
Cc: SPARQL Working Group <public-rdf-dawg@w3.org>, Andy Seaborne <andy.seaborne@epimorphics.com>
Message-ID: <AANLkTinMRDU1uYrC_=R=k8Kp00Y2kNJPKXt3C+WDtAYJ@mail.gmail.com>
On 15 March 2011 23:39, Steve Harris <steve.harris@garlik.com> wrote:
> On 2011-03-15, at 00:34, Birte Glimm wrote:
[snip]

[11.2 ]
The ?a has to be there, otherwise you'd get permutations of :x and :y,
not pairs. The algebra only illustrates the Aggregate part of the
query though.

True. I overlooked that. I still don't like that you give a query and
a solution sequence (for the BGP) that not really is one. I would much
prefer to have mappings for ?a in there. You could again label the
solutions with μ_1 to μ_3 and use those labels in the algebra:
S=(μ_1, μ_2, μ_3), with μ_1 = {?a -> ..., ?x->2, ?y->3}, μ_2={?a->...,
?x->2, ?y->5}, μ_3= {?a->...,?x->6, ?y->7}, we might wish to group the
solutions according to the value of ?x, and calculate the average of
the values of ?y for each group.
This could be written as:
SELECT (AVG(?y) AS ?avg)
WHERE {
  ?a :x ?x ;
     :y ?y .
}
GROUP BY ?x
The GROUP BY function, Group((?x), S) gives G = {
  (2) -> ( μ_1, μ_2 ),
  (6) -> ( μ_3 )
}

[snip]

>> According to the grammar, the GROUP BY clause can alo contain
>> BuiltInCalls and FunctionCalls, which is not handled here and probably
>> shouldn't be allowed.
>
> I'm not sure I agree that BuiltInCalls and FunctionCalls are an issue, I think they should come out in the Group(exprlist, P) eval phase.

I think I might just have been confused by μ(expr), which should
probbaly be expr(μ) as in filters etc to denote that you evaluate expr
after appying μ to it.
Here's how I originally imagined what happens:
For example,
SELECT (MIN(?y) AS ?min)
WHERE { ?x <p> ?y } GROUP BY (STRLEN(STR(?x)))
gives:
Group(STRLEN(STR(?x)), Bgp(...))
For Group(..) we have:
Group(exprlist, Ω) = { ListEval(exprlist, μ) -> { μ' | μ' in Ω,
ListEval(exprlist, μ) = ListEval(exprlist, μ') } | μ in Ω }
so we need ListEval:
ListEval((expr1, ..., exprn), μ) returns a list (e1, ..., en), where
ei = μ(exprlisti) or error.
[Note that exprlisti atthe end should be expri]
which is
ListEval((STRLEN(STR(?x))), μ) for each μ, i.e, e1 = μ(STRLEN(STR(?x)))

So at this point I wonder what is supposed to happen and I also might
just have been confused by how expression evaluation is normally
written, e.g., for evaluating a filter, you evaluate expr(μ) meaning
that you substitute variables in expr with the values in μ and then
you evaluate. That's not precisely defined, but can be concluded from
the text.
If the same is supposed to happen for ListEval, it should be
ListEval((expr1, ..., exprn), μ) returns a list (e1, ..., en), where
ei = expri(μ) or error.
I.e., you'd have to swap μ and expri. Then we would have:
e1 = STRLEN(STR(?x))(μ), which would indeed order by the length of the
string that μ has for ?x.
For normal variables one would still want μ(expri) though.


>> I don't understand whether the outer for loop is indeed iterating over
>> all expressions in SElECT and each HAVING, each time executing the
>> inner IF and two FOR loops as the indentation suggests or whether the
>> second and third for loop run after the first one is done That's what
>> I would expect).
>
> Yes, I guess it could be the other way round - I found it clearer as an operation that loops over the expressions though.

I don't mind either way, but probably I do too much Java programming
where without any delimiters for the for loop, the loop just applies
to the next instructions. Since you also don't use any kind of "End
For", one can only guess from the indentation what is in scope for the
loop.

>>                   The first and second FOR loop should probably be
>> extended to handle SELECT * and the SAMPLE replacement is only
>> required if the variable/variable in the expression is also not
>> grouped in GROUP BY.
>
> I think * has already been expanded by this point in the execution.

18.2.4 says the order is:
# Grouping
# Aggregates
# Select expressions
# Having
And Grouping (as the first step) doesn't handle *, tha is only done
later in Select Expressions, where all in-scope variables are
collected for the projection without being sampled.

> I believe SAMPLE replacement is always needed as you can't eval() a variable over a group.
>
>> It took me a loooong time to understand agg_i. Since aggregates anyway
>> have to be aliased, wouldn't it be simpler to use the alias straight
>> away and give that even into the Aggregation(...) objects, so that
>> proper mappings can be constructed straight away? Even you you don't
>
> That was what I tried to do at first, but I think there's a scope issue that prevents it.

Hm, not sure I understand what the issue would be, but you probably
know it better than I do.

>> want to do that, I suggest to give agg_i as the first parameter into
>> Aggregation(...) since otherwise it kind of falls from the sky later
>> on. Also worth noting that each agg_i must be a fresh variable
>> distinct from all other agg_j and variable occurring in Q.
>>
>> I don't see why G is needed in AggregateJoin(A, G) since each A_i
>> already has G as parameter and G is not needed as a parameter to
>> AggregateJoin.
>
> Yes, agreed.
>
>> The example mises a brace before COUNT and the ?x should be ?a. I
>> suggest to extend the example so that also the changed Q with agg_i
>> instead f the aggregates is shown and A := (A_1, A_2) would help
>> understanding (in particular since it is now missing totally I
>> guess).
>
> I did try and write out the changed Q, but it was if anything more confusing as it mixes syntax and algebra.

Could be. I just was quite confused at that point already due to the A
/ A_i and agg_i confusion, so maybe when that is clarified, it already
gets easier to follow.

>> For the example following the Group Definition, I suggest to first
>> give the query and it might be more helpful to use a unary aggregate
>> as all the standard ones defined by SPARQL are unary.
>
> Well, without that n-ary aggregate there's no example anywhere that covers why it's a list of lists.

Ok, I see your point. I nevertheless would like to see one complete
aggregation example including the complete evaluation, which only
really seems sensible for the standard aggregates such as MIN or SUM.
Maybe we could have one complete example plus the n-ary one as it is?

>> Definition: Sample
>> I suggest to say already in the definition that v is *an arbitrary*
>> element from Flatten(M).
>
> I think this is sufficiently clear, but if others agree I'll add another para about it.

It doesn't have to be a paragraph, just *an arbitrary* inserted in the
definition, so that the def is clear without reading the text below.
It's not essential though.

Best regards,
Birte

-- 
Dr. Birte Glimm, Room 309
Computing Laboratory
Parks Road
Oxford
OX1 3QD
United Kingdom
+44 (0)1865 283520
Received on Wednesday, 16 March 2011 15:26:11 UTC