Re: Query Review Part 2 from Andy Seaborne on 2011-12-15 (public-rdf-dawg@w3.org from October to December 2011)

From: Andy Seaborne <andy.seaborne@epimorphics.com>
Date: Thu, 15 Dec 2011 11:30:56 +0000
To: Steve Harris <steve.harris@garlik.com>, birte.glimm@uni-ulm.de
CC: SPARQL Working Group <public-rdf-dawg@w3.org>
Message-ID: <4EE9DA70.4080906@epimorphics.com>
One @@Steve.

Unrelated:

I also changed:

-----------------
Avg(M) = 0, where Count(M) = 0
   ==>
Avg(M) = "0"^^xsd:integer, where Count(M) = 0
-----------------

On 14/12/11 12:08, Steve Harris wrote:
> Thanks for the review Birte, comments inline.
>
> On 2011-12-13, at 19:28, Birte Glimm wrote:
>
>> Hi Andy, Steve, all,
>>
>> here comes the second part of my Query review for Sec. 18 onwards.
>>
>> Best regards,
>>
>> Birte
>>
>> 18.2.2.2 uses translate(...) whereas all later transformation
>> algorithms use Transform(...) is that intended or should this rather
>> be unified?

Changed to "Translate" (that's the section naming usage) in pseudocode 
blocks.

>> 18.2.4.1 Grouping and Aggregation, currently says:
>> For each (X AS Var) in SELECT, each HAVING(X), and each ORDER BY X in Q
>>   If X contains an unaggregated variable V
>>       Replace V with Sample(V)
>>       End
>> I wonder whether we should replace the If statement with a For loop,
>> to better handle the cse where we have multiple unaggregated variables
>> in X, e.g., ((?x + ?y) AS ?z) with ?x and ?y unaggregated requires
>> sampling them both:
>> For each (X AS Var) in SELECT, each HAVING(X), and each ORDER BY X in Q
>>   For each unaggregated variable V in X
>>       Replace V with Sample(V)
>>       End
>
> Yes, agreed, that's much clearer. Fixed.
>
>> 18.2.4.4: Note, E s a list of pairs of the form (variable,
>> expression), defined in section 18.2.4
>> E *i*s a list…
>
> Fixed.
>
>> 18.4
>> This might not come as a surprise, but I am (and always was) unhappy
>> about the exist filters. In particular, we have in 18.4 a definition
>> of Filter that has as third parameter the active graph, but this is
>> not used in the definition. Instead, there is a kind of footnote
>> pointing to the evaluation semantics for exists(pattern) filters.
>> Either we have a function Filter that is defined for all SPARQL
>> filters or we don't. I suppose we should have such a function. The
>> problem is, that at this point the evaluation semantics can no longer
>> be separated from the algebra functions. If we were honest here, we
>> would define Filter somehow like this:
>> Definition: Filter
>> Let Ω be a multiset of solution mappings, expr an expression, and G an
>> RDF graph.
>> Filter(expr, Ω, G) = { μ | μ in Ω and expr(μ) is an expression that
>> has an effective boolean value of true given that each subexpression
>> of expr of the form exists(A) evaluates to true if and only if
>> eval(D(G), substitute(A, μ)) is a non-empty sequence }

We could define things in that style.  At this stage, I'm inclined to 
leave the doc as-is unless there is a specific problem.  A change in 
style would have to be propagated more widely than just here.

(more than this would be editorial so we can come back to it post-LC2)

http://www.w3.org/2009/sparql/wiki/PostLastCall2

>> There is no problem with using substitute here, which I would anyway
>> move to 18.4, but having to delegate to eval here is not nice. I don't
>> see a solution for this, but prefer a more honest definition here that
>> shows that we now mix algebra functions with evaluation semantics.
>> If we change to the above definition, substitute moves to 18.4 and
>> Definition: Evaluation of Exists is obsolete since it is covered by
>> the filter definition.
>>
>> Definition: Extend
>> ...
>> Extend(Ω , var, term) = { Extend(μ, var, term) | μ in Ω }
>> I assume term should be expr as in the lines before. If, for some
>> reason beyond my understanding, term is meant, it should be defined
>> above (as mu, Omega, var and expr), I guess as an RDF term.

Something has to carry the expression from the translation stage to the 
evaluation.  We can't say Extend(?? , var, term) because the expression 
itself would be lost.

I have changed:  term => expr in definition of Extend(Ω, ...)

(see below for more on expr(μ))

(IF I were redoing the definition of SPARQL, given the new material 
we've added, especially aggregates, I'd consider making the whole thing 
explicitly based on symbol manipulation but it was too late to do that 
at the start of this WG before the new material arrived.)

>> 18.4.1
>> Note that in 18.2.4.1, we get Group((1), P) if the query has an
>> aggregate, but no GROUP BY clause. Thus, Group((1), P) should produce
>> something meaningful in 18.4.1, but we define
>> Group(exprlist, Ω) = { ListEval(exprlist, μ) ->  { μ' | μ' in Ω,
>> ListEval(exprlist, μ) = ListEval(exprlist, μ') } | μ in Ω }
>> which is
>> Group((1), Ω) = { ListEval((1), μ) ->  { μ' | μ' in Ω, ListEval((1), μ)
>> = ListEval((1), μ') } | μ in Ω }
>> with Ω the solutions for P and
>> ListEvalListEval((expr1, ..., exprn), μ) returns a list (e1, ..., en),
>> where ei = μ(expri) or error.
>> which is
>> ListEvalListEval((1), μ) returns a list (e1), where e1 = μ(1) or error.
>> What is μ(1) here? I don't think implicit grouping is mean to work
>> based on errors. All in all, I don't understand how implicit grouping
>> is supposed to work.
>
> Yes, I see. I'm not particularly happy with the notation of μ(expr) used to denote evaluating expr w.r.t. μ. Previously I used •, as in μ • expr but that didn't work for some people.
>
> Note, this isn't specifically an issue for implicit grouping, it applies on any grouping with only constants, though those would be odd use cases, e.g. GROUP BY 23, or GROUP BY NOW() will end up in the same place.
>
> I don't really have a sensible suggestion for how to fix this, the best I can think of is some explanatory text saying what μ(expr) is trying to express.

@@Steve

μ is a partial mapping of variables to values.

μ(expr) would be the a substitution of the variables for the terms given 
by μ.  But not the value of the expression.

So with μ being ?x=1 and expr being ?x+2 then μ(expr) is the expression 
"1+2" but nothing causes a call to "+" to turn that into 3 (a value).

If you want the value of expr_i given μ, then you need something like:
    expr_i(μ)

which is used elsewhere (from 1.0 days).

I have explicitly stated that "expr(μ)" is the value of the expr given 
that solution mapping μ in "18.1.8 Solution Mapping"

>> Definition: Aggregation
>> Aggregation, a function which calculates a scalar value as an output
>> of the aggregate expression in the SELECT clause, in the HAVING
>> evaluation process, and in ORDER BY where required is used to
>> calculate aggregated values over groups of solutions.
>> I don't understand "where required is used to calculate aggregated
>> values over groups of solutions."
>
> Yes, that's messed up grammar. Maybe:
>
> "Aggregation, a function which calculates a scalar value as an
> output
> of the aggregate expression. It is used in the SELECT clause, the HAVING
> evaluation process, and in ORDER BY (where required). Aggregation
> calculates aggregated values over groups of solutions, using set functions."
>
>> The main part of the definition is still in 18.5 and, to be consistent
>> with the rest, should be moved to 18.4.1. For example, in 18.4.1, you
>> could have the def. from 18.5 with slight modifications:
>
> OK, I was scared of moving the definition from evaluation of… to def'n of… incase it altered the semantics.
>
>> Definition: Aggregation
>> Let exprlist be a list of expressions or *, func an aggregate,
>> scalarvals a set of partial functions (possibly empty) passed from the
>> aggregate in the query, and let { key_1->Omega_1, ..., key_m->Omega_m
>> } be a multiset of partial functions from keys to solution sequences
>> as produced by the grouping step. Aggregation applies the set function
>> func to the given multiset and produces a single value for each key
>> and partition of solutions for that key.
>> Aggregation( exprlist, func, scalarvals, { key_1->Omega_1, ...,
>> key_m->Omega_m } )
>> = { (dom(g), F) | g in { key_1->Omega_1, ..., key_m->Omega_m } }
>> where
>>    M = { ListEval(exprlist, μ) | μ in range(g) }
>>    F = func(M, scalarvals), for non-DISTINCT
>>    F = func(Distinct(M), scalarvals), for DISTINCT
>> Special Case: when COUNT is used with the expression * the value of F
>> will be the cardinality of the group solution sequence,
>> card[range(g)], or card[Distinct(range(g))] if the DISTINCT keyword is
>> present.
>
> Thanks, I have made that edit, can you check I lost nothing in translation? I changed "func an aggregate" to "func a set function", which I believe is correct at this point in the process.
>
>> I would take the following outside of the definition:
>> "scalarvals" are used to pass values to the underlying set function,
>> bypassing the mechanics of the grouping. For example, the aggregate
>> expression GROUP_CONCAT(?x ; separator="|") has a scalarvals argument
>> of { "separator" ->  "|" }. All aggregates may have the DISTINCT
>> keyword as the first token in their argument list. If this keyword is
>> present then first argument to func is Distinct(M).
>>
>> Note that I have replaced eval(D(G), P) with a more explicit data
>> structure. We can, anyway, not work with eval(...) in 18.4, but have
>> to define the function for a certain given data structure. I would
>> even suggest to go further and define this more explicitly (using keys
>> and the solution sets Omega).
>> Aggregation( exprlist, func, scalarvals, { key_1->Omega_1, ...,
>> key_m->Omega_m } )
>> = { (key, F(Omega)) | key->Omega in { key_1->Omega_1, ..., key_m->Omega_m } }
>> where
>>    M(Omega) = { ListEval(exprlist, μ) | μ in Omega }
>>    F(Omega) = func(M(Omega), scalarvals), for non-DISTINCT
>>    F(Omega) = func(Distinct(M(Omega)), scalarvals), for DISTINCT
>
> Yes, that is clearer, changed.
>
>> I always found the g quite confusing in the current definition.
>> In 18.5 we then just have:
>> Definition: Evaluation of Aggregation
>> eval(D(G), Aggregation(exprlist, func, scalarvals, P)) =
>> Aggregation(exprlist, func, scalarvals, eval(D(G), P))
>>
>> I would also fully define AggregateJoin in 18.4.1, i.e.:
>> Definition: AggregateJoin
>> Let S_1, ..., S_n be a list of sets, where each set S_i contains key
>> to (aggregated) value maps as produced by Aggregate. Let K = { key |
>> key in dom(S_j) for some 1<= j<= n } be the set of keys, then
>> AggregateJoin(S_1, ..., S_n) = { agg_1->val_1, ..., agg_n->val_n | key
>> in K and key->val_i in S_i for each 1<= i<= n }
>>
>> I also define the set of keys beforehand since I find it easier to see
>> that each constructed mappings results from one key.
>>
>> In 18.5 you can then define
>> Definition: Evaluation of AggregateJoin
>> eval(D(G), AggregateJoin(A_1, ..., A_n)) = AggregateJoin(eval(D(G),
>> A_1), ..., eval(D(G), A_n))
>>
>> This way, aggregation is hadled as all the other functions, i.e.,
>> define an abstract function to which eval delegates.
>
> Yes, that's better, thanks.
>
>> 18.4.1.1 Set Functions
>> The only use of this that's supported by the built-in aggregates
>> I would spell out that is instead of that's
>
> Done.
>
> Thanks,
>     Steve
>
>> I find the definition names in 18.5 not very uniform. We have (picking some):
>> Definition: Evaluation of Filter(F, P)
>> Definition: Evaluation of Exists
>> Definition: Evaluation of LeftJoin(P1, P2, F)
>> Definition: Evaluation of a Union Pattern
>>
>> Why not Evaluation of a Union(P1, P2) or just Evaluation of LeftJoin?

Argument removed from title.
Made the naming more systematic.

>> 18.6
>> As Axel pointed out that the first condition for ent. regimes should
>> use ***E-***consistent and we also extended this part to say that the
>> scoping graph is uniquely specified *up to RDF graph equivalence*.
>> Thus Axel suggest to use in both Query and Ent. Regimes:
>> 1. The scoping graph, SG, corresponding to any E-consistent active
>> graph AG is uniquely specified up to<a
>> href="http://www.w3.org/TR/rdf-concepts/#section-graph-equality">RDF
>> graph equivalence</a>  and is E-equivalent to AG.
>> I am happy to use this in Ent. Regimes.

I have already made the first change Axel suggested ("consistent" -> 
"E-consistent").

If closer synchronisation of the two texts is needed, it's "editorial".

 Andy

>>
>>
>>
>> --
>> Jun. Prof. Dr. Birte Glimm            Tel.:    +49 731 50 24125
>> Inst. of Artificial Intelligence         Secr:  +49 731 50 24258
>> University of Ulm                         Fax:   +49 731 50 24188
>> D-89069 Ulm                               birte.glimm@uni-ulm.de
>> Germany
>>
>
Received on Thursday, 15 December 2011 11:31:36 UTC