Re: Query Review Part 2 from Steve Harris on 2011-12-14 (public-rdf-dawg@w3.org from October to December 2011)

From: Steve Harris <steve.harris@garlik.com>
Date: Wed, 14 Dec 2011 12:08:06 +0000
To: birte.glimm@uni-ulm.de
Cc: Andy Seaborne <andy.seaborne@epimorphics.com>, SPARQL Working Group <public-rdf-dawg@w3.org>
Message-Id: <F16D0A3D-595C-432B-8B05-AAC9ACF54CF1@garlik.com>
Thanks for the review Birte, comments inline.

On 2011-12-13, at 19:28, Birte Glimm wrote:

> Hi Andy, Steve, all,
> 
> here comes the second part of my Query review for Sec. 18 onwards.
> 
> Best regards,
> 
> Birte
> 
> 18.2.2.2 uses translate(...) whereas all later transformation
> algorithms use Transform(...) is that intended or should this rather
> be unified?
> 
> 18.2.4.1 Grouping and Aggregation, currently says:
> For each (X AS Var) in SELECT, each HAVING(X), and each ORDER BY X in Q
>  If X contains an unaggregated variable V
>      Replace V with Sample(V)
>      End
> I wonder whether we should replace the If statement with a For loop,
> to better handle the cse where we have multiple unaggregated variables
> in X, e.g., ((?x + ?y) AS ?z) with ?x and ?y unaggregated requires
> sampling them both:
> For each (X AS Var) in SELECT, each HAVING(X), and each ORDER BY X in Q
>  For each unaggregated variable V in X
>      Replace V with Sample(V)
>      End

Yes, agreed, that's much clearer. Fixed.

> 18.2.4.4: Note, E s a list of pairs of the form (variable,
> expression), defined in section 18.2.4
> E *i*s a list…

Fixed.

> 18.4
> This might not come as a surprise, but I am (and always was) unhappy
> about the exist filters. In particular, we have in 18.4 a definition
> of Filter that has as third parameter the active graph, but this is
> not used in the definition. Instead, there is a kind of footnote
> pointing to the evaluation semantics for exists(pattern) filters.
> Either we have a function Filter that is defined for all SPARQL
> filters or we don't. I suppose we should have such a function. The
> problem is, that at this point the evaluation semantics can no longer
> be separated from the algebra functions. If we were honest here, we
> would define Filter somehow like this:
> Definition: Filter
> Let Ω be a multiset of solution mappings, expr an expression, and G an
> RDF graph.
> Filter(expr, Ω, G) = { μ | μ in Ω and expr(μ) is an expression that
> has an effective boolean value of true given that each subexpression
> of expr of the form exists(A) evaluates to true if and only if
> eval(D(G), substitute(A, μ)) is a non-empty sequence }
> 
> There is no problem with using substitute here, which I would anyway
> move to 18.4, but having to delegate to eval here is not nice. I don't
> see a solution for this, but prefer a more honest definition here that
> shows that we now mix algebra functions with evaluation semantics.
> If we change to the above definition, substitute moves to 18.4 and
> Definition: Evaluation of Exists is obsolete since it is covered by
> the filter definition.
> 
> Definition: Extend
> ...
> Extend(Ω , var, term) = { Extend(μ, var, term) | μ in Ω }
> I assume term should be expr as in the lines before. If, for some
> reason beyond my understanding, term is meant, it should be defined
> above (as mu, Omega, var and expr), I guess as an RDF term.
> 
> 18.4.1
> Note that in 18.2.4.1, we get Group((1), P) if the query has an
> aggregate, but no GROUP BY clause. Thus, Group((1), P) should produce
> something meaningful in 18.4.1, but we define
> Group(exprlist, Ω) = { ListEval(exprlist, μ) -> { μ' | μ' in Ω,
> ListEval(exprlist, μ) = ListEval(exprlist, μ') } | μ in Ω }
> which is
> Group((1), Ω) = { ListEval((1), μ) -> { μ' | μ' in Ω, ListEval((1), μ)
> = ListEval((1), μ') } | μ in Ω }
> with Ω the solutions for P and
> ListEvalListEval((expr1, ..., exprn), μ) returns a list (e1, ..., en),
> where ei = μ(expri) or error.
> which is
> ListEvalListEval((1), μ) returns a list (e1), where e1 = μ(1) or error.
> What is μ(1) here? I don't think implicit grouping is mean to work
> based on errors. All in all, I don't understand how implicit grouping
> is supposed to work.

Yes, I see. I'm not particularly happy with the notation of μ(expr) used to denote evaluating expr w.r.t. μ. Previously I used •, as in μ • expr but that didn't work for some people.

Note, this isn't specifically an issue for implicit grouping, it applies on any grouping with only constants, though those would be odd use cases, e.g. GROUP BY 23, or GROUP BY NOW() will end up in the same place.

I don't really have a sensible suggestion for how to fix this, the best I can think of is some explanatory text saying what μ(expr) is trying to express.

> Definition: Aggregation
> Aggregation, a function which calculates a scalar value as an output
> of the aggregate expression in the SELECT clause, in the HAVING
> evaluation process, and in ORDER BY where required is used to
> calculate aggregated values over groups of solutions.
> I don't understand "where required is used to calculate aggregated
> values over groups of solutions."

Yes, that's messed up grammar. Maybe:

"Aggregation, a function which calculates a scalar value as an output of the aggregate expression. It is used  in the SELECT clause, the HAVING evaluation process, and in ORDER BY (where required). Aggregation calculates aggregated values over groups of solutions, using set functions."

> The main part of the definition is still in 18.5 and, to be consistent
> with the rest, should be moved to 18.4.1. For example, in 18.4.1, you
> could have the def. from 18.5 with slight modifications:

OK, I was scared of moving the definition from evaluation of… to def'n of… incase it altered the semantics.

> Definition: Aggregation
> Let exprlist be a list of expressions or *, func an aggregate,
> scalarvals a set of partial functions (possibly empty) passed from the
> aggregate in the query, and let { key_1->Omega_1, ..., key_m->Omega_m
> } be a multiset of partial functions from keys to solution sequences
> as produced by the grouping step. Aggregation applies the set function
> func to the given multiset and produces a single value for each key
> and partition of solutions for that key.
> Aggregation( exprlist, func, scalarvals, { key_1->Omega_1, ...,
> key_m->Omega_m } )
> = { (dom(g), F) | g in { key_1->Omega_1, ..., key_m->Omega_m } }
> where
>   M = { ListEval(exprlist, μ) | μ in range(g) }
>   F = func(M, scalarvals), for non-DISTINCT
>   F = func(Distinct(M), scalarvals), for DISTINCT
> Special Case: when COUNT is used with the expression * the value of F
> will be the cardinality of the group solution sequence,
> card[range(g)], or card[Distinct(range(g))] if the DISTINCT keyword is
> present.

Thanks, I have made that edit, can you check I lost nothing in translation? I changed "func an aggregate" to "func a set function", which I believe is correct at this point in the process.

> I would take the following outside of the definition:
> "scalarvals" are used to pass values to the underlying set function,
> bypassing the mechanics of the grouping. For example, the aggregate
> expression GROUP_CONCAT(?x ; separator="|") has a scalarvals argument
> of { "separator" -> "|" }. All aggregates may have the DISTINCT
> keyword as the first token in their argument list. If this keyword is
> present then first argument to func is Distinct(M).
> 
> Note that I have replaced eval(D(G), P) with a more explicit data
> structure. We can, anyway, not work with eval(...) in 18.4, but have
> to define the function for a certain given data structure. I would
> even suggest to go further and define this more explicitly (using keys
> and the solution sets Omega).
> Aggregation( exprlist, func, scalarvals, { key_1->Omega_1, ...,
> key_m->Omega_m } )
> = { (key, F(Omega)) | key->Omega in { key_1->Omega_1, ..., key_m->Omega_m } }
> where
>   M(Omega) = { ListEval(exprlist, μ) | μ in Omega }
>   F(Omega) = func(M(Omega), scalarvals), for non-DISTINCT
>   F(Omega) = func(Distinct(M(Omega)), scalarvals), for DISTINCT

Yes, that is clearer, changed.

> I always found the g quite confusing in the current definition.
> In 18.5 we then just have:
> Definition: Evaluation of Aggregation
> eval(D(G), Aggregation(exprlist, func, scalarvals, P)) =
> Aggregation(exprlist, func, scalarvals, eval(D(G), P))
> 
> I would also fully define AggregateJoin in 18.4.1, i.e.:
> Definition: AggregateJoin
> Let S_1, ..., S_n be a list of sets, where each set S_i contains key
> to (aggregated) value maps as produced by Aggregate. Let K = { key |
> key in dom(S_j) for some 1 <= j <= n } be the set of keys, then
> AggregateJoin(S_1, ..., S_n) = { agg_1->val_1, ..., agg_n->val_n | key
> in K and key->val_i in S_i for each 1 <= i <= n }
> 
> I also define the set of keys beforehand since I find it easier to see
> that each constructed mappings results from one key.
> 
> In 18.5 you can then define
> Definition: Evaluation of AggregateJoin
> eval(D(G), AggregateJoin(A_1, ..., A_n)) = AggregateJoin(eval(D(G),
> A_1), ..., eval(D(G), A_n))
> 
> This way, aggregation is hadled as all the other functions, i.e.,
> define an abstract function to which eval delegates.

Yes, that's better, thanks.

> 18.4.1.1 Set Functions
> The only use of this that's supported by the built-in aggregates
> I would spell out that is instead of that's

Done.

Thanks,
   Steve

> I find the definition names in 18.5 not very uniform. We have (picking some):
> Definition: Evaluation of Filter(F, P)
> Definition: Evaluation of Exists
> Definition: Evaluation of LeftJoin(P1, P2, F)
> Definition: Evaluation of a Union Pattern
> 
> Why not Evaluation of a Union(P1, P2) or just Evaluation of LeftJoin?
> 
> 18.6
> As Axel pointed out that the first condition for ent. regimes should
> use ***E-***consistent and we also extended this part to say that the
> scoping graph is uniquely specified *up to RDF graph equivalence*.
> Thus Axel suggest to use in both Query and Ent. Regimes:
> 1. The scoping graph, SG, corresponding to any E-consistent active
> graph AG is uniquely specified up to <a
> href="http://www.w3.org/TR/rdf-concepts/#section-graph-equality">RDF
> graph equivalence</a> and is E-equivalent to AG.
> I am happy to use this in Ent. Regimes.
> 
> 
> 
> -- 
> Jun. Prof. Dr. Birte Glimm            Tel.:    +49 731 50 24125
> Inst. of Artificial Intelligence         Secr:  +49 731 50 24258
> University of Ulm                         Fax:   +49 731 50 24188
> D-89069 Ulm                               birte.glimm@uni-ulm.de
> Germany
> 

-- 
Steve Harris, CTO, Garlik Limited
1-3 Halford Road, Richmond, TW10 6AW, UK
+44 20 8439 8203  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
Received on Wednesday, 14 December 2011 12:08:37 UTC