Query Review Part 2 from Birte Glimm on 2011-12-13 (public-rdf-dawg@w3.org from October to December 2011)

From: Birte Glimm <birte.glimm@uni-ulm.de>
Date: Tue, 13 Dec 2011 20:28:57 +0100
To: Andy Seaborne <andy.seaborne@epimorphics.com>, Steve Harris <steve.harris@garlik.com>, SPARQL Working Group <public-rdf-dawg@w3.org>
Message-ID: <CABt65OdXkkQ0XmzR86F=HP-sFrFL8aYu9WRhN3TQo=ty-zCH=g@mail.gmail.com>
Hi Andy, Steve, all,

here comes the second part of my Query review for Sec. 18 onwards.

Best regards,

Birte

18.2.2.2 uses translate(...) whereas all later transformation
algorithms use Transform(...) is that intended or should this rather
be unified?

18.2.4.1 Grouping and Aggregation, currently says:
For each (X AS Var) in SELECT, each HAVING(X), and each ORDER BY X in Q
  If X contains an unaggregated variable V
      Replace V with Sample(V)
      End
I wonder whether we should replace the If statement with a For loop,
to better handle the cse where we have multiple unaggregated variables
in X, e.g., ((?x + ?y) AS ?z) with ?x and ?y unaggregated requires
sampling them both:
For each (X AS Var) in SELECT, each HAVING(X), and each ORDER BY X in Q
  For each unaggregated variable V in X
      Replace V with Sample(V)
      End

18.2.4.4: Note, E s a list of pairs of the form (variable,
expression), defined in section 18.2.4
E *i*s a list...


18.4
This might not come as a surprise, but I am (and always was) unhappy
about the exist filters. In particular, we have in 18.4 a definition
of Filter that has as third parameter the active graph, but this is
not used in the definition. Instead, there is a kind of footnote
pointing to the evaluation semantics for exists(pattern) filters.
Either we have a function Filter that is defined for all SPARQL
filters or we don't. I suppose we should have such a function. The
problem is, that at this point the evaluation semantics can no longer
be separated from the algebra functions. If we were honest here, we
would define Filter somehow like this:
Definition: Filter
Let Ω be a multiset of solution mappings, expr an expression, and G an
RDF graph.
Filter(expr, Ω, G) = { μ | μ in Ω and expr(μ) is an expression that
has an effective boolean value of true given that each subexpression
of expr of the form exists(A) evaluates to true if and only if
eval(D(G), substitute(A, μ)) is a non-empty sequence }

There is no problem with using substitute here, which I would anyway
move to 18.4, but having to delegate to eval here is not nice. I don't
see a solution for this, but prefer a more honest definition here that
shows that we now mix algebra functions with evaluation semantics.
If we change to the above definition, substitute moves to 18.4 and
Definition: Evaluation of Exists is obsolete since it is covered by
the filter definition.

Definition: Extend
...
Extend(Ω , var, term) = { Extend(μ, var, term) | μ in Ω }
I assume term should be expr as in the lines before. If, for some
reason beyond my understanding, term is meant, it should be defined
above (as mu, Omega, var and expr), I guess as an RDF term.

18.4.1
Note that in 18.2.4.1, we get Group((1), P) if the query has an
aggregate, but no GROUP BY clause. Thus, Group((1), P) should produce
something meaningful in 18.4.1, but we define
Group(exprlist, Ω) = { ListEval(exprlist, μ) -> { μ' | μ' in Ω,
ListEval(exprlist, μ) = ListEval(exprlist, μ') } | μ in Ω }
which is
Group((1), Ω) = { ListEval((1), μ) -> { μ' | μ' in Ω, ListEval((1), μ)
= ListEval((1), μ') } | μ in Ω }
with Ω the solutions for P and
ListEvalListEval((expr1, ..., exprn), μ) returns a list (e1, ..., en),
where ei = μ(expri) or error.
which is
ListEvalListEval((1), μ) returns a list (e1), where e1 = μ(1) or error.
What is μ(1) here? I don't think implicit grouping is mean to work
based on errors. All in all, I don't understand how implicit grouping
is supposed to work.

Definition: Aggregation
Aggregation, a function which calculates a scalar value as an output
of the aggregate expression in the SELECT clause, in the HAVING
evaluation process, and in ORDER BY where required is used to
calculate aggregated values over groups of solutions.
I don't understand "where required is used to calculate aggregated
values over groups of solutions."
The main part of the definition is still in 18.5 and, to be consistent
with the rest, should be moved to 18.4.1. For example, in 18.4.1, you
could have the def. from 18.5 with slight modifications:
Definition: Aggregation
Let exprlist be a list of expressions or *, func an aggregate,
scalarvals a set of partial functions (possibly empty) passed from the
aggregate in the query, and let { key_1->Omega_1, ..., key_m->Omega_m
} be a multiset of partial functions from keys to solution sequences
as produced by the grouping step. Aggregation applies the set function
func to the given multiset and produces a single value for each key
and partition of solutions for that key.
Aggregation( exprlist, func, scalarvals, { key_1->Omega_1, ...,
key_m->Omega_m } )
= { (dom(g), F) | g in { key_1->Omega_1, ..., key_m->Omega_m } }
where
   M = { ListEval(exprlist, μ) | μ in range(g) }
   F = func(M, scalarvals), for non-DISTINCT
   F = func(Distinct(M), scalarvals), for DISTINCT
Special Case: when COUNT is used with the expression * the value of F
will be the cardinality of the group solution sequence,
card[range(g)], or card[Distinct(range(g))] if the DISTINCT keyword is
present.

I would take the following outside of the definition:
"scalarvals" are used to pass values to the underlying set function,
bypassing the mechanics of the grouping. For example, the aggregate
expression GROUP_CONCAT(?x ; separator="|") has a scalarvals argument
of { "separator" -> "|" }. All aggregates may have the DISTINCT
keyword as the first token in their argument list. If this keyword is
present then first argument to func is Distinct(M).

Note that I have replaced eval(D(G), P) with a more explicit data
structure. We can, anyway, not work with eval(...) in 18.4, but have
to define the function for a certain given data structure. I would
even suggest to go further and define this more explicitly (using keys
and the solution sets Omega).
Aggregation( exprlist, func, scalarvals, { key_1->Omega_1, ...,
key_m->Omega_m } )
= { (key, F(Omega)) | key->Omega in { key_1->Omega_1, ..., key_m->Omega_m } }
where
   M(Omega) = { ListEval(exprlist, μ) | μ in Omega }
   F(Omega) = func(M(Omega), scalarvals), for non-DISTINCT
   F(Omega) = func(Distinct(M(Omega)), scalarvals), for DISTINCT

I always found the g quite confusing in the current definition.
In 18.5 we then just have:
Definition: Evaluation of Aggregation
eval(D(G), Aggregation(exprlist, func, scalarvals, P)) =
Aggregation(exprlist, func, scalarvals, eval(D(G), P))

I would also fully define AggregateJoin in 18.4.1, i.e.:
Definition: AggregateJoin
Let S_1, ..., S_n be a list of sets, where each set S_i contains key
to (aggregated) value maps as produced by Aggregate. Let K = { key |
key in dom(S_j) for some 1 <= j <= n } be the set of keys, then
AggregateJoin(S_1, ..., S_n) = { agg_1->val_1, ..., agg_n->val_n | key
in K and key->val_i in S_i for each 1 <= i <= n }

I also define the set of keys beforehand since I find it easier to see
that each constructed mappings results from one key.

In 18.5 you can then define
Definition: Evaluation of AggregateJoin
eval(D(G), AggregateJoin(A_1, ..., A_n)) = AggregateJoin(eval(D(G),
A_1), ..., eval(D(G), A_n))

This way, aggregation is hadled as all the other functions, i.e.,
define an abstract function to which eval delegates.

18.4.1.1 Set Functions
The only use of this that's supported by the built-in aggregates
I would spell out that is instead of that's

I find the definition names in 18.5 not very uniform. We have (picking some):
Definition: Evaluation of Filter(F, P)
Definition: Evaluation of Exists
Definition: Evaluation of LeftJoin(P1, P2, F)
Definition: Evaluation of a Union Pattern

Why not Evaluation of a Union(P1, P2) or just Evaluation of LeftJoin?

18.6
As Axel pointed out that the first condition for ent. regimes should
use ***E-***consistent and we also extended this part to say that the
scoping graph is uniquely specified *up to RDF graph equivalence*.
Thus Axel suggest to use in both Query and Ent. Regimes:
1. The scoping graph, SG, corresponding to any E-consistent active
graph AG is uniquely specified up to <a
href="http://www.w3.org/TR/rdf-concepts/#section-graph-equality">RDF
graph equivalence</a> and is E-equivalent to AG.
I am happy to use this in Ent. Regimes.



-- 
Jun. Prof. Dr. Birte Glimm            Tel.:    +49 731 50 24125
Inst. of Artificial Intelligence         Secr:  +49 731 50 24258
University of Ulm                         Fax:   +49 731 50 24188
D-89069 Ulm                               birte.glimm@uni-ulm.de
Germany
Received on Tuesday, 13 December 2011 19:32:00 UTC