Review SPARQL Query 1.1, Section 18 (algebra)

Hi Andy, Steve, all,
I finally managed to complete the review of the algebra section. It
took me quite a long time, and I have quite a number of problems with
it and/or suggestions for improvement; see below.
Best regards,
Birte

First some comments about non-algebra sections that I noticed while
going back and forth between the algebra and the examples in the
previous sections:

11.2 The example has an unaliased aggregate, but in 11.1 it says that
"It should be noted that as per functions, aggregate expressions must
be aliased in order to project them from queries or subqueries." In
that sentence the must should be marked as an RFC 2119 term:
<rfc2119>MUST</rfc2119>.
Furthermore, the variable ?a suddenly disappears in the provided
solution sequence, so probably ?a was meant to be :a (it is not used
anywhere anyway).

17.4.1.4 The algebra operations in this section should be moved to
Section 18, where it should be clarified how these algebra operations
are introduced in the translation from the abstract syntax to algebra
objects. I still have a major concern here, and that is really a no-go
issue IMO: the pattern in FILTER [NOT] EXISTS is not translated into
algebra objects. Thus, the Substitute and Exists functions are not
even used yet.
Let's consider the example from 8.3.3:
SELECT * WHERE {
        ?a :p ?n
        FILTER NOT EXISTS {
                ?a :q ?m FILTER(?n = ?m)
        }
}
The query pattern translates to:
Filter(
  fn:not(EXISTS { ?a :q ?m FILTER(?n = ?m) } ),
  Bgp(?a :p ?n)
)
The first BGP is properly translated into the Bgp(...) algebra
operator, whereas the other one isn't. IMO, we should get something like:
Filter(
  fn:not( Exists( Filter(?n = ?m, Bgp(?a :q ?m)) ) ),
  Bgp(?a :p ?n)
)
However, even if we get that, the dataset and the active graph are not
being handed through to the filter evaluation, whereas here the active
graph magically appears in the evaluation of exists. On that note: in
the evaluation of Exists, the active graph is the second argument,
whereas everywhere else it is the first argument of eval, and only
there is it written D[g] instead of D(G). More consistent would be:
exists(pattern, μ) = true if and only if eval(D(G),
substitute(pattern, μ)) has any solutions.
That only leaves the problem that the dataset and the active graph are
not available to the filter expressions.

Even if we properly translate the pattern in the filter, applying the
substitution now has to substitute within algebra objects; the only
other option I see is to do an algebra translation each time EXISTS is
executed. Neither option is clearly specified yet IMO, and at some
point an algebra translation has to occur, since the pattern in the
[NOT] EXISTS filter can be any kind of complex pattern, possibly even
including sub-queries.

I also believe Substitute should be defined such that it only
applies to variables in the domain of μ. E.g.,
substitute(pattern, μ) = the pattern formed by replacing every
occurrence of a variable v in pattern for which μ is defined by μ(v).
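To illustrate (just my reading, not spec text), take the 8.3.3 example
above: if the outer BGP yields μ = { ?a -> :s, ?n -> 1 }, then
substitute({ ?a :q ?m FILTER(?n = ?m) }, μ)
  = { :s :q ?m FILTER(1 = ?m) }
and ?m is left untouched because it is not in dom(μ).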


OK, now to the section actually under review:

In general, there is quite a mix of different pseudo code styles:
IF ... THEN
...
FI
If ... Then
...
End If
If ... Then
...
(no end for the if)
If ... (no then)
...
The use of upper and lower case is totally mixed, and there is even a
mix of styles within one pseudo code section. This should really be
unified.

18.1.1: IRIs are a subset of RDF URI References that omit (no s I
believe) the use of spaces.

18.1.7 Property Path (no s) Patterns
*A* Property Path is a path in *a* graph G if each t_i is a triple of
G.

Definition: Property Path Expression
A property path expression is an expression used to match properties
in a graph formed after translation of the path syntax as defined
<link>above</link>.

The link just links to that very definition. I also assume you mean
below (18.2.2.2)? Even then, this definition is not clear to me. I
would define a Property Path Expression (PPE) inductively as follows:
iri, !(iri_1|...|iri_n), !(^iri_1|...|^iri_n),
!(iri_1|...|iri_i|^iri_i+1|...|^iri_n)
are PPEs and, for P, P_1, and P_2 PPEs:
^P, !P, P_1 | P_2, P_1 / P_2, ...
are PPEs.
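(Just to illustrate the intended induction: under this definition,
!(:p|^:q) is a PPE by the base cases, and ^(:p/:q) is a PPE because
:p/:q is one.)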

Also note that Section 9.1 is called Property Path expressions, but
then the syntax forms are based on path elements. What a PPE really is
only becomes clear from the definition of Property Path Pattern (PPP),
because there one can see that PPEs are what occurs in the predicate
position. Before, I wasn't sure whether a PPE is actually a triple
pattern extended to allow for path elements as predicates. I suggest to
unify the notation and call path elements in 9.1 PPEs. In 18.1.7, I
suggest to first define PPEs properly, listing the possibilities
explicitly, then define Property Path Pattern, and then Property Path
not as a sequence of triples, but as a sequence of PPPs.

18.2 The SPARQL query string is parsed and the abbreviations for IRIs
and triple patterns given in *a* (delete) section 4 are applied.

The table for the syntax tree is missing BINDINGS as a solution
modifier, and Extends should be Extend (no s).

18.2.1 We define a variable to be in-scope if there is a way for a
variable to be in the domain *of* a solution mapping...

In-scope definition:
I suggest to be specific and use "Property Path Pattern (PPP)" and
"v occurs in PPP" for the second entry.
OPTIONAL{P} misses a space: OPTIONAL {P}
Below the table: In SELECT expressions, the variable may *be* use*d*
in an expression later in the same SELECT clause and may not be *be* (duplicate)
assigned again in the same SELECT clause. May and may not could/should
be marked as RFC 2119 terms. I would even tend to say MUST NOT be
assigned again.

18.2.2.2: Notes: ...The order of forms URI and ^URI in negated
property sets is not relevant.
URI should be IRI

The parsing step interprete*s* (not interpreted) triple patterns as
property paths of length 1.

I suggest unifying the notation of the table with the earlier
example section, where you use elt instead of path. Also, :iri in the
table needs no ':', but apart from row 1 all other iris are written
:iri.
In general, I don't believe you can get away with not having a
recursive translation. I suggest to define something like:
Given a PPP X pp Y, we define the translation T(X, pp, Y) as follows,
and then give a table with the left-hand side listing the possible
T(X, pp, Y) and the right-hand side the result, which is recursive:
T(X, iri, Y) = X iri Y
T(X, !(iri_1|...|iri_n), Y) = NPS(X, {iri_1, ..., iri_n}, Y)
T(X, !(^iri_1|...|^iri_n), Y) = T(X, ^(!(iri_1|...|iri_n)), Y)
T(X, !(iri_1|...|iri_i | ^iri_i+1|...|^iri_n), Y) = { T(X,
!(iri_1|...|iri_i), Y) } UNION { T(X, !(^iri_i+1|...|^iri_n), Y) }
T(X, ^path, Y) = T(Y, path, X)
T(X, path1 / path2, Y) = T(X, path1, ?v) . T(?v, path2, Y)
etc
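As a purely illustrative example of how the recursion would unfold
(assuming ?v is a fresh variable):
T(?x, ^:p/:q, ?y)
  = T(?x, ^:p, ?v) . T(?v, :q, ?y)
  = T(?v, :p, ?x) . T(?v, :q, ?y)
  = ?v :p ?x . ?v :q ?y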

18.2.2.4
If E is *of* the form MINUS {P}
IMO, the patterns in [NOT] EXISTS filters must be translated here
too.

18.2.2.5
extra spacing before the second simplification form

18.2.3
Some examples have just one translation whereas others have two. I
suggest to add a sentence explaining this, e.g.:
If just one algebra translation is given, then it is the simplified
one, whereas if two translations are given, the first one is the
non-simplified one and the second one is the simplified one.

Fourth example: group consisting *of* a union of a union and a basic
graph pattern:
Fifth example: extra brace after the second join
Example: Pattern involving BIND, Extend(...) is written all lower case
Minus example: closing brace missing
Subquery example, should be:
Join(
  Bgp(...),
  ToMultiSet(
    Distinct(Project(Bgp(...), {?o}))
  )
)
Currently it is missing the closing brace and the proper translation
of Project.
Note that I use a set here for the project variables, which is what
the Project operation expects, but the algebra translation produces a
list at the moment.

18.2.4.1 It divides the solution into *groups of* one or more
solutions, with the same overall cardinality.

Algorithm:
Let P := the *algebra translation* of the group graph pattern of the
query level
According to the grammar, the GROUP BY clause can also contain
BuiltInCalls and FunctionCalls, which is not handled here and probably
shouldn't be allowed.

The A_i are never combined into A, so A is consistently the empty set.
Instead of constructing A_i, add the Aggregation(...) to A, or set
A := A_1, ..., A_i-1 before building P.
I don't understand whether the outer FOR loop is indeed iterating over
all expressions in SELECT and each HAVING, each time executing the
inner IF and two FOR loops as the indentation suggests, or whether the
second and third FOR loop run after the first one is done (that's what
I would expect). The first and second FOR loop should probably be
extended to handle SELECT *, and the SAMPLE replacement is only
required if the variable (or the variables in the expression) is not
also grouped in GROUP BY.
It took me a loooong time to understand agg_i. Since aggregates have
to be aliased anyway, wouldn't it be simpler to use the alias straight
away and even pass it into the Aggregation(...) objects, so that
proper mappings can be constructed straight away? Even if you don't
want to do that, I suggest to give agg_i as the first parameter to
Aggregation(...) since otherwise it kind of falls from the sky later
on. Also worth noting that each agg_i must be a fresh variable,
distinct from all other agg_j and from all variables occurring in Q.

I don't see why G is needed in AggregateJoin(A, G), since each A_i
already has G as a parameter and G is not needed as a parameter to
AggregateJoin.

The example misses a brace before COUNT and the ?x should be ?a. I
suggest to extend the example so that the changed Q, with agg_i
instead of the aggregates, is also shown; showing A := (A_1, A_2)
would also help understanding (in particular since A is currently
missing entirely, I guess).
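For illustration, with a toy query of my own (not the spec's example),
I would expect the rewrite to produce something like:
Q:  SELECT (COUNT(?y) AS ?c) WHERE { ?a :p ?y } GROUP BY ?a
Q': SELECT (?agg_1 AS ?c) WHERE { ?a :p ?y } GROUP BY ?a
A := (A_1) with A_1 := Aggregation((?y), Count, {}, G)
Showing the spec's own example in this form would make agg_i much
easier to follow.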

18.2.4.3
P should probably be a set, not a list, since this is expected later in
the algebra evaluation and makes more sense. On that note, is SELECT
?x ?x allowed in SPARQL? Should it just skip one ?x, produce two
result columns for ?x, or give an error?
The note following the algorithm should also mention that an error
arises when the variable used as the named target already occurred in
a previous AS assignment. Extend in the algorithm is written all lower
case (extend(...) instead of Extend(...)).

18.2.5.2 still has [@@ link]
Also note that here the projection vars are expected to be a set, not a
list, and it should say "all named variables *that are in-scope* in the
query if SELECT * is used".

18.2.5.6 If the clause is ... in the co*r*responding position

18.3.1
Basic graph patterns form the basis of SPARQL pattern matching.
That is no longer true since PPEs are now also basic elements that
require matching and generate solution mappings.

It might make sense to move the PPE stuff from 18.4 to Section 18.3,
too, since so far the structure is to first treat the basic elements,
i.e., operations that generate solutions, and then introduce the
higher-level operations that work with the solutions generated from
the basic steps.

It might also make sense to label 18.3 and 18.4 "Basic Algebra
Expressions" and "Complex Algebra Expressions", respectively.

18.4 Definition: Filter. This no longer works for [NOT] EXISTS since
evaluating the expressions needs the dataset with its active graph
D(G).

"There are 4 property path operators in the SPARQL algebra"
but then only three definitions follow: NegatedPropertySet is
missing.
I find the given definitions not very clear. They are not really
defining anything; it is more an intuitive explanation. I suggest to
define the evaluation properly here (or in 18.3 if they are moved to
18.3), i.e., move the definitions of the evaluation of these functions
here. This would make the structure more consistent.

Definition: Extend
The function in the definition and in some other places is written all
lower case; to be consistent it should be Extend(...).
The line "extend is undefined when var in dom(μ)." is the only item
not indented like the others.
The note "@@ Define the case for var in dom(μ) (does not arise in
SELECT expressions)" is done as far as I can tell.

Definition: ListEval
I suggest to rephrase
ListEval(exprlist, μ) returns a list E, where E_i =
μ(exprlist_i) or error.
into
ListEval((expr_1, ..., expr_n), μ) returns a list (e_1, ..., e_n), where e_i =
μ(expr_i) or error.
It took me a long time to understand that E is the list of E_i, which
are to be constructed from the given expressions in the list. Making
the list elements more obvious might help.
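A tiny example might also help, e.g. for μ = { ?x -> 1 }:
ListEval((?x, ?y), μ) = (1, error)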

ListEval((unbound)) = (error), as the evaluation of an unbound
expression is an error.
Here ListEval is missing an argument (μ):
ListEval((unbound), μ) = (error)

Definition: Group
Uses a "nice" arrow to define mappings for a function. Previously the
spec used -> for that purpose. The proper symbol would actually be
what LaTeX does with \mapsto, which has a small vertical bar at the
left end of the arrow. I suggest to stick to one notation, e.g., keep
using ASCII ->.

For the example following the Group definition, I suggest to first
give the query, and it might be more helpful to use a unary aggregate,
as all the standard ones defined by SPARQL are unary.
The grouping actually maps to multisets of existing solution mappings,
but here the ?x mapping got lost. It might make sense to label the
rows in the solution sequence table with μ_1 to μ_3; then you can say:
We produce G = Group((?x), Ω) = { (1) -> { μ_1, μ_2 }, (2) -> { μ_3 } }
Note that in the spec the example just uses , whereas the definition
introduces the nice arrow, which is -> elsewhere. Also, the example
loses the ?x mappings, but according to the Group definition, we
should map from the key to the complete solution mappings for that
key.
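For concreteness (reusing the values from my worked evaluation further
below): with
μ_1 = { ?x -> 1, ?y -> 2, ?z -> 3 }
μ_2 = { ?x -> 1, ?y -> 3, ?z -> 4 }
μ_3 = { ?x -> 2, ?y -> 5, ?z -> 6 }
we get Group((?x), Ω) = { (1) -> { μ_1, μ_2 }, (2) -> { μ_3 } }, i.e.,
each key maps to the complete solution mappings for that key, not just
to their non-key parts.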

I would also like to see the algebra translation, since the
algebra transformation for aggregates rewrites the query. As I
understand it, we get
Q = SELECT (?agg_1 AS ?agg) WHERE { ?x ?y ?z } GROUP BY ?x
and the algebra object:
AggregateJoin(Aggregation((?y, ?z), ex:agg, {}, G), G)
from the algebra translation up to the GROUP BY and aggregate steps.
From the select expression translation, I would then get
Extend(AO, ?agg, ?agg_1)
with AO being the algebra object computed so far, i.e.,
AggregateJoin(...). Finally, I get Project(AO, {?agg})
with AO being Extend(...).

At this point, it has not yet been defined how Aggregate(...) or
AggregateJoin(...) is evaluated, so I suggest to say that you come
back to the example later instead of evaluating Aggregate (but not
AggregateJoin) although the reader doesn't yet know how that is
defined.

Definition: Flatten
I suggest to use lower case characters for the actual elements of a
(multi)set and upper case characters for (multi)sets.
The Flatten(M) function takes a multiset M = { L_1, ..., L_n } of
lists and returns { l | L in M and l in L }.
or
The Flatten(M) function takes a multiset M = { L_1, ..., L_n } of
lists and returns the multiset union of all multisets M_i = { l | l in
L_i } for all i with 1 <= i <= n.
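E.g., with either formulation:
Flatten({ (1, 2), (2, 3) }) = { 1, 2, 2, 3 }
i.e., the result is a multiset, so the repeated 2 is kept.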

Count is a SPARQL set function which counts the number of times a
given expression has a bound, and non-error value with*in* the
aggregate group.

Why is COUNT the only function also getting an xsd:integer err as
input? The function is unary everywhere else and err is not used in
the definition.

All the later aggregate definitions overload the function by defining
it first for a multiset, which internally defines the same function
for a list. It can only be assumed that S in Sum(S) is then a
list. It might be clearer to say:
Sum(multiset M) = SumL(ToList(Flatten(M)))
SumL(list L) = op:numeric-add(...)
Also, here |S| is used to denote the cardinality, but that hasn't been
defined; for (multi)sets card[M] is used. Also, S_0 and S_1...n are
used to refer to the first (index 0) element of the list and the
sublist containing the second to the last element, which should then
end at n-1 if you start from 0. This should be clarified, or you just
define
SumL( (s_1, ..., s_n) ) = ... for n > 1
SumL( (s_1) ) = ... and
SumL( () ) = 0
I suggest to put the example outside of the definition; it also seems
to be wrong. It should be:
Sum({1,2,3})=op:numeric-add(1, op:numeric-add(2, op:numeric-add(3,
0))).

For Min(M) I also suggest to use MinL internally, e.g.,
Min(M) = MinL( ToList(M) )
MinL( (l_1, ..., l_n) ) orders (l_1, ..., l_n) according to ORDER BY
ASC and returns the first element of the ordered sequence.
Same for Max.
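E.g., with this formulation, Min({ 3, 1, 2 }) = MinL( (3, 1, 2) )
(in whatever order ToList produces), which after ordering gives
(1, 2, 3), so the result is 1.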

Definition: Sample
I suggest to say already in the definition that v is *an arbitrary*
element from Flatten(M).

Definition: ToMultiSet
... The order and any duplicates in the sequence have no effect on the
resulting multiset.
It's not clear what it means for duplicates to have no effect. Do they
occur only once in the resulting multiset, i.e., only the first
occurrence has an effect in that it contributes a value, whereas the
second and further occurrences don't contribute? Or does it mean that
duplicates get no special treatment and each one contributes an
element? How about saying that duplicates in the list are (not)
preserved in the resulting multiset? An example might also help.
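E.g., for the sequence (μ_1, μ_1, μ_2): is the result { μ_1, μ_1, μ_2 }
(duplicates preserved) or { μ_1, μ_2 } (duplicates collapsed)? Spelling
out one such example in the spec would settle this.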

18.5
We define eval(D(G), graph pattern)
should probably be
We define eval(D(G), algebra object)
since we don't work on graph patterns here. Even if that was meant to
mean "algebra object for a graph pattern", that's not really correct,
since the section also defines the evaluation of Project etc.

I also find the order very random; I suggest to start with Bgp(...)
and the PPEs, just referring back to Section 18.3 (where I would
define not only BGP evaluation but also PPE evaluation).
Also note that for Bgp(...) the link says
See section 12.3 Basic Graph Patterns
whereas it should say 18.3 (the link is correct)

The definition for the evaluation of Group(...) (from aggregates) is missing.

Definition: Evaluation of Aggregation
Aggregation applies a set function “func” to a multiset of lists of
expressions and a grouped solution sequence, G as produced by the Group
function. It produces a single value for each key and partition for that key
(key, X).
should be
Aggregation applies a set function “func” to a multiset of lists of
expressions and a grouped solution sequence, *P* (G is used in
D(G) for the active graph) as produced by the Group function. It
produces a single value for each key and partition for that key
(key *->* X).
M should be defined as:
 M = { L | L = ListEval(exprlist, μ) for μ in ran(g) }
or
 M = { ListEval(exprlist, μ) | μ in range(g) }
since ran(g) is a set of solution mappings, whereas ListEval is
defined for an expression list and one solution mapping. Also,
range(function) has not been defined before, although it is pretty
obvious, whereas for the domain of a solution mapping there is a
special definition.

Special Case: ... will be *the* cardinality of ...

Can I actually use COUNT(DISTINCT *)? That doesn't seem to make much
sense to me.

Note that the purpose of the expression card[range(g)] - card[M] is to
indicate to the set function the number of expressions that evaluated
to an error.
card[range(g)] is only used in COUNT(*); how does that relate to error
indication in general?

For example the aggregate expression GROUP_BY(?x ; separator="|")
should be GROUP_CONCAT


Definition: Evaluation of AggregateJoin
...eval(D(G), AggregateJoin(A, P) ***)***
missing brace

I would much prefer (as said above) if the temporary variables agg_i
were given as a parameter to each A_i = Aggregation(...), since otherwise
the algebra object doesn't contain all the information required to
evaluate it. I have to know which variable name was used for each A_i,
since I can hardly assume it is really agg_i given that agg_i is not a
forbidden variable name and an evil user might use it within a
query.

I also assume the definition is wrong; I want
eval(D(G), AggregateJoin(A, P))={ (agg_1, v_1), ..., (agg_n, v_n) |
v_i such that ( k, v_i ) in eval(D(G), A_i) for some k and each 1 <= i
<= n }

Definition: Evaluation of Extend
eval(D(G), extend(var, expr, P)) = extend(var, expr , eval(D(G), P))
should be
eval(D(G), Extend(P, var, expr)) = Extend(eval(D(G), P), var, expr)
Extend gets the algebra object as its first parameter, which is then
evaluated into a solution sequence, and again the first character of
Extend should be upper case.

I suggest at this point to go back to the aggregation example from
18.4:
I got
Project(Extend(AJ, ?agg, ?agg_1), {?agg})
with
AJ=AggregateJoin(Aggregation((?y, ?z), ex:agg, {}, G), G)
and
G = { (1) -> { μ_1, μ_2 }, (2) -> { μ_3 } }
Aggregation((?y, ?z), ex:agg, {}, G) gives
g_1 : (1) -> { μ_1, μ_2 }
g_2 : (2) -> { μ_3 }
so we get
{
( (1), ex:agg( { (2, 3) , (3, 4) }, {} ) ),
( (2), ex:agg( { (5, 6) }, {}) )
}
Let's for simplicity assume ex:agg is actually SUM and the query had
SUM(?y), then I would get
{
( (1), Sum( { (2) , (3) } ) ),
( (2), Sum( { (5) } ) )
}
= { ( (1), 5 ), ( (2), 5 ) }
From the AggregateJoin we then get
{ { (agg_1, 5) }, { (agg_1, 5) } }
where each set in the multiset is one solution mapping.

Extend(...) then gives
{ { (agg_1, 5), (agg, 5) }, { (agg_1, 5), (agg, 5) } }
although I am not sure where it says that eval(agg_1) gives 5. The
link for expression links to Section 17 and I can't see that just
using a variable is fine...

Finally, Project(...) gives two solutions both mapping agg to 5.


Definition: Evaluation of ZeroLengthPath
I suggest to define Nodes(G) for a graph G above the ZeroLengthPath
definition since it is also used elsewhere, e.g., in the evaluation of
OneOrMorePath.
The notation xxx:type is used here for the first time and it might be
worth saying something like: For a term (or variable) x, x:t
denotes that x is of type t.
term in nodes(G) should be term in nodes(D(G))
card[ ] is not clear;
is the last card[ ] not 0, assuming it denotes the cardinality for
each mapping in the solution?

I can't fully understand the function ALP. I guess eval(x, path) needs
D(G) and then does something like Bgp(x path ?y) evaluated over
D(G). The path here is a normal predicate, I think, and it might help
to make that clear.

Definition: Evaluation of ZeroOrMorePath
Should produce multisets of solution mappings, i.e.,
{ { (vy, n) } | n in ALP(x, path) }
(note extra curly brackets) and not
{ (vy, n) | n in ALP(x, path) }
The same holds for all the eval definitions within this definition.

Definition: Evaluation of OneOrMorePath
eval(D(G), OneOrMorePath(X, path, Y)
is missing the final closing brace

still has
@@Change to algorithmic form??

Definition: Evaluation of NegatedPropertySet
where X and Y *are* (not for) variables or RDF terms
In this definition and the remaining ones, D(G) is required, but only
D is written

The second
Definition: Evaluation of ToList
should be
Definition: Evaluation of Subquery

18.6
The overall SPARQL design can be used for queries which assume a more
elaborate form of entailment than simple entailment, by re-writing the
matching conditions for basic graph patterns.

This is no longer true due to PPEs. I am not happy about this at all,
and I had assumed that PPEs are optional features. If they are not, it
is quite unfortunate that the extension point that existed so far is
no longer really one, and something has to be done at least to clarify
this!


-- 
Dr. Birte Glimm, Room 309
Computing Laboratory
Parks Road
Oxford
OX1 3QD
United Kingdom
+44 (0)1865 283520

Received on Tuesday, 15 March 2011 00:35:10 UTC