- From: Birte Glimm <birte.glimm@comlab.ox.ac.uk>
- Date: Tue, 15 Mar 2011 00:34:34 +0000
- To: SPARQL Working Group <public-rdf-dawg@w3.org>, Andy Seaborne <andy.seaborne@epimorphics.com>, Steve Harris <steve.harris@garlik.com>
Hi Andy, Steve, all, I finally managed to complete the review for the algebra section, which took me quite a long time and I have quite a number of problems with it and/or suggestions for improvement, see below. Best regards, Birte First some comments about non-algebra sections that I noticed while going back and forth between the algebra and the examples in the previous sections: 11.2 The example has an unaliased aggregate, but in 11.1 it says that "It should be noted that as per functions, aggregate expressions must be aliased in order to project them from queries or subqueries." In that sentence the must should be marked as RFC 2119 term: <rfc2119>MUST</rfc2119>. Furthermore, the variable ?a suddenly disappears in the provided solution sequence, so probably ?a was meant to be :a (it is not used anywhere anyway). 17.4.1.4 The algebra operations in this section should be moved to Section 18, where it should be clarified how these algebra operations are introduced in the translation from the abstract syntax to algebra objects. I still have a major concern here and that is really a no-go issue IMO: The pattern in FILTER [NOT] EXISTS pattern is not translated into algebra objects. Thus, the Substitute and Exists functions are not even used yet. Lets consider the example from 8.3.3: SELECT * WHERE { ?a :p ?n FILTER NOT EXISTS { ?a :q ?m FILTER(?n = ?m) } } The query pattern translates to: Filter( fn:not(EXISTS { ?a :q ?m FILTER(?n = ?m) } ), Bgp(?a :p ?n) ) The first BGP is properly translated into the Bgp(...) algebra operator, whereas the other one isn't. IMO, we should get something like: Filter( fn:not( Exists( Filter(?n = ?m, Bgp(?a :q ?m)) ) ), Bgp(?a :p ?n) ) However, even if we get that, the dataset and the active graph is not being handed through to the filter evaluation, whereas here the active graph magically appears in the evaluation of exists. On that note: In the evaluation of Exists, the active graph is the second argument, whereas everywhere else it is the first argument of eval and only there is is written D[g] instead of D(G). More consistent would be: exists(pattern, μ) = true if and only if eval(D(G), substitute(pattern, μ)) has any solutions. That only has the problem that the dataset and the active graph is not available to the filter expressions. Even if we properly translate the pattern in the filter, applying the substitution now has to substitute within algebra objects, but otherwise the only other option I see is to do an algebra translation each time EXISTS is executed. Neither option is clearly specified IMO yet and at some point an algebra translation has to occur since the pattern in the [NOT] EXISTS filter can be any kind of complex pattern possibly even incl. sub-queries. I also believe the Substitute should be defined such that it only applies to terms in the domain of μ. E.g., substitute(pattern, μ) = the pattern formed by replacing every occurrence of a variable v in pattern for which μ is defined by μ(v). OK, now to the really reviewed section: In general, there is quite a mix of different pseudo code style: IF ... THEN ... FI If ... Then ... End If If ... Then ... (no end for the if) If ... (no then) ... Totally mixed use of upper and lower case and even mix of styles within one pseudo code section. This should really be unified. 18.1.1: IRIs are a subset of RDF URI References that omit (no s I believe) the use of spaces. 18.1.7 Property Path (no s) Patterns *A* Property Path is a path in *a* graph G if each ti is a triple of G. Definition: Property Path Expression A property path expression is an expression used to match properties in a graph formed after translation of the path syntax as defined <link>above</link>. The link just links to that very definition. I also assume you mean below (18.2.2.2)? Even then, this definition is not clear to me. I would define a Property Path Expression (PPE) inductively as follows: iri, !(iri_1|...|iri_n), !(^iri_1|...|^iri_n), !(iri_1|...|iri_i|^iri_i+1|...|^iri_n) are PPEs and, for P, P1, and P2 PPEs: ^P, !P, P_1 | P_2, P_1 / P_2, ... are PPEs. Also note that Section 9.1 is called Property Path expressions, but then the syntax forms are based on path elements. What a PPE really is becomes only clear from the definition of Property Path Pattern (PPP), because there one can see that PPEs are what occurs in the predicate position. Before, I wasn't sure whether a PPE is actually a triple pattern extended to allow for path elements as predicates. I suggest to unify notation and call path element in 9.1 PPE. In 18.1.7, I suggest to first define PPEs properly and listing the possibilities explicitly, then define Property Path Pattern, and then Property Path not as a sequence of triples, but as a sequence of PPPs. 18.2 The SPARQL query string is parsed and the abbreviations for IRIs and triple patterns given in *a* (delete) section 4 are applied. The table for the syntax tree misses BINDINGS as solution modifier and Extends should be Extend (no s). 18.2.1 We define a variable to be in-scope if there is a way for a variable to be in the domain *of* a solution mapping... In-scope definition: I suggest to be specific and use "Property Path Expression (PPP)" and "v occurs in PPP" for the second entry. OPTIONAL{P} misses a space: OPTIONAL {P} Below the table: In SELECT expressions, the variable may *be* use*d* in an expression later in the same SELECT clause and may not be *be* (duplicate) assigned again in the same SELECT clause. May and may not could/should be marked as RFC 2119 terms. I would even tend to say MUST NOT be assigned again. 18.2.2.2: Notes: ...The order of forms URI and ^URI in negated property sets is not relevant. URI should be IRI The parsing step interprete*s* (not interpreted) triple patterns as property paths of length 1. I suggest the notation of the table is unified with the earlier example section, where you use elt instead of path. Also :iri in the table needs no :, but apart from row 1 all other iris are :iris. In general, I don't believe you can get away with not having a recursive translation. I suggest to define something like: Given a PPE X pp Y, we define the translation T(X, pp, Y) as follows and then give a table with left-hand side listing possible T(x, pp, Y) and the right-hand side the result, which is recursive: T(X, iri, Y) = X iri Y T(X, !(iri_1|...|iri_n), Y) = NPS(X, {iri_1, ..., iri_n, Y) T(X, !(^iri_1|...|^iri_n), Y) = T(X, ^(!(iri_1, ..., iri_n)), Y) T(X, !(iri_1|...|iri_i | ^iri_i+1|...|^iri_n), Y) = { T(X, !(iri_1, ..., iri_i), Y) } UNION { T(X, !(^iri_i+1|...|^iriiri_nn), Y) } T(X, ^path, Y) = T(Y, path, X) T(X, path1 / path2, Y) = T(X, path1, ?v) . T(?v, path2, Y) etc 18.2.2.4 If E is *of* the form MINUS {P} IMO, the patterns in [NOT] EXISTS filters must be translated here too. 18.2.2.5 extra spacing before the second simplification form 18.2.3 Some examples have just one translation whereas others have two. I suggest to add a sentence explaining this, e.g.: If just one algebra translations is given, then this is the simplified one, whereas if two translations are given, the first one is the non-simplified one and the second one is the simplified one. Forth example: group consisting *of* a union of a union and a basic graph pattern: Fifth example: extra brace after the second join Example: Pattern involving BIND, Extend(...) is written all lower case Minus example: closing brace missing Subquery example, should be: Join( Bgp(...), ToMultiSet( Distinct(Project(Bgp(...), {?o})) ) ) Currently it misses the closing brace and the proper translation of Project. Note that I use a set here for the project variables, which is what the Project operation expects, but the algebra translation produces a list at the moment. 18.2.4.1 It divides the solution into *groups of* one or more solutions, with the same overall cardinality. Algorithm: Let P := the *algebra translation* of the group graph pattern of the query level According to the grammar, the GROUP BY clause can alo contain BuiltInCalls and FunctionCalls, which is not handled here and probably shouldn't be allowed. The A_i are never combined into A, so A is consistently the empty set. Instead of constructing A_i, add the Aggregation(...) to A, or set A:=A_1, ..., A_i-1 before building P. I don't understand whether the outer for loop is indeed iterating over all expressions in SElECT and each HAVING, each time executing the inner IF and two FOR loops as the indentation suggests or whether the second and third for loop run after the first one is done That's what I would expect). The first and second FOR loop should probably be extended to handle SELECT * and the SAMPLE replacement is only required if the variable/variable in the expression is also not grouped in GROUP BY. It took me a loooong time to understand agg_i. Since aggregates anyway have to be aliased, wouldn't it be simpler to use the alias straight away and give that even into the Aggregation(...) objects, so that proper mappings can be constructed straight away? Even you you don't want to do that, I suggest to give agg_i as the first parameter into Aggregation(...) since otherwise it kind of falls from the sky later on. Also worth noting that each agg_i must be a fresh variable distinct from all other agg_j and variable occurring in Q. I don't see why G is needed in AggregateJoin(A, G) since each A_i already has G as parameter and G is not needed as a parameter to AggregateJoin. The example mises a brace before COUNT and the ?x should be ?a. I suggest to extend the example so that also the changed Q with agg_i instead f the aggregates is shown and A := (A_1, A_2) would help understanding (in particular since it is now missing totally I guess). 18.2.4.3 P should probably be a set not a list since tis is expected later in the algebra evaluation and makes more sense. On that note, is SELECT ?x ?x allowed in SPARQL? Should it just skip one ?x, produce two result columns for ?x or give an error? The note following the algorithm should also mention that an error arises when variable as named target occurred in a previous AS assignment. Extend in the algorithm is written all lower case (extend(...) instead of Extend(...)). 18.2.5.2 has still [@@ link] also note that here the projection vars are expected to be a set not a list and i should say "all named variables *that are in-scope* in the query if SELECT * is used 18.2.5.6 If the clause is ... in the co*r*responding position 18.3.1 Basic graph patterns form the basis of SPARQL pattern matching. That is no longer true since now also PPEs are basic elements that require matching and generate solution mappings. It might make sense to move the PPE stuff from 18.4 to Section 18.3 too since so far the structure is to first treat basic elements, i.e., operations that generate solutions, and then introduce the higher level operations that work with the solutions generated from the basic steps. It might also make sens to label 18.3 and 18.4 "Basic Algebra Expressions" and "Complex Algebra Expressions", respectively. 18.4 Definition: Filter. This no longer works for [NOT] EXISTS since evaluating the expressions needs the dataset with its active graph D(G). "There are 4 property path operators in the SPARQL algebra" but then only three definitions follow: NegatedPropertySet is missing. I find the given definitions not very clear. They are not really defining anything; it is more an intuitive explanation. I suggest to define the evaluation properly here (or in 18.3 if they are moved to 18.3), i.e., move the definitions of the evaluation of these functions here. This would make the structure more consistent. Definition: Extend The function in the definition and in some other places is written all lower case, to be consistent it should be Extend(...) extend is undefined when var in dom(μ). is the only item not indented as the others @@ Define the case for var in dom(μ) (does not arise in SELECT expressions) is done as far as I can tell. Definition: ListEval I suggest to rephrase ListEval(exprlist, μ) returns a list E, where Ei = μ(exprlisti) or error. into ListEval((expr_1, ..., expr_n), μ) returns a list (e_1, ..., e_n), where e_i = μ(expr_i) or error. It took me a long time to understand that E is the list of E_i, which are to be constructed from the given expressions in the list. Making the list elements more obvious might help. ListEval((unbound)) = (error), as the evaluation of an unbound expression is an error. Here ListEval is missing an argument (μ): ListEval((unbound), μ) = (error) Definition: Group Uses a "nice" arrow to define mappings for a function. Previously the spec used -> for that purpose. The proper symbol would actually be what Latex does with \mapsto, which has a small vertical bar at the left end of the arrow. I suggest to stick to one notations. e.g., keep using ascii ->. For the example following the Group Definition, I suggest to first give the query and it might be more helpful to use a unary aggregate as all the standard ones defined by SPARQL are unary. The grouping actually maps to multisets of existing solution mappings, but here the ?x mapping got lost. It might make sense to label the rows in the solution sequence table with μ_1 to μ_3, then you can say: We produce G = Group((?x), Ω) = { (1) -> { μ_1, μ_2 }, (2) -> { μ_3 } } Note that in the spec the example just uses , whereas the definition introduces the nice arrow, which is -> elsewhere. Also, the example looses the ?x mappings, but according to the Group definition, we should map from the key to the solution complete mappings for that key. I also would like to see the algebra translation since the algebra transformation for aggregates rewrites the query. As I understand it, we get Q = SELECT (?agg_1 AS ?agg) WHERE { ?x ?y ?z } GROUP BY ?x and the algebra object: AggregateJoin(Aggregration((?y, ?z), ex:agg, {}, G), G) from the algebra translation up to the group by and aggregate steps. >From the select expression translation, I would then get Extend(AO, ?agg, ?agg_1) with AO being the so far computed algebra object AggregateJoin(...). Finally, I get Project(AO, {?agg}) with AO being Extend(...) At this point, it has not yet been defined how Aggregate(...) or AggregateJoin(...) is evaluated, so I suggest to say that you come back to the example later instead of evaluating Aggregate (but not AggregateJoin) although the reader doesn't yet know how that is defined. Definition: Flatten I suggest to use lower case characters for the actual elements of a (multi)set and upper case characters for (multi)sets. The Flatten(M) function takes a multiset M = { L_1, ..., L_n } of lists and returns { l | L in M and l in L }. or The Flatten(M) function takes a multiset M = { L_1, ..., L_n } of lists and returns the multiset union of all multisets M_i = { l | l in L_i } for all i with 1 <= i <= n. Count is a SPARQL set function which counts the number of times a given expression has a bound, and non-error value with*in* the aggregate group. Why is COUNT the only function getting also an xsd:integer err as input? The function is unary everywhere else and err is not used in the definition. All the later aggregate definitions overload the function by defining it first for a multiset, which internally defines the same function for a list. It can only be assumed that S ind Sum(S) is then a list. It might be clearer to say: Sum(multiset M) = SumL(ToList(Flatten(M))) SumL(list L) = op:numeric-add(...) Also here |S| is used to denote the cardinality, but that hasn't been defined. For (multi)sets card[M] is used. Also S_0 and S_1...n is used to refer to the first (index 0) element of the list and the sublist containing the second to last element, which should then n-1 if you start from 0. This should be clarified or you just define SumL( (s_1, ..., s_n) ) = ... n > 1 SumL( (s_1) ) = ... and SumL( () ) = 0 I suggest to put the example outside of the definition and it seems to be wrong. It should be: Sum({1,2,3})=op:numeric-add(1, op:numeric-add(2, op:numeric-add(3, 0))). For Min(M) I also suggest to use MinL internally, e.g., Min(M) = MinL( ToList(M) ) MinL( (l_1, ..., l_n) ) orders (l_1, ..., l_n) according to ORDER BY ASC and returns the first element of the ordered sequence. Same for Max. Definition: Sample I suggest to say already in the definition that v is *an arbitrary* element from Flatten(M). Definition: ToMultiSet ... The order and any duplicates in the sequence have no effect on the resulting multiset. It's not clear what it means for duplicates to have no effect. Do they occur only once in the resulting multiset, i.e., only the first element has an effect in that it contributes a value whereas the second or more occurrences don't contribute? Or does it means duplicates get no special treatment and each one contributes an element? How about saying that duplicates in the list are (not) preserved in the resulting multiset. An example might also help. 18.5 We define eval(D(G), graph pattern) should probably be We define eval(D(G), algebra object) since we don't work on graph patterns here. Even of that was to mean algebra object for a graph pattern that's not really correct since the section also defines the evaluation of Project etc. I also find the order very random, I suggest to start with Bgp(...) and the PPEs, just referring back to Section 18.3 (where I would define not only BGP evaluation but also PPE evaluation) Also note that for Bgp(...) the link says See section 12.3 Basic Graph Patterns whereas it should say 18.3 (the link is correct) The definition for the evaluation of Group(...) (from aggregats) is missing. Definition: Evaluation of Aggregation Aggregation applies a set function “func” to a multiset of lists of expressions and a grouped solution sequence, G as produced by the Group function. It produces a single value for each key and partition for that key (key, X). should be Aggregation applies a set function “func” to a multiset of lists of expressions and a grouped solution sequence, ***P*** (G is used in D(G) for the active graph) as produced by the Group function. It produces a single value for each key and partition for that key (key ****->***** X). M should be defined as: M = { L | L = ListEval(exprlist, μ) for μ in ran(g) } or M = { ListEval(exprlist, μ) | μ in range(g) } since ran(g) is a set of solution mappings, whereas ListEval is defined for an expression list and one solution mapping. Also range(function) has not been defined before although it is pretty obvious, but for the domain of a solution there is a special definition. Special Case: ... will be *the* cardinality of ... Can I actually use COUNT(DISTINCT *)? That doesn't seem to make much sense to me. Note that the purpose of the expression card[range(g)] - card[M] is to indicate to the set function the number of expressions that evaluated to an error. card[range(g)] is only used in COUNT(*), how does that relate to error indication in general? For example the aggregate expression GROUP_BY(?x ; separator="|") should be GROUP_CONCAT Definition: Evaluation of AggregateJoin ...eval(D(G), AggregateJoin(A, P) ***)*** missing brace I would much prefer (as said above) if the temporary variables agg_i were given as parameter to each A_i = Aggregation(...) since otherwise the algebra object doesn't contain all the information required to evaluate it. I have to know which variable name was used for each A_i, since I can hardly assume it is really agg_i given that agg_i is not a forbidden variable name and an evil user might use it within a query. I also assume the definition is wrong, I want eval(D(G), AggregateJoin(A, P))={ (agg_1, v_1), ..., (agg_n, v_n) | v_i such that ( k, v_i ) in eval(D(G), A_i) for some k and each 1 <= i <= n } Definition: Evaluation of Extend eval(D(G), extend(var, expr, P)) = extend(var, expr , eval(D(G), P)) should be eval(D(G), Extend(P, var, expr)) = Extend(eval(D(G), P), var, expr) Extend gets the algebra object as first parameter, which is then evaluated into a solution sequence and again first char of Extend should be upper case. I suggest at this point to go back to the aggregation example from 18.4: I got Project(Extend(AJ, ?agg, ?agg_1), {?agg}) with AJ=AggregateJoin(Aggregration((?y, ?z), ex:agg, {}, G), G) and G = { (1) -> { μ_1, μ_2 }, (2) -> { μ_3 } } Aggregration((?y, ?z), ex:agg, {}, G) gives g_1 : (1) -> { μ_1, μ_2 } g_2 : (2) -> { μ_3 } so we get { ( (1), ex:agg( { (2, 3) , (3, 4) }, {} ) ), ( (2), ex:agg( { (5, 6) }, {}) ) } Lets for simplicity assume ex:agg is actually SUM and the query had SUM(?y), then I would get { ( (1), Sum( { (2) , (3) } ) ), ( (2), Sum( { (5) } ) ) } = { ( (1), 5 ), ( (2), 5 ) } >From the AggregateJoin we then get { { (agg_1, 5) }, { (agg_1, 5) } } where each set in the multiset is one solution mapping. Extend(...) then gives { { (agg_1, 5), (agg, 5) }, { (agg_1, 5), (agg, 5) } } although I am not sure where it says that eval(agg_1) gives 5. The link for expression links to Section 17 and I can't see that just using a variable is fine... Finally, Project(...) gives two solutions both mapping agg to 5. Definition: Evaluation of ZeroLengthPath I suggest to define Nodes(G) for a graph G above the ZeroLengthPath definition since it is also used elsewhere, e.g., in the evaluation of OneOrMorePath. The notion xxx:type is used here for the first time and it might b worth saying something like: For a term (or variable) x, x:t denotes that x is of type t. term in nodes(G) should be term in nodes(D(G)) card[ ] is not clear is the last card[ ] not 0 assuming it denotes the cardinality for each mapping in the solution? I can't fully understand the function ALP. I guess eval(x, path) needs D(G) and then does something like Bgp(x path ?y) evaluated over D(G). The path here is a normal predicate I think and it might help to make that clear. Definition: Evaluation of ZeroOrMorePath Should produce multisets of solution mappings, i.e., { { (vy, n) } | n in ALP(x, path) } (note extra curly brackets) and not { (vy, n) | n in ALP(x, path) } same for all eval defs in the definition Definition: Evaluation of OneOrMorePath eval(D(G), OneOrMorePath(X, path, Y) is missing the final closing brace still has @@Change to algorithmic form?? Definition: Evaluation of NegatedPropertySet where X and Y *are* (not for) variables or RDF terms In this definition and the remaining ones, D(G) is required, but only D is written The second Definition: Evaluation of ToList should be Definition: Evaluation of Subquery 18.6 The overall SPARQL design can be used for queries which assume a more elaborate form of entailment than simple entailment, by re-writing the matching conditions for basic graph patterns. This is no longer true due to PPEs. I am not happy about this at all and I assumed that PPEs are optional features. If they are not, it is quite unfortunate that the so far existing extension point no longer really is one and something has to be done at least to clarify this! -- Dr. Birte Glimm, Room 309 Computing Laboratory Parks Road Oxford OX1 3QD United Kingdom +44 (0)1865 283520
Received on Tuesday, 15 March 2011 00:35:10 UTC