Re: ungrouped variables used in projections - Further implications? from Andy Seaborne on 2010-08-26 (public-rdf-dawg@w3.org from July to September 2010)

From: Andy Seaborne <andy.seaborne@epimorphics.com>
Date: Thu, 26 Aug 2010 14:48:36 +0100
To: Axel Polleres <axel.polleres@deri.org>
CC: SPARQL Working Group <public-rdf-dawg@w3.org>, Lee Feigenbaum <lee@thefigtrees.net>, Steve Harris <steve.harris@garlik.com>
Message-ID: <4C7670B4.6090602@epimorphics.com>

On 25/08/10 22:24, Axel Polleres wrote:
>>> Any opinions on this? This actually worries me about the current  "potentially bound" wording.
>>
>> If we want a static analysis of the query, then regard ?Y as potentially
>> bound.
>
> We'd need to explain/define the exact reading of "regard as potentially bound". As it stands it is unclear. My example *could* be detected by static analysis, if the static analyser was able to detect *statically* unsatisfiable FILTER expressions, such as (?Y != ?Y) , so it is not clear why  ?Y should be regarded as potentially bound in my example. I am fine with any formulation which has a clear definition, the current wording unfortunately does not.

In general, analysing expressions for non-satisfiability is not 
practical.  Only some simple cases like your example are possible, as are 
other forms optimizing compilers might notice.  Once reordering and 
equivalence are added, the complexity cost grows.  And what about 
structural invariances like FILTER(?Y>45 && isIRI(?Y)) or 
data-introduced ones like FILTER(?ageInYear < -10)?

A practical scheme is based on sites where variables are bound in BGP 
within patterns.

This relates to the GROUP BY and expression handling.  Detecting when 
one expression (in the SELECT line) uses another (from the GROUP BY) is 
complex because complex expressions can be written in different ways yet 
be equivalent expressions.

Therefore, I suggest we do not require implementations to be able to 
perform such comparisons.

My outline definition of potentially bound is a practical algorithm 
based on just the points where a variable can be bound (not filtered 
out).  There are only a few places where terms can be bound to variables.
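
A minimal sketch of such an analysis, for illustration only and not 
taken from the draft.  The tuple-based algebra and the function name 
potentially_bound are hypothetical, and only a handful of operators are 
covered.  The walk collects variables at binding sites (here, BGPs) and 
treats FILTER as something that can only remove solutions, so the 
filter expression is never inspected.

```python
# Toy algebra, hypothetical and for illustration only:
#   ("bgp", [(s, p, o), ...])      triple patterns; variables start with "?"
#   ("join", left, right)
#   ("leftjoin", left, right)      OPTIONAL
#   ("union", left, right)
#   ("filter", expr, pattern)      expr is ignored by the analysis

def potentially_bound(op):
    """Return the set of variables that some solution of `op` may bind."""
    kind = op[0]
    if kind == "bgp":
        return {t for triple in op[1] for t in triple if t.startswith("?")}
    if kind in ("join", "leftjoin", "union"):
        return potentially_bound(op[1]) | potentially_bound(op[2])
    if kind == "filter":
        # Filters only remove solutions; they are never a binding site.
        return potentially_bound(op[2])
    raise ValueError(f"unknown operator: {kind}")

# A pattern with an unsatisfiable filter in the style of (?Y != ?Y).
query = ("filter", "?Y != ?Y",
         ("leftjoin",
          ("bgp", [("?X", ":p", "?Z")]),
          ("bgp", [("?X", ":q", "?Y")])))

print(sorted(potentially_bound(query)))  # ['?X', '?Y', '?Z']
```

Under this reading ?Y counts as potentially bound even though the 
filter can never succeed, which is the price of not analysing 
expressions.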

> My proposal for rewording was maybe too restrictive, but it was clearly checkable statically.
> BTW, the current wording for "SELECT *" is equally ambiguous.
>
> [...]
>
>>>> Unnecessarily severe.
>>>
>>> Fair enough, if we can afford it. Though it seems that expressions in GROUP BY are strictly speaking not necessary, and seem to be replaceable quite easily, so I wouldn't consider this restriction severe.
>>
>> It's severe because it's the corner case driving the main design.  And
>> you were arguing for shorter syntax.
>
> Yes, but your version leaves us with something very restricted, it seems.
> you say you'd disallow agg08...

Why is it "very restricted"?

It's a restriction but I don't see it as /very/ restricting, especially 
as you have already shown that if the app needs the value of the 
grouping returned it can do so using a nested SELECT.

The balance is the difficulty of determining whether one expression is a 
sub-expression of another, including reordering and rewriting.

Consider

GROUP BY (1/?o)

then

SELECT (fn:floor(1/(-2*?o))+count(*))

is theoretically safe.  When two or more variables are involved, it gets 
complicated.

>> agg08 uses an expression for GROUP BY. I am suggesting, as a
>> simplification, that it does not put ?O1 and ?O2, not (?O1+?O2), as
>> legal uses in an expression in the SELECT clause.
>
> That would be a quite different query, wouldn't it? Can you show me what exactly your simplification means for the agg08 query?

agg08 would be an error because it uses variables in an expression which 
are not key variables of the group.

> Let me try to understand again what you propose:
> - you want to allow only grouped variables being projected or used in project expressions

Yes, understanding "grouped variables" as variables used in GROUP BY, 
but not in an expression.
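
As a rough sketch of how cheap this check is (the nested-tuple 
expression form, the AGGREGATES set and all function names are 
assumptions made for illustration): collect the variables used outside 
aggregates and subtract the plain variables listed in GROUP BY.

```python
AGGREGATES = {"count", "sum", "min", "max", "avg"}  # illustrative subset

def variables(expr):
    """Collect variables used outside aggregates in a nested-tuple expression."""
    if isinstance(expr, str):
        return {expr} if expr.startswith("?") else set()
    if isinstance(expr, tuple):
        if expr[0] in AGGREGATES:
            return set()  # any variable may appear inside an aggregate
        # expr[0] is the operator name; the rest are operands.
        return set().union(*(variables(arg) for arg in expr[1:]))
    return set()  # numeric or other constants

def key_variables(group_by):
    """Only plain variables in GROUP BY count; expressions contribute nothing."""
    return {g for g in group_by if isinstance(g, str) and g.startswith("?")}

def ungrouped_variables(select_expr, group_by):
    """Variables the rule would reject: used in SELECT but not key variables."""
    return variables(select_expr) - key_variables(group_by)

# Grouping by an expression, then using its variables in SELECT: rejected.
print(sorted(ungrouped_variables(("+", "?o1", "?o2"), [("+", "?o1", "?o2")])))
# ['?o1', '?o2']

# Grouping by a plain variable and mixing it with an aggregate: accepted.
print(sorted(ungrouped_variables(("+", "?s", ("count", "?o")), ["?s"])))
# []
```

No comparison of expressions is needed at any point; the test is a set 
difference over variable names.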

> - you additionally want to allow grouping by expressions, but the grouped expressions are not reusable in the SELECT clause.
> yes?

Yes.  I'm following the current doc which allows grouping by expression 
(syntax and definition reading ExprList as a list of expressions).

Group(ExprList, Ω) =
   { ListEval(ExprList, μ) ->
     { μ' | μ' in Ω, ListEval(ExprList, μ) = ListEval(ExprList, μ') }
     | μ in Ω }
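
A small sketch of that definition, under simplifying assumptions: 
solutions are plain dicts, each expression is an ordinary callable, and 
evaluation errors are not modelled.  The names list_eval and group are 
illustrative only.

```python
def list_eval(expr_list, mu):
    """Evaluate each expression against solution mu; the tuple is the group key."""
    return tuple(expr(mu) for expr in expr_list)

def group(expr_list, omega):
    """Map each key ListEval(ExprList, mu) to the solutions sharing that key."""
    groups = {}
    for mu in omega:
        groups.setdefault(list_eval(expr_list, mu), []).append(mu)
    return groups

omega = [
    {"?o1": 1, "?o2": 4},
    {"?o1": 2, "?o2": 3},
    {"?o1": 1, "?o2": 1},
]

# GROUP BY (?o1+?o2)
by_sum = group([lambda mu: mu["?o1"] + mu["?o2"]], omega)
for key, members in by_sum.items():
    print(key, members)
```

The group with key (5,) holds solutions where ?o1 is 1 and 2, so ?o1 
on its own has no single value within that group, while ?o1+?o2 does.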

> If so, it seems our arguments run a bit past each other...
> You seem to propagate a stronger restriction than me for GROUPing, but a weaker restriction than mine for variables allowed as names in project expressions?

I suggest a stronger restriction on the variables allowed in project 
expressions (and projections and HAVING) in that it only considers 
variables.  This is because of the complexity of determining whether one 
expression is "safe" given an expression used for GROUP BY.

Otherwise we are trying to allow:

SELECT (?o1+?o2 AS ?o3) ... GROUP BY (?o2+?o1)

SELECT (1/(?o1+?o2) AS ?o3) ... GROUP BY (?o2+?o1)

unclear about:
SELECT (fn:floor(2*?o1+2*?o2) AS ?o3) ... GROUP BY (?o2+?o1)

but not
SELECT ?o1 ... GROUP BY (?o2+?o1)
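
To make the difficulty concrete, here is a sketch of one possible 
syntactic test; it is an assumption for illustration, not something 
proposed in the draft.  It sorts the operands of commutative operators 
and then looks for the GROUP BY expression as a sub-expression.  The 
nested-tuple form and the names normalise and contains are hypothetical.

```python
COMMUTATIVE = {"+", "*"}

def normalise(expr):
    """Sort the operands of commutative operators so operand order is irrelevant."""
    if not isinstance(expr, tuple):
        return expr
    op, args = expr[0], [normalise(a) for a in expr[1:]]
    if op in COMMUTATIVE:
        args.sort(key=repr)
    return (op, *args)

def contains(expr, key):
    """True if `key` occurs as a sub-expression of `expr` after normalisation."""
    expr, key = normalise(expr), normalise(key)
    def walk(e):
        return e == key or (isinstance(e, tuple) and any(walk(a) for a in e[1:]))
    return walk(expr)

key = ("+", "?o2", "?o1")  # GROUP BY (?o2+?o1)

print(contains(("+", "?o1", "?o2"), key))            # True
print(contains(("/", 1, ("+", "?o1", "?o2")), key))  # True
print(contains(("fn:floor", ("+", ("*", 2, "?o1"), ("*", 2, "?o2"))), key))  # False
```

The third expression is constant within each group, yet the test 
rejects it because recognising that needs algebraic rewriting, not just 
reordering.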

I don't think that removing the possibility of GROUP BY with an 
expression would be particularly serious; however, there is no reason to 
forbid it (the issue is expressions in SELECT with constant value within 
a group, not the GROUP BY clause) and it is in the current draft.

I'm not sure what you propose.  You have mentioned no expressions in 
GROUP BY and also allowing reuse of the same expression used in the 
GROUP BY in the select expressions.  For the latter, I haven't seen what 
equivalence of expressions, or inclusion of expressions, is involved.

 Andy
Received on Thursday, 26 August 2010 13:49:23 UTC