Re: ungrouped variables used in projections - Further implications? from Axel Polleres on 2010-08-26 (public-rdf-dawg@w3.org from July to September 2010)

From: Axel Polleres <axel.polleres@deri.org>
Date: Thu, 26 Aug 2010 15:46:03 +0100
To: "Andy Seaborne" <andy.seaborne@epimorphics.com>
Cc: "SPARQL Working Group" <public-rdf-dawg@w3.org>, "Lee Feigenbaum" <lee@thefigtrees.net>, "Steve Harris" <steve.harris@garlik.com>
Message-Id: <C8AAD659-79DA-4246-8376-01EBB12A3BAC@deri.org>
On 26 Aug 2010, at 14:48, Andy Seaborne wrote:

> 
> 
> On 25/08/10 22:24, Axel Polleres wrote:
> >>> Any opinions on this? This actually worries me about the current  "potentially bound" wording.
> >>
> >> If we want a static analysis of the query, then regard ?Y as potentially
> >> bound.
> >
> > We'd need to explain/define the exact reading of "regard as potentially bound". As it stands it is unclear. My example *could* be detected by static analysis, if the static analyser was able to detect *statically* unsatisfiable FILTER expressions, such as (?Y != ?Y), so it is not clear why ?Y should be regarded as potentially bound in my example. I am fine with any formulation which has a clear definition; the current wording unfortunately does not.
> 
> In general, analysing expressions for non-satisfiability is not
> practical.

we are in wild agreement on this! ...

>  Only some simple cases like your example are possible, as are
> other forms optimizing compilers might notice. Once reordering and
> equivalence are added, the complexity cost grows.  And what about
> structural invariances like FILTER(?Y>45 && isIRI(?Y)) or
> data-introduced FILTER(?ageInYear < -10)?
> 
> A practical scheme is based on sites where variables are bound in BGP
> within patterns.
> 
> This relates to the GROUP BY and expression handling.  Detecting when
> one expression (in the SELECT line) uses another (from the GROUP BY) is
> complex because complex expressions can be written in different ways yet
> be equivalent expressions.
> 
> Therefore, I suggest we do not require implementations to be able to
> perform such comparisons.

... and this!

> 
> My outline definition of potentially bound is a practical algorithm
> based on just the points where a variable can be bound (not filtered
> out).  There are only a few places where terms can be bound to variables.

Ok, my main point was that we'd need to have this defined in the spec clearly, which currently isn't the case.
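
A minimal sketch of what a binding-site reading of "potentially bound" could look like, assuming a toy pattern algebra (the class names BGP, Filter, Join, LeftJoin and Alt are hypothetical, not taken from the spec text). Variables are collected only where a basic graph pattern can bind them; FILTER expressions are never inspected, so an unsatisfiable filter such as (?Y != ?Y) does not change the result:

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class BGP:
    """Basic graph pattern: the only binding site in this toy algebra."""
    triples: Tuple[Tuple[str, str, str], ...]


@dataclass(frozen=True)
class Filter:
    """FILTER over a pattern; the expression is opaque and never analysed."""
    expr: str
    pattern: "Pattern"


@dataclass(frozen=True)
class Join:
    left: "Pattern"
    right: "Pattern"


@dataclass(frozen=True)
class LeftJoin:  # OPTIONAL
    left: "Pattern"
    right: "Pattern"


@dataclass(frozen=True)
class Alt:  # UNION
    left: "Pattern"
    right: "Pattern"


Pattern = Union[BGP, Filter, Join, LeftJoin, Alt]


def potentially_bound(p):
    """Return the variables that some binding site in p could bind."""
    if isinstance(p, BGP):
        return frozenset(
            term for triple in p.triples for term in triple if term.startswith("?")
        )
    if isinstance(p, Filter):
        # A filter can only remove solutions, so it is ignored here.
        return potentially_bound(p.pattern)
    if isinstance(p, (Join, LeftJoin, Alt)):
        return potentially_bound(p.left) | potentially_bound(p.right)
    raise TypeError(f"unknown pattern: {p!r}")


# Hypothetical query shape: { ?X :p ?Y  FILTER(?Y != ?Y) }
query = Filter("?Y != ?Y", BGP((("?X", ":p", "?Y"),)))
print(sorted(potentially_bound(query)))
```

Under this reading ?Y counts as potentially bound even though the filter can never be satisfied, which is the trade-off discussed above: the check stays purely syntactic at the cost of over-approximating.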

> > My proposal for rewording was maybe too restrictive, but it was clearly checkable statically.
> > BTW, the current wording for "SELECT *" is equally ambiguous.
> >
> > [...]
> >
> >>>> Unnecessarily severe.
> >>>
> >>> Fair enough, if we can afford it. Though it seems that expressions in GROUP BY are strictly speaking not necessary, and seem to be replaceable quite easily, so I wouldn't consider this restriction severe.
> >>
> >> It's severe because it's the corner case driving the main design.  And
> >> you were arguing for shorter syntax.
> >
> > Yes, but your version leaves us with something very restricted, it seems.
> > you say you'd disallow agg08...
> 
> Why is it "very restricted"?

You seem to restrict more than I would.
> 
> It's a restriction but I don't see it as /very/ restricting, especially
> as you have already shown that if the app needs the value of the
> grouping returned it can do so using a nested SELECT.
> 
> The balance is the difficulty of determining whether one expression is a
> sub-expression of another, including reordering and rewriting.
> 
> Consider
> 
> GROUP BY (1/?o)
> 
> then
> 
> SELECT (fn:floor(1/(-2*?o))+count(*))

Sure, but I had meant to allow only the *exact same* expression as the 
grouped expression as subexpression.

> is theoretically safe.  When two or more variables are involved, it gets
> complicated.
> 
> >> agg08 uses an expression for GROUP BY. I am suggesting, as a
> >> simplification, that it does not put ?O1 and ?O2, not (?O1+?O2), as
> >> legal uses in an expression in the SELECT clause.
> >
> > That would be a quite different query, wouldn't it? Can you show me what exactly your simplification means for the agg08 query?
> 
> agg08 would be an error because it uses variables in an expression which
> are not key variables of the group.

> > Let me try to understand again what you propose:
> > - you want to allow only grouped variables being projected or used in project expressions
> 
> Yes, understanding "grouped variables" as variables used in GROUP BY,
> but not in an expression.
> 
> > - you additionally want to allow grouping by expressions, but the grouped expressions are not reusable in the SELECT clause.
> > yes?
> 
> Yes.  I'm following the current doc which allows grouping by expression
> (syntax and definition reading ExprList as a list of expressions).
> 
> Group(ExprList, Ω) =
>    { ListEval(ExprList, μ) ->
>      { μ' | μ' in Ω, ListEval(ExprList, μ) = ListEval(ExprList, μ') }
>      | μ in Ω }
> 
> > If so, it seems our arguments run a bit past each other...
> > You seem to propose a stronger restriction than me for GROUPing, but a weaker restriction than mine for variables allowed as names in project expressions?
> 
> I suggest a stronger restriction on the variables allowed in project
> expressions (and projections and HAVING) in that it only considers
> variables.  This is because of the complexity of determining whether one
> expression is "safe" given an expression used for GROUP BY.
> 
> Otherwise we are trying to allow:
> 
> SELECT (?o1+?o2 AS ?o3) ... GROUP BY (?o2+?o1)

> SELECT (1/(?o1+?o2) AS ?o3) ... GROUP BY (?o2+?o1)
> 

Would've been allowed in my current understanding, but am not religious about it.

> unclear about:
> SELECT (fn:floor(2*?o1+2*?o2) AS ?o3) ... GROUP BY (?o2+?o1)
> 
not allowed in my understanding

> but not
> SELECT ?o1 ... GROUP BY (?o2+?o1)

clearly not allowed in my understanding (for obvious reasons: different ?o1 values can contribute to the same (?o2+?o1) value; actually, that is what agg08 should demonstrate).
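
To illustrate the point, here is a small sketch of the Group(ExprList, Ω) definition quoted earlier in this message, assuming solutions are plain Python dicts and expressions are callables. The sample solutions are made up for illustration and are not the agg08 test data:

```python
from collections import defaultdict


def list_eval(expr_list, mu):
    """ListEval(ExprList, mu): evaluate each grouping expression on one solution."""
    return tuple(expr(mu) for expr in expr_list)


def group(expr_list, omega):
    """Group(ExprList, Omega): map each key to the solutions that produce it."""
    groups = defaultdict(list)
    for mu in omega:
        groups[list_eval(expr_list, mu)].append(mu)
    return dict(groups)


# Hypothetical solution sequence (not the agg08 data).
omega = [
    {"o1": 1, "o2": 3},
    {"o1": 2, "o2": 2},
    {"o1": 5, "o2": 0},
]

# GROUP BY (?o2+?o1)
by_sum = group([lambda mu: mu["o2"] + mu["o1"]], omega)

# Distinct ?o1 values found inside each group.
o1_values = {
    key: sorted({mu["o1"] for mu in members}) for key, members in by_sum.items()
}
print(o1_values)
```

The group with key 4 contains two different ?o1 values (1 and 2), so projecting ?o1 from that group has no single well-defined value.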

> 
> I don't think that removing the possibility of GROUP BY with an
> expression would be particularly serious; however, there is no reason to
> forbid it (the issue is expressions in SELECT with constant value within
> a group, not the GROUP BY clause) and it is in the current draft.
> 
> I'm not sure what you propose. You have mentioned no expressions in
> GROUP BY and also allowing reuse of the same expression used in the
> GROUP BY in the select expressions.

yes.

>  For the latter, I haven't seen what
> equivalence of expressions,

syntactical equivalence, but...

> We could be consistent with SELECT expressions and go so far as to require the AS if an expression is used.

... that sounds reasonable to me as well.
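
A rough sketch of the strictest "syntactical equivalence" reading, assuming expressions are represented as nested tuples like ("+", "?o2", "?o1") and that aggregate arguments are exempt from the check (the tuple encoding, the AGGREGATES set and is_safe are hypothetical names introduced here). A SELECT expression is accepted only if every variable outside an aggregate is a grouped variable or sits inside a subtree identical to a GROUP BY expression:

```python
AGGREGATES = {"count", "sum", "min", "max", "avg", "sample", "group_concat"}


def is_safe(expr, group_keys):
    """Purely structural check; no reordering or algebraic rewriting is attempted."""
    if expr in group_keys:
        # Exactly the grouped variable or the grouped expression.
        return True
    if isinstance(expr, str):
        # A variable that is not a group key is rejected; other strings are constants.
        return not expr.startswith("?")
    if isinstance(expr, tuple):
        op, *args = expr
        if op in AGGREGATES:
            # Aggregates may range over any variable of the group.
            return True
        return all(is_safe(arg, group_keys) for arg in args)
    # Numeric or other literal constant.
    return True


keys = [("+", "?o2", "?o1")]  # GROUP BY (?o2+?o1)

print(is_safe(("/", 1, ("+", "?o2", "?o1")), keys))  # reuses the key verbatim
print(is_safe(("+", "?o1", "?o2"), keys))            # same value, different syntax
print(is_safe("?o1", keys))                          # bare ungrouped variable
```

Under this strict reading the reordered form (?o1+?o2) is rejected even though it always equals (?o2+?o1); accepting it would need an extra normalisation step (for example, sorting the operands of commutative operators), which is exactly the kind of equivalence reasoning whose cost is debated above.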


Axel

>
Received on Thursday, 26 August 2010 14:46:37 UTC