Re: ungrouped variables used in projections - Further implications? from Andy Seaborne on 2010-08-25 (public-rdf-dawg@w3.org from July to September 2010)

From: Andy Seaborne <andy.seaborne@epimorphics.com>
Date: Wed, 25 Aug 2010 20:18:13 +0100
To: Axel Polleres <axel.polleres@deri.org>
CC: SPARQL Working Group <public-rdf-dawg@w3.org>, Lee Feigenbaum <lee@thefigtrees.net>, Steve Harris <steve.harris@garlik.com>
Message-ID: <4C756C75.5020402@epimorphics.com>

On 25/08/10 19:07, Axel Polleres wrote:
>
> On 25 Aug 2010, at 18:33, Andy Seaborne wrote:
>
>> On 25/08/10 18:15, Axel Polleres wrote:
>>> Thanks for these very useful examples, Andy! (which I think brought me to another
>>> imprecise formulation in the spec)
>>>
>>> Questions for clarification, to make sure everybody is on the same page here:
>>>
>>> 1)
>>>> SELECT *
>>>> {
>>>>       { SELECT ?x { ?x ?p ?o } GROUP BY ?x }
>>>>       ?o <p> 123 .
>>>> }
>>>
>>> Yup, we want to allow this, right?
>>
>> Yes
>
> ok.
>
>>
>>>
>>> 2)
>>>>     SELECT (count(*) AS ?p) { ?s ?p ?o } GROUP BY ?s
>>> ...
>>>>     SELECT (SAMPLE(?p) AS ?p) { ?s ?p ?o } GROUP BY ?s
>>>
>>> This is seemingly (but strangely enough not quite?) in conflict with:
>>> "The new variable is introduced using the keyword AS; it must not already be potentially
>>> bound."
>>>
>>> I'd honestly prefer somehow to strengthen this restriction to:
>>>
>>> "The new variable is introduced using the keyword AS; it must not already occur in the WHERE clause."
>>
>> Disagree - the GROUP example puts the inner variable out of scope.
>
> I don't really understand? With what exactly do you disagree?
>
> I think we both agree that the current wording doesn't
> "The new variable is introduced using the keyword AS; it must not already be potentially bound."
> apply to your example.
>
> My proposal was to strengthen this restriction such that your examples would also be forbidden,
> is it this you are disagreeing with or do you disagree that my rewording catches your example?

I disagree with your rewording and the additional restriction.

I see no reason that a name should not be introduced (by AS) if it does 
not conflict with anything.  If the pattern does not expose a name, it 
should be possible to use the name.  This aids composition - people do 
seem to create large queries by working on fragments.

SELECT (?s AS ?subject) (?t AS ?p)
{
     {SELECT DISTINCT ?s {?s ?p ?o}} # Hides  ?p ?o
     ?s rdf:type ?t
}

Just because something looks like bad style is not a reason to ban it.
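
To make the scoping argument concrete, here is a minimal Python sketch of the check, under stated assumptions: the pattern model (`Triple`, `SubSelect`) and the function names are hypothetical, invented for illustration, and not taken from the spec or from any implementation. It contrasts "visible from the pattern" with "occurs anywhere in the WHERE clause" for the query above.

```python
from dataclasses import dataclass
from typing import List, Set, Union


@dataclass
class Triple:
    s: str
    p: str
    o: str


@dataclass
class SubSelect:
    projected: List[str]        # names exposed by the inner SELECT
    pattern: List["Element"]    # the inner WHERE pattern


Element = Union[Triple, SubSelect]


def _triple_vars(t: Triple) -> Set[str]:
    return {term for term in (t.s, t.p, t.o) if term.startswith("?")}


def visible_vars(pattern: List[Element]) -> Set[str]:
    """Variables the pattern exposes; a sub-select exposes only its projection."""
    out: Set[str] = set()
    for el in pattern:
        out |= set(el.projected) if isinstance(el, SubSelect) else _triple_vars(el)
    return out


def occurring_vars(pattern: List[Element]) -> Set[str]:
    """Every variable mentioned anywhere, including inside sub-selects."""
    out: Set[str] = set()
    for el in pattern:
        if isinstance(el, SubSelect):
            out |= set(el.projected) | occurring_vars(el.pattern)
        else:
            out |= _triple_vars(el)
    return out


def as_conflicts(aliases: List[str], pattern: List[Element]) -> List[str]:
    """Alias names (introduced by AS) that clash with a visible variable."""
    return sorted(set(aliases) & visible_vars(pattern))


# SELECT (?s AS ?subject) (?t AS ?p)
# { {SELECT DISTINCT ?s {?s ?p ?o}}  ?s rdf:type ?t }
inner = SubSelect(projected=["?s"], pattern=[Triple("?s", "?p", "?o")])
outer = [inner, Triple("?s", "rdf:type", "?t")]

print(as_conflicts(["?subject", "?p"], outer))   # no clash: ?p is hidden
print("?p" in occurring_vars(outer))             # the stricter rule would see ?p
```

Under the visibility rule the alias ?p is acceptable because the inner SELECT hides it; under the "must not occur in the WHERE clause" rule the same query would be rejected.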

>> An inner SELECT/project would do much the same - it's not just GROUPing.
>
> Well, the strengthened restriction would also forbid variables occurring in a nested
> query in the WHERE clause.

Quite - and unnecessarily so.

>> Building queries by combining tested fragments is made much harder if
>> there are whole-query rules that mean a fragment worked on its own
>> breaks a larger query.
>
> We have this effect already with the current restriction and I don't see why it gets
> more difficult by strengthening the restriction.

Do we?  Where?  Variables can be hidden by subqueries.

>>> Funny enough, note that the original "potentially bound" formulation is problematic/imprecise already
>>> without aggregates:
>>>
>>>    SELECT (?X as ?Y) WHERE { ?S ?P ?X OPTIONAL { ?S ?P ?Y FILTER(?Y != ?Y) } }
>>>
>>> Obviously, there is no way that ?Y ever returns a binding by the FILTER expression...
>>> so it is not "potentially bound" and that query would be syntactically ok, according to the definition.
>>> I guess many will agree that checking static unsatisfiability of FILTER expressions would be a nightmare for parsers :-)
>
> Any opinions on this? This actually worries me about the current  "potentially bound" wording.

If we want a static analysis of the query, then regard ?Y as potentially 
bound.

(we have avoided dynamic analysis errors to date - it's very hard to 
send the error mid way through a query - HTTP does not like that so it 
would require running to completion before sending the HTTP return code 
and hence any results)

>>> 3)
>>>> Personally, I'd be happy with forbidding the use of variables of grouping
>>>> expressions:
>>>>
>>>>     SELECT (1/(1-?o) AS ?o1) { ?s ?p ?o } GROUP BY (1/(1-?o)) # Forbiddable
>>>>     SELECT ?o WHERE { ?s ?p ?o } GROUP BY (1/(1-?o)) # Forbiddable
>>
>>> Without expressing any strong opinion here: This rules out the new test case agg08, or, resp.,
>>> turns it into a negativeSyntaxTest. I had assumed for the current version of agg08 that the
>>> former would be allowed whereas the latter wouldn't. That's why I had "*or expressions*" in
>>> my rewording proposal.
>>
>> It does - it's a trade off - testing whether an expression is the same
>> as another is tricky.
>>
>>> I assume what Andy means here (and which I think holds) is that we could forbid expressions
>>> in Grouping altogether, since they can always be emulated by subqueries, i.e.
>>
>> Not what I mean.
>>
>> I am suggesting simplifying by not requiring an impl to spot when two
>> expressions are the same.
>
> ... but you would still allow the same expressions, i.e agg08 would still be fine, yes?

No.

----
SELECT ((?O1 + ?O2) AS ?O12) (COUNT(?O1) AS ?C)
WHERE { ?S :p ?O1; :q ?O2 } GROUP BY (?O1 + ?O2)
ORDER BY ?O12
-----

agg08 uses an expression for GROUP BY.  I am suggesting, as a 
simplification, that grouping by an expression does not make ?O1 or 
?O2, nor (?O1 + ?O2), legal to use in an expression in the SELECT clause.

> Then, I don't really understand your rewording proposal:
>
> ============
> [[
> In aggregate queries and sub-queries, variables that appear in the query
> pattern, but are not used to group the pattern, cannot be projected nor
> used in expressions in SELECT clause nor used in the expression of a
> HAVING clause of this query or sub-query unless they are part of an
> aggregate.
>
> They may be used as alias names.
>
> In order to project arbitrary expressions the SAMPLE aggregate may be used.
> ]]
>
> By saying "expressions" the use as alias names comes for free but it's
> clearer to say so.
> ============
>
> Can you explain what you mean by "alias names" exactly?

New variable names introduced with AS.

> You mean to capture the same as I said with *or expressions* in my rewording, or something more general? I think we'd need to explain that notion.
>
>>
>> SELECT (1/(-?o+1) AS ?o1) ... GROUP BY (1/(1-?o))
>>          ^^^^^^^^^^
>
> Aah, different; I overlooked that, sorry.
>
>>
>> Use of ?o in any expression in projection (or HAVING - it's the same
>> thing) is forbidden.
>>
>>>      SELECT (1/(1-?o) AS ?o1) { ?s ?p ?o } GROUP BY (1/(1-?o))
>>>
>>> could be written without expression in the GROUP BY clause as:
>>>
>>>      SELECT ?o1 { SELECT (1/(1-?o) AS ?o1) { ?s ?p ?o } } GROUP BY ?o1
>>>
>>> So, why not just do that and forbid expressions in GROUP BY in the grammar already?
>>
>> Unnecessarily severe.
>
> Fair enough, if we can afford it. Though it seems that expressions in GROUP BY are strictly speaking not necessary, and seem to be replaceable quite easily, so I wouldn't consider this restriction severe.

It's severe because it's the corner case driving the main design.  And 
you were arguing for shorter syntax.

ARQ actually works by introducing a hidden variable for each aggregate, 
so its use in HAVING or SELECT clauses is just a use of that variable, 
with a single evaluation of the aggregate's value for each group.
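
As a rough illustration of that idea (not ARQ's actual code: the function, the `.agg0` naming and the data layout are all assumptions made for this sketch), each aggregate can be evaluated once per group and bound to a hidden name, after which HAVING and SELECT only read that binding:

```python
from collections import defaultdict


def evaluate_grouped(rows, key, aggregates, having, select):
    """Group rows by `key`, evaluate each aggregate once per group under a
    hidden variable name, then apply HAVING and the projection to the
    resulting binding."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    results = []
    for key_value, members in groups.items():
        binding = {key: key_value}
        for hidden_name, aggregate_fn in aggregates.items():
            binding[hidden_name] = aggregate_fn(members)   # evaluated once
        if having(binding):
            # `select` maps output names to names in the binding (the AS step)
            results.append({out: binding[src] for out, src in select.items()})
    return results


# Roughly: SELECT ?s (COUNT(*) AS ?c) { ?s ?p ?o } GROUP BY ?s HAVING (COUNT(*) > 1)
rows = [
    {"?s": "a", "?o": 1},
    {"?s": "a", "?o": 2},
    {"?s": "b", "?o": 3},
]
result = evaluate_grouped(
    rows,
    key="?s",
    aggregates={".agg0": len},                  # hypothetical hidden variable
    having=lambda b: b[".agg0"] > 1,
    select={"?s": "?s", "?c": ".agg0"},
)
print(result)
```

Because both clauses refer to the same hidden name, there is no need to decide whether the COUNT(*) in HAVING and the COUNT(*) in SELECT are "the same expression".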

>> Doing that because of the minor issue that expressions in SELECT are tricky
>> seems to have the balance all wrong.
>
> You mean expressions in GROUP BY, yes?
>
>>
>>> 4) BTW, what about
>>>        SELECT * { ?s ?p ?o } GROUP BY ?s
>>>    Just to make sure everybody is on the same page here: is this also forbidden?
>>
>> No - it's natural.
>
> What I meant to say is currently it would be... reading * as a shortcut of all variables occurring in the WHERE clause.... BTW, the current formulation
> "The syntax SELECT * is an abbreviation that selects all of the variables that could be bound in a query."
> has the same problem as the "potentially bound" formulation mentioned earlier
> ... so we need to reformulate that anyways.

The section is from SPARQL 1.0.

"Potentially bound" is a static analysis (well, it is for ARQ) of the 
query based on use in BGPs, GRAPH (so not if used in a FILTER alone) and 
now the name introduction forms.
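
A minimal sketch of such a static analysis, assuming a hypothetical tuple-based pattern tree (the node shapes below are invented for illustration and are not ARQ's API). FILTER contributes nothing, so no satisfiability reasoning is needed:

```python
def potentially_bound(node):
    """Variables that may be bound by a pattern, decided purely by syntax."""
    kind = node[0]
    if kind == "bgp":                      # ("bgp", [(s, p, o), ...])
        return {term for triple in node[1] for term in triple
                if term.startswith("?")}
    if kind == "graph":                    # ("graph", name, child)
        name = {node[1]} if node[1].startswith("?") else set()
        return name | potentially_bound(node[2])
    if kind == "filter":                   # ("filter", [vars used in the expr])
        return set()                       # use in a FILTER alone binds nothing
    if kind == "assign":                   # ("assign", var), a name introduced by AS
        return {node[1]}
    if kind in ("group", "optional"):      # (kind, [children])
        out = set()
        for child in node[1]:
            out |= potentially_bound(child)
        return out
    raise ValueError(f"unknown node kind: {kind!r}")


# { ?S ?P ?X OPTIONAL { ?S ?P ?Y FILTER(?Y != ?Y) } }
axel_example = ("group", [
    ("bgp", [("?S", "?P", "?X")]),
    ("optional", [
        ("bgp", [("?S", "?P", "?Y")]),
        ("filter", ["?Y"]),
    ]),
])

# { ?s ?p ?o FILTER(?z > 1) }
filter_only = ("group", [
    ("bgp", [("?s", "?p", "?o")]),
    ("filter", ["?z"]),
])

print(sorted(potentially_bound(axel_example)))
print("?z" in potentially_bound(filter_only))
```

With this reading, ?Y in the OPTIONAL example counts as potentially bound even though the FILTER can never succeed, so `(?X AS ?Y)` would be rejected without evaluating anything.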

>> Define the scoping of a group as the key variables (not expressions used
>> in GROUP BY) and it works out easily.
>
> We need to find the right wording, since the notation "key" is only explained back in the algebra,
> introducing it for defining the restrictions on variables further up in the spec already might be difficult?

My suggestion is just to cover variables, not expressions, used in GROUP 
BY.  Simple enough?
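
Sketched in Python, assuming the caller has already collected the variables used outside any aggregate in a SELECT or HAVING expression (the function names and the string representation of GROUP BY entries are hypothetical):

```python
import re

# A GROUP BY entry counts as a key variable only if it is a bare variable.
BARE_VARIABLE = re.compile(r"\?[A-Za-z_][A-Za-z0-9_]*")


def group_scope(group_by):
    """Key variables of a group; expressions in GROUP BY contribute nothing."""
    return {entry for entry in group_by if BARE_VARIABLE.fullmatch(entry)}


def illegal_uses(vars_outside_aggregates, group_by):
    """Variables used outside an aggregate that are not in the group's scope."""
    return sorted(set(vars_outside_aggregates) - group_scope(group_by))


# SELECT * { ?s ?p ?o } GROUP BY ?s           -> scope is just ?s
print(group_scope(["?s"]))

# SELECT (1/(1-?o) AS ?o1) ... GROUP BY (1/(1-?o))
print(illegal_uses(["?o"], ["(1/(1-?o))"]))

# agg08: SELECT ((?O1 + ?O2) AS ?O12) (COUNT(?O1) AS ?C) ... GROUP BY (?O1 + ?O2)
# COUNT(?O1) is inside an aggregate, so only the first expression is checked.
print(illegal_uses(["?O1", "?O2"], ["(?O1 + ?O2)"]))
```

No comparison of expressions is involved: the check only asks whether a variable is one of the key variables.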

>
> Axel

	Andy
Received on Wednesday, 25 August 2010 19:19:02 UTC