Re: Semantics of multi-expression aggregators from Andy Seaborne on 2010-03-19 (public-rdf-dawg@w3.org from January to March 2010)

From: Andy Seaborne <andy.seaborne@talis.com>
Date: Fri, 19 Mar 2010 09:48:19 +0000
To: Steve Harris <steve.harris@garlik.com>
CC: "public-rdf-dawg@w3.org Group" <public-rdf-dawg@w3.org>
Message-ID: <4BA34863.7040900@talis.com>

Option 1: -1

Our custom aggregate expressions are supposed to be like function calls 
(with the addition of [] and DISTINCT), so they already support multiple 
expressions.  It would need to be AGG(,) or some other 
different-from-function-call syntax to limit custom aggregate syntax to 
one argument in the grammar rules.

Option 2: -1

I don't read the current doc as saying clearly that multiple expressions 
are supported.  The nearest text seems to be (and we already know this 
needs reworking):

"""
Aggregation applies a function func to a multiset of expressions.
"""
I read that as saying the multiset of expressions comes from one 
expression per row in the partition.

It may have been the intention to read it as collapsing expressions 
across rows and across partition elements but I wasn't reading it that way.
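
To make the two readings concrete, here is a rough Python sketch. It is only an illustration: the dict-per-row representation of solutions and the names partition, exprs, per_row and flattened are assumptions of mine, not anything in the document.

```python
# Hypothetical partition: the w = 1 rows from the example quoted below.
partition = [{"x": 1, "y": 2}, {"x": 2, "y": 3}, {"x": 3, "y": 4}]
exprs = [lambda mu: mu["x"], lambda mu: mu["y"]]

# Reading 1: a single expression contributes one value per row.
per_row = [exprs[0](mu) for mu in partition]

# Reading 2: all expressions are collapsed across rows into one flat multiset.
flattened = [expr(mu) for mu in partition for expr in exprs]

print(per_row)    # [1, 2, 3]
print(flattened)  # [1, 2, 2, 3, 3, 4]
```

The second list is what [Res A] in the quoted message shows for w = 1.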

Option 3: +1

-------

Let Ω be a partition.

ExprMultiSet(Ω) =
   { eval(exprlist,μ) | μ in Ω such that eval(exprlist,μ) is defined }
   UNION
   { e | μ in Ω such that eval(exprlist,μ) is undefined }

where
exprlist = (expr1, expr2, ...)
eval(exprlist,μ) = (eval(expr1,μ), eval(expr2,μ), ...)
and is undefined if any eval(exprN,μ) is undefined

where "e" is some symbol that is distinct from all RDF terms.

card[x in Ω]:
   if DISTINCT:
      card[x] = 1 if there exists x in ExprMultiSet(Ω)
      card[x] = 0 otherwise
   else
      card[x] = count of μ in Ω such that x = eval(exprlist,μ)
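
A minimal Python sketch of the definition above, under stated assumptions: solutions are plain dicts, expressions are callables, an unbound variable (KeyError) is the only kind of evaluation error, and "e" is counted once per failing row in the non-DISTINCT case. The names ERROR, eval_exprlist and expr_multiset are made up for illustration.

```python
from collections import Counter

# Sentinel standing in for "e": a symbol distinct from all RDF terms.
ERROR = object()

def eval_exprlist(exprs, mu):
    """Evaluate every expression against mu; None if any one is undefined."""
    values = []
    for expr in exprs:
        try:
            values.append(expr(mu))
        except KeyError:  # assumption: unbound variable = undefined
            return None
    return tuple(values)

def expr_multiset(exprs, partition, distinct=False):
    """Map each tuple (or ERROR) to its cardinality in the partition."""
    card = Counter()
    for mu in partition:
        value = eval_exprlist(exprs, mu)
        card[ERROR if value is None else value] += 1
    if distinct:
        # DISTINCT: every element that occurs has cardinality 1.
        card = Counter(dict.fromkeys(card, 1))
    return card

# The w = 1 rows from the example quoted below.
group_w1 = [{"x": 1, "y": 2}, {"x": 2, "y": 3}, {"x": 3, "y": 4}]
exprs = [lambda mu: mu["x"], lambda mu: mu["y"]]
result = expr_multiset(exprs, group_w1)
# result holds the pairs (1, 2), (2, 3), (3, 4), each with cardinality 1.
```

That is the multiset of lists shown in [Res B] for w = 1.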

--------

Alternative: put "e" in the list for any bad evaluations, and remove the 
UNION.
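
One possible reading of this alternative, again as a hedged Python sketch: each failing expression puts "e" in its own position of the list, so no separate UNION branch is needed. The dict-based solutions, the KeyError-as-error convention and the names ERROR, eval_exprlist_with_errors and expr_multiset_alt are assumptions for illustration only.

```python
from collections import Counter

# Sentinel standing in for "e": a symbol distinct from all RDF terms.
ERROR = object()

def eval_exprlist_with_errors(exprs, mu):
    """Evaluate every expression; a failed one yields ERROR in its slot."""
    values = []
    for expr in exprs:
        try:
            values.append(expr(mu))
        except KeyError:  # assumption: unbound variable = bad evaluation
            values.append(ERROR)
    return tuple(values)

def expr_multiset_alt(exprs, partition):
    """Every row contributes exactly one list, so no UNION is required."""
    return Counter(eval_exprlist_with_errors(exprs, mu) for mu in partition)

rows = [{"x": 1, "y": 2}, {"x": 5}]  # second row has no binding for y
exprs = [lambda mu: mu["x"], lambda mu: mu["y"]]
alt = expr_multiset_alt(exprs, rows)
# alt holds (1, 2) and (5, ERROR), each with cardinality 1.
```

An aggregate then decides for itself how to treat a list that contains "e".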

SUM, COUNT, MIN, MAX, AVG - single expression (DISTINCT? ?x)
COUNT(*), COUNT(DISTINCT *)

SAMPLE, GROUP_CONCAT -- multiple argument expressions.


COUNT with more than one argument seems to be a MySQL-ism and, according 
to the MySQL documentation (5.1), only these three forms exist:

COUNT(expr)
COUNT(*)
COUNT(DISTINCT expr,[expr...])

and not COUNT(DISTINCT *)

 Andy



On 17/03/2010 5:44 PM, Steve Harris wrote:
> Hi all,
>
> The Problem:
>
> Some SQL implementations (at least Sybase, Postgres, Oracle) support
> multi-expression aggregates, but not with the multiset semantics as in
> the current working draft.
>
> An example from Postgres is the CORR(a, b) aggregate, which can be used
> like:
>
> w x y
> 1 1 2
> 1 2 3
> 1 3 4
> 2 1 1
> 2 2 2
>
> SELECT w, CORR(x, y) AS z FROM A GROUP BY w;
>
> Following current SPARQL draft the equivalent:
>
> SELECT ?w (CORR(?x, ?y) AS ?z) WHERE { ?w :x ?x ; :y ?y } GROUP BY ?w
>
> would evaluate as
>
> [Res A]
> w z
> 1 CORR({1, 2, 2, 3, 3, 4})
> 2 CORR({1, 1, 2, 2})
>
> But Postgres etc. users will be expecting
>
> [Res B]
> w z
> 1 CORR({(1, 2), (2, 3), (3, 4)})
> 2 CORR({(1, 1), (2, 2)})
>
> ----
>
> So, there are 3 proposals that make sense to me:
>
> Option 1:
>
> Ban multi expression aggregates, leave decision to future working group.
>
> Advantage: easy, can get consensus on what is best to do in future.
> Common situation in SQL engines (MS SQL Server, MySQL, SQLite, ...).
> Disadvantage: no way to implement stats function aggregates (for example)
> within the standard. COUNT(?x, ?y) equivalent becomes more verbose.
>
> Option 2:
>
> Stick with WD semantics, multi expression aggregates expand to a set of
> values, as Res A above.
>
> Advantage: makes things like COUNT(?x, ?y) easy, algebra is simple
> Disadvantage: rules out things like CORR, unless we specify that expression
> ordering is preserved. Even if we do that, their semantics will be
> a little strange. What does CORR(?a, ?b, ?c) do?
>
> Option 3:
>
> Define (multi expression?) aggregates as producing a multiset of lists,
> as Res B above.
>
> Advantage: makes it easy to define stats aggregates in the future (I'm
> not proposing we do them in this round, it's a bit too much to bite off
> IMHO).
> Disadvantage: makes defn. of COUNT() etc. a bit more complex. Makes
> algebra a bit more complex. Questions around whether COUNT({(1, 2), (3,
> 4)}) = COUNT({1, 2, 3, 4}) etc.
>
> ----
>
> My preference is probably Option 3, but I could live with Option 1.
> Option 2 is OK, just we have to accept that stats aggregates in the
> future will be a bit messy.
>
> - Steve
>
Received on Friday, 19 March 2010 09:49:04 UTC