Semantics of multi-expression aggregators from Steve Harris on 2010-03-17 (public-rdf-dawg@w3.org from January to March 2010)

From: Steve Harris <steve.harris@garlik.com>
Date: Wed, 17 Mar 2010 17:44:37 +0000
To: "public-rdf-dawg@w3.org Group" <public-rdf-dawg@w3.org>
Message-Id: <7BE72F3D-FECA-4D55-B6A7-A6D2AA988596@garlik.com>

Hi all,

The Problem:

Some SQL implementations (at least Sybase, Postgres, Oracle) support  
multi-expression aggregates, but not with the multiset semantics as in  
the current working draft.

An example from Postgres is the CORR(a, b) aggregate, which can be  
used like:

w    x    y
1    1    2
1    2    3
1    3    4
2    1    1
2    2    2

SELECT w, CORR(x, y) AS z FROM A GROUP BY w;

Following current SPARQL draft the equivalent:

SELECT ?w (CORR(?x, ?y) AS ?z) WHERE { ?w :x ?x ; :y ?y } GROUP BY ?w)

would evaluate as

[Res A]
w    z
1    CORR({1, 2, 2, 3, 3, 4})
2    CORR({1, 1, 2, 2})

But Postgres etc. users will be expecting

[Res B]
w    z
1    CORR({(1, 2), (2, 3), (3, 4)})
2    CORR({(1, 1), (2, 2)})

----

So, there are 3 proposals that make sense to me:

Option 1:

Ban multi expression aggregates, leave decision to future working group.

Advantage: easy, can get consensus on what is best to do in future.  
Common situation in SQL engines (MS SQL Server, MySQL, SQLite, ...).
Disadvantage: no way to implement stats functions aggregates (for  
e.g.) within standard. COUNT(?x, ?y) equivalent becomes more verbose.

Option 2:

Stick with WD semantics, multi expression aggregates expand to a set  
of values, as Res A above.

Advantage: makes things like COUNT(?x, ?y) easy, algebra is simple
Disadvantage: rules out things like CORR, unless we specify expression  
ordering is preserved, even if we do that the semantics of them will  
be a little strange. What does CORR(?a, ?b , ?c) do?

Option 3:

Define (multi expression?) aggregates as producing a multiset of  
lists, as Res B above.

Advantage: makes it easy to define stats aggregates in the future (I'm  
not proposing we do them in this round, it's a bit too much to bite  
off IMHO).
Disadvantage: makes defn. of COUNT() etc. a bit more complex. Makes  
algebra a bit more complex. Questions around whether COUNT({(1, 2),  
(3, 4)}) = COUNT({1, 2, 3, 4}) etc.

----

My preference is probably Option 3, but I could live with Option 1.  
Option 2 is OK, just we have to accept that stats aggregates in the  
future will be a bit messy.

- Steve

-- 
Steve Harris, Garlik Limited
2 Sheen Road, Richmond, TW9 1AE, UK
+44 20 8973 2465  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10  
9AD

Received on Wednesday, 17 March 2010 17:45:07 UTC