Re: Separator string in GROUP_CONCAT() from Andy Seaborne on 2010-03-07 (public-rdf-dawg@w3.org from January to March 2010)

From: Andy Seaborne <andy.seaborne@talis.com>
Date: Sun, 07 Mar 2010 17:34:42 +0000
To: Steve Harris <steve.harris@garlik.com>
CC: "public-rdf-dawg@w3.org Group" <public-rdf-dawg@w3.org>
Message-ID: <4B93E3B2.2030003@talis.com>
GROUP_CONCAT brings up a number of issues of how aggregates actually 
work.  I've separated out an email about aggregates in general which 
includes a proposal and some test cases.


Some GROUP_CONCAT specific comments:

== Name: I prefer something like "stragg" or "stringAgg" (c.f. xmlagg in 
PostgreSQL) because it only works on strings.

== Choice of separator

A single comma seems to me to be a less than ideal choice.
   Names are sometimes written "Family, Given"
   Often you want ", " (two chars) to get presentable lists.

A better choice might be one of the unusual separator characters from 
Unicode; the application can then easily do

   s/separator/myChoice/g

on the result string to get the form it wants.
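
A minimal sketch of that post-processing step, in Python. It assumes the 
engine joins values with U+001F (unit separator), which is only one of the 
candidates in this message; the names DEFAULT_SEP, group_concat and present 
are hypothetical and not part of any proposal.

```python
# Hypothetical fixed separator: U+001F (unit separator), an assumption.
DEFAULT_SEP = "\u001f"


def group_concat(values, sep=DEFAULT_SEP):
    """Join string values the way an engine with a fixed separator might."""
    return sep.join(values)


def present(result, my_choice=", ", sep=DEFAULT_SEP):
    """Application-side equivalent of s/separator/myChoice/g."""
    return result.replace(sep, my_choice)


# Values that themselves contain a comma survive the round trip.
names = ["Seaborne, Andy", "Harris, Steve"]
raw = group_concat(names)
print(present(raw, "; "))  # Seaborne, Andy; Harris, Steve
```

Because the unusual character is unlikely to occur in the data, the 
replacement cannot be confused by commas inside the values themselves.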

Some possible characters are:

invisible separator     U+2063
group separator         U+001D
record separator        U+001E
unit separator          U+001F
sequence concatenation  U+2040

x1D/x1E/x1F all look like sensible choices.

Alternatively, don't define it and let the application environment make 
that choice; tests assume a particular setting (I don't like this but I 
mention it for completeness).

== Syntax for the separator

This is the second case of aggregate-specific arguments, DISTINCT being 
the other.  The "aggregate modifiers" define a particular case of the 
aggregator operation to be applied over the elements of the group. It's 
like a second-order function: the URI identifies something that, when 
given some arguments, produces the operation to be applied.

Proposal 4 captures this.

** Proposal 4a

     GROUP_CONCAT[","](?name)

as a general aggregate syntax: all aggregates, including custom 
aggregates, can take an [] argument list.

We keep the DISTINCT special syntax but it is really another way to 
write aggregate modifiers.

I put it in this order because you can think of it as

  (GROUP_CONCAT[","])(?name)

where

  (GROUP_CONCAT[","])

yields the specific aggregator for GROUP_CONCAT using ",".
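
The second-order reading can be sketched in Python as a function that takes 
the modifier and returns the aggregator. This is only an illustration of the 
idea, not a definition of the semantics; the name group_concat and the "," 
default are assumptions.

```python
def group_concat(separator=","):
    """Take the aggregate modifier and return the specific aggregator."""
    def aggregator(values):
        # The returned operation is what gets applied to the group's elements.
        return separator.join(values)
    return aggregator


comma_concat = group_concat(",")        # plays the role of (GROUP_CONCAT[","])
print(comma_concat(["a", "b", "c"]))    # a,b,c
print(group_concat(" | ")(["a", "b"]))  # a | b
```

A custom aggregate would fit the same shape: anything that maps an argument 
list to an operation over the group.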


== Defining GROUP_CONCAT

As far as I can see, in the current proposed design, groups of no 
elements can't occur (there would be an error and that particular 
group would be skipped).

But if we do have a design that allows empty groups, the value of an 
empty GROUP_CONCAT should be "".  This is fixed.

See next message for a proposal where groups of no elements can occur.

Making the syntax special means it can't be written as a custom 
aggregate. That might seem minor, but I think it's a reflection that we 
haven't fully understood the mechanism yet, especially if the custom 
aggregate mechanism is used to provide variations of the built-in 
aggregator semantics.

== Semantics and errors/unbound

unbound and error are the same in SPARQL.

Suppose we group a citations database by paper, then GROUP_CONCAT the 
authors of each paper.  What happens if one paper has no authors?  As 
currently suggested, the group row is dropped (execution error preparing 
the expressions multiset), but returning nothing or "" is more useful.
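
The two behaviours can be compared with a small Python sketch. It models an 
unbound author as None and an execution error as ValueError; both are 
assumptions made for illustration, as are all the function names, and the 
"strict" variant is only one reading of the currently suggested design.

```python
def concat_strict(values, sep=","):
    """Treat any unbound value (None), or an empty group, as an error."""
    if not values or any(v is None for v in values):
        raise ValueError("error evaluating group")
    return sep.join(values)


def concat_lenient(values, sep=","):
    """Skip unbound values; a group with nothing to join yields ""."""
    return sep.join(v for v in values if v is not None)


def aggregate(groups, agg):
    """Apply agg to each group, dropping groups whose aggregate errors."""
    out = {}
    for key, values in groups.items():
        try:
            out[key] = agg(values)
        except ValueError:
            pass  # the group row is dropped
    return out


# paper2 has no authors: its only row leaves the author unbound.
papers = {"paper1": ["Ann", "Bob"], "paper2": [None]}
print(aggregate(papers, concat_strict))   # {'paper1': 'Ann,Bob'}
print(aggregate(papers, concat_lenient))  # {'paper1': 'Ann,Bob', 'paper2': ''}
```

With the lenient variant the paper without authors still appears in the 
results, which is the more useful outcome argued for above.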

This is covered in the general email and proposal for aggregation.

 Andy


On 05/03/2010 11:40 AM, Steve Harris wrote:
> Hi all,
>
> Problem:
>
> There's no way to specify a separator string in the draft GROUP_CONCAT
> aggregate. I have a vague memory that we'd discussed this briefly
> somewhere, F2F2, or on a call maybe, but it's pretty hazy. This was
> brought up in Rob Vesse's recent comment.
>
> Proposal 1:
>
> Leave it as it is. Users cannot specify the separator character, it's
> fixed in the spec.
>
> Upside, very simple. Downside, might limit usefulness.
>
> Probably should make sure there's an escaping function in SPARQL 1.1
> that's compatible with the character.
>
> Proposal 2:
>
> If the GROUP_CONCAT expression list has more than one element, then the
> lexically last one is removed and used as the separator before being
> passed to the Aggregation() algebra function. e.g. GROUP_CONCAT(?x, ?y,
> "|")
>
> Upside, keeps the grammar simple. Downside makes the algebra around
> GROUP_CONCAT weird, might be surprising as the multi-expression
> behaviour will be different to other aggregates.
>
> e.g. in GROUP_CONCAT(?x, ?y) ?y will be an argument to the underlying
> function, not an expression. Would probably have to pick a value of ?y
> at random, a la SAMPLE(), as we don't require that "arguments" to
> aggregates are scalar.
>
> Proposal 3:
>
> Use MySQL syntax to specify it, i.e. GROUP_CONCAT(?x, ?y SEPARATOR "|").
>
> Upside, the same as MySQL (where GROUP_CONCAT comes from), avoids
> weirding algebra. Downside, makes the grammar more complex.
>
> Proposal 4:
>
> Like 3, but with some other explicit syntax. e.g. GROUP_CONCAT(?x,
> ?y)[SEPARATOR "|"]
>
> Upside, avoids weirding algebra. Downside, we have to think of our own
> syntax, no familiarity for MySQL users and probably makes the grammar
> more complex.
>
> ---
>
> My opinion:
>
> I'd take 3, or 1 happily, but I think 4 is a bit arbitrary, and 2 is
> really nasty.
>
> There's also other useful syntax around GROUP_CONCAT, e.g. ORDER BY, so
> I expect a future SPARQL will end up with something like 3 or 4 anyway.
>
> - Steve
>
Received on Sunday, 7 March 2010 17:35:18 UTC