Re: Separator string in GROUP_CONCAT()

On 7 Mar 2010, at 17:34, Andy Seaborne wrote:
> GROUP_CONCAT brings up a number of issues of how aggregates actually  
> work.  I've separated out an email about aggregates in general which  
> includes a proposal and some test cases.
> Some GROUP_CONCAT specific comments:
> == Name: I prefer something like "stragg" or "stringAgg" (c.f.  
> xmlagg in PostgreSQL) because it only works on strings.
> == Choice of separator
> A single comma seems to me to be a less than ideal choice.
>  Names are sometimes written "Family, Given"
>  Often you want ", " (two chars) for presentable lists.
> A better choice might be one of the separator characters from  
> Unicode that is unusual; then the application can easily do
>  s/separator/myChoice/g
> on the result string to get the form it wants.
> Some possible characters are:
> group separator		001D
> record separator 	001E
> unit separator		001F
> sequence concatenation	2040
> x1D/x1E/x1F all look sensible possible choices.

Agreed, 0x1d-0x1f are the ASCII-inherited control characters of  
course. 0x1d appeals, being the group separator :)
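
To illustrate the substitution step Andy describes, here is a minimal
Python sketch, assuming the aggregate joins values with U+001D. The
function name group_concat and the sample names are hypothetical and
not taken from the draft.

```python
# Hypothetical sketch: GROUP_CONCAT with a fixed, unusual separator (U+001D).
GROUP_SEPARATOR = "\x1d"  # ASCII/Unicode group separator


def group_concat(values):
    """Join the string values of one group with the fixed separator."""
    return GROUP_SEPARATOR.join(values)


# Names containing commas survive intact because the separator is not a comma.
names = ["Seaborne, Andy", "Harris, Steve"]
raw = group_concat(names)

# The application rewrites the separator to whatever it wants to display,
# i.e. the s/separator/myChoice/g step.
display = raw.replace(GROUP_SEPARATOR, "; ")
```

The application can also split on the separator to recover the original
values, which a plain comma would not allow for "Family, Given" names.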

> Alternatively, don't define it and let the application environment  
> make that choice; tests assume a particular setting (I don't like  
> this but I mention it for completeness).
> == Syntax for the separator
> This is the second case of aggregate specific arguments, DISTINCT  
> being the other.  The "aggregate modifiers" define a particular case  
> of the aggregator operation to be applied over the elements of the  
> group. It's like a second-order function - the URI identifies  
> something that when given some argument produces the operation to be  
> applied.
> Proposal 4 captures this.
> ** Proposal 4a
>    GROUP_CONCAT[","](?name)
> as a general aggregate syntax, all aggregates can take an []  
> argument list, including custom aggregates.

I find this syntax very appealing, might be because it's reminiscent  
of TeX though!

The ability to apply it to custom aggregates as well is good. I can't  
quite imagine a clean ORDER BY syntax using this though, could be  
quoted as a string I suppose?

GROUP_CONCAT[",", "DESC(strlen(?x))"](?x, ?y)

Would the [] arguments take only constants? If not there are issues  
around grouping and values, or the [] arguments have to be passed by  
reference too.

A minor variation would be to use named arguments:

GROUP_CONCAT[separator="|", orderby="DESC(strlen(?x))", limit=10](?x)

A bit verbose though.

> We keep the DISTINCT special syntax but it is really another way to  
> write aggregate modifiers.

Yes, it could be equivalent to [distinct=true] in some way, if we went  
down that route, though DISTINCT does apply to the expression list in  
a consistent way, so I don't think there's any real benefit.

> I put it in this order because you can think of it as
> (GROUP_CONCAT[","])(?name)
> where
>    GROUP_CONCAT[","]
> yields the specific aggregator for GROUP_CONCAT using ",".

As a sort of currying?
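
A rough Python sketch of that reading, assuming the [] modifiers are
constants fixed before any group is seen. The factory group_concat and
its keyword names (separator, order_key, reverse, limit) are
hypothetical stand-ins for the modifiers discussed above, not proposed
syntax.

```python
# Hypothetical sketch: the modifiers select a concrete aggregator,
# which is then applied to the values of a group.
def group_concat(separator=",", order_key=None, reverse=False, limit=None):
    """Return an aggregator configured by the [] modifiers."""
    def aggregator(values):
        items = list(values)
        if order_key is not None:
            # Order the group's values before concatenating them.
            items = sorted(items, key=order_key, reverse=reverse)
        if limit is not None:
            # Keep only the first `limit` values after ordering.
            items = items[:limit]
        return separator.join(items)
    return aggregator


# Roughly GROUP_CONCAT[separator="|", orderby="DESC(strlen(?x))", limit=2](?x)
longest_first = group_concat(separator="|", order_key=len, reverse=True, limit=2)
result = longest_first(["bb", "a", "cccc"])
```

Here the first call plays the part of GROUP_CONCAT[...] and the second
call applies the resulting aggregator to the group.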

> == Defining GROUP_CONCAT
> As far as I can see, in the current proposed design, groups of no  
> elements can't occur (there would be an error and that  
> particular group skipped).
> But if we do have a design that allows empty groups, the value of an  
> empty GROUP_CONCAT should be "".  This is fixed.

Agreed. This is implicit in the definition (fn:string-join({}, " ") is  
the empty string), but it's worth spelling out.
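
The same behaviour can be seen in a small Python analogue, assuming
str.join is an acceptable stand-in for fn:string-join. The function
name group_concat is hypothetical.

```python
# Hypothetical analogue of fn:string-join applied to a group's values.
def group_concat(values, separator=" "):
    """Join the values of a group; an empty group gives the empty string."""
    return separator.join(values)


empty = group_concat([])       # group with no elements
single = group_concat(["a"])   # separator never appears for one element
```

The result for the empty group does not depend on which separator is
chosen.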

> See next message for a proposal where groups of no elements can occur.
> Making the syntax special, means it can't be written as custom  
> aggregate - that might seem minor but I think it's a reflection that  
> we haven't fully understood the mechanism yet especially if the  
> custom aggregate mechanism is used to provide variations of the  
> built-in aggregator semantics.

Yes, probably.

> 4/ Semantics and errors/unbound
> unbound and error are the same in SPARQL.
> Suppose we group a citations database by paper, then GROUP_CONCAT  
> the authors of each paper, what happens if one paper has no  
> authors?  As currently suggested, the group row is dropped  
> (execution error preparing the expressions multiset) but returning  
> nothing or "" is more useful.

Yes, definitely. There are other cases that are not so clear-cut, but  
I've not read the other mail yet.
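
A minimal Python sketch of the two behaviours for the citations
example, assuming an unbound author is modelled as None. The data, the
function group_authors and the flag drop_on_unbound are all
hypothetical.

```python
from collections import defaultdict

# Hypothetical citation data: (paper, author) pairs; None marks an unbound author.
rows = [
    ("paper1", "Alice"),
    ("paper1", "Bob"),
    ("paper2", None),  # paper with no authors
]


def group_authors(rows, drop_on_unbound):
    """Group by paper and concatenate authors under one of two policies."""
    groups = defaultdict(list)
    for paper, author in rows:
        groups[paper].append(author)

    result = {}
    for paper, authors in groups.items():
        if any(a is None for a in authors):
            if drop_on_unbound:
                continue  # as currently suggested: the group row is dropped
            # Alternative: skip the unbound values and aggregate the rest.
            authors = [a for a in authors if a is not None]
        result[paper] = ", ".join(authors)
    return result


strict = group_authors(rows, drop_on_unbound=True)
lenient = group_authors(rows, drop_on_unbound=False)
```

Under the strict policy paper2 disappears from the results; under the
lenient one it is kept with "" as its author list.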

- Steve

> On 05/03/2010 11:40 AM, Steve Harris wrote:
>> Hi all,
>> Problem:
>> There's no way to specify a separator string in the draft  
>> aggregate. I have a vague memory that we'd discussed this briefly
>> somewhere, F2F2, or on a call maybe, but it's pretty hazy. This was
>> brought up in Rob Vesse's recent comment.
>> Proposal 1:
>> Leave it as it is. Users cannot specify the separator character, it's
>> fixed in the spec.
>> Upside, very simple. Downside, might limit usefulness.
>> Probably should make sure there's an escaping function in SPARQL 1.1
>> that's compatible with the character.
>> Proposal 2:
>> If the GROUP_CONCAT expression list has more than one element, then the
>> lexically last one is removed and used as the separator before being
>> passed to the Aggregation() algebra function. e.g.
>> GROUP_CONCAT(?x, ?y, "|")
>> Upside, keeps the grammar simple. Downside makes the algebra around
>> GROUP_CONCAT weird, might be surprising as the multi-expression
>> behaviour will be different to other aggregates.
>> e.g. in GROUP_CONCAT(?x, ?y) ?y will be an argument to the underlying
>> function, not an expression. Would probably have to pick a value of ?y
>> at random, a la SAMPLE(), as we don't require that "arguments" to
>> aggregates are scalar.
>> Proposal 3:
>> Use MySQL syntax to specify it, i.e. GROUP_CONCAT(?x, ?y SEPARATOR "|").
>> Upside, the same as MySQL (where GROUP_CONCAT comes from), avoids
>> weirding algebra. Downside, makes the grammar more complex.
>> Proposal 4:
>> Like 3, but with some other explicit syntax. e.g.
>> GROUP_CONCAT(?x, ?y)[SEPARATOR "|"]
>> Upside, avoids weirding algebra. Downside, we have to think of our own
>> syntax, no familiarity for MySQL users and probably makes the grammar
>> more complex.
>> ---
>> My opinion:
>> I'd take 3, or 1 happily, but I think 4 is a bit arbitrary, and 2 is
>> really nasty.
>> There's also other useful syntax around GROUP_CONCAT, e.g. ORDER BY, so
>> I expect a future SPARQL will end up with something like 3 or 4 anyway.
>> - Steve

Steve Harris, Garlik Limited
2 Sheen Road, Richmond, TW9 1AE, UK
+44 20 8973 2465
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10  

Received on Sunday, 7 March 2010 19:49:18 UTC