Re: Separator string in GROUP_CONCAT() from Steve Harris on 2010-03-07 (public-rdf-dawg@w3.org from January to March 2010)

From: Steve Harris <steve.harris@garlik.com>
Date: Sun, 7 Mar 2010 19:48:48 +0000
To: Andy Seaborne <andy.seaborne@talis.com>
Cc: "public-rdf-dawg@w3.org Group" <public-rdf-dawg@w3.org>
Message-Id: <BE228624-078B-4AAC-B183-A29250631D54@garlik.com>
On 7 Mar 2010, at 17:34, Andy Seaborne wrote:
>
> GROUP_CONCAT brings up a number of issues of how aggregates actually  
> work.  I've separated out an email about aggregates in general which  
> includes a proposal and some test cases.
>
>
> Some GROUP_CONCAT specific comments:
>
> == Name: I prefer something like "stragg" or "stringAgg" (c.f.  
> xmlagg in PostgreSQL) because it only works on strings.
>
> == Choice of separator
>
> A single comma seems to me to be a less than ideal choice.
>  Names are sometimes written "Family, Given"
>  Often you want ", " (two chars) for presentable lists.
>
> A better choice might be one of the more unusual separator characters
> from Unicode; then the application can easily do
>
>  s/separator/myChoice/g
>
> on the result string to get the form it wants.
>
> Some possible characters are:
>
> INVISIBLE SEPARATOR	2063
> group separator		001D
> record separator 	001E
> unit separator		001F
> sequence concatenation	2040
>
> x1D/x1E/x1F all look sensible possible choices.

Agreed, 0x1d-0x1f are the ASCII-inherited control characters of  
course. 0x1d appeals, being the group separator :)
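
A minimal Python sketch of the post-processing step described above, assuming the aggregate always joins with U+001D and the application substitutes its own separator afterwards. The function names (group_concat, present) are illustrative only and are not part of any proposal.

```python
# Assumption: the spec fixes the separator to U+001D (ASCII group separator).
GROUP_SEPARATOR = "\x1d"

def group_concat(values, separator=GROUP_SEPARATOR):
    """Join the string values of one group with the fixed separator."""
    return separator.join(values)

def present(joined, my_choice=", ", separator=GROUP_SEPARATOR):
    """Application-side equivalent of s/separator/myChoice/g."""
    return joined.replace(separator, my_choice)

# Names that already contain a comma survive, because the fixed
# separator is a control character that is unlikely to appear in data.
names = ["Harris, Steve", "Seaborne, Andy"]
raw = group_concat(names)
shown = present(raw, "; ")
```

The point of the unusual character is that the replacement step cannot be confused by commas inside the values themselves.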

> Alternatively, don't define it and let the application environment  
> make that choice; tests assume a particular setting (I don't like  
> this but I mention it for completeness).
>
> == Syntax for the separator
>
> This is the second case of aggregate specific arguments, DISTINCT  
> being the other.  The "aggregate modifiers" define a particular case  
> of the aggregator operation to be applied over the elements of the  
> group. It's like a second-order function - the URI identifies  
> something that when given some argument produces the operation to be  
> applied.
>
> Proposal 4 captures this.
>
> ** Proposal 4a
>
>    GROUP_CONCAT[","](?name)
>
> as a general aggregate syntax, all aggregates can take an []  
> argument list, including custom aggregates.

I find this syntax very appealing, might be because it's reminiscent  
of TeX though!

The ability to apply it to custom aggregates as well is good. I can't  
quite imagine a clean ORDER BY syntax using this though, could be  
quoted as a string I suppose?

GROUP_CONCAT[",", "DESC(strlen(?x))"](?x, ?y)

Would the [] arguments take only constants? If not there are issues  
around grouping and values, or the [] arguments have to be passed by  
reference too.

A minor variation would be to use named arguments:

GROUP_CONCAT[separator="|", orderby="DESC(strlen(?x))", limit=10](?x)

A bit verbose though.
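
To illustrate what constant-only named modifiers could mean operationally, here is a rough Python sketch. It assumes the modifiers are fixed before the group is processed; order_key=len with descending=True stands in for the hypothetical orderby="DESC(strlen(?x))", and none of these parameter names come from the draft.

```python
def group_concat(values, separator=",", order_key=None,
                 descending=False, limit=None):
    """Concatenate one group's values under constant modifiers.

    All modifier names here are hypothetical stand-ins for the
    [separator=..., orderby=..., limit=...] idea.
    """
    items = list(values)
    if order_key is not None:
        # Order the group members before joining them.
        items = sorted(items, key=order_key, reverse=descending)
    if limit is not None:
        # Keep only the first `limit` members after ordering.
        items = items[:limit]
    return separator.join(items)

result = group_concat(["bb", "a", "cccc"], separator="|",
                      order_key=len, descending=True, limit=2)
```

Because every modifier is a constant, none of them needs a per-row value, which sidesteps the grouping question raised above.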

> We keep the DISTINCT special syntax but it is really another way to  
> write aggregate modifiers.

Yes, it could be equivalent to [distinct=true] in some way, if we went  
down that route, though DISTINCT does apply to the expression list in  
a consistent way, so I don't think there's any real benefit.

> I put it in this order because you can think of it as
>
> (GROUP_CONCAT[","])(?name)
>
> where
>
> (GROUP_CONCAT[","])
>
> yields the specific aggregator for GROUP_CONCAT using ",".

As a sort of currying?
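
One way to read the analogy, as a small Python sketch: the outer call fixes the separator and yields the specific aggregator, which is then applied to the group. The names are illustrative and this is not a proposed API.

```python
def group_concat(separator):
    """Return the aggregator specialised for one separator."""
    def aggregator(values):
        # The returned function is what gets applied to the group.
        return separator.join(values)
    return aggregator

# Corresponds to (GROUP_CONCAT[","]) in the notation above.
comma_concat = group_concat(",")

# Corresponds to (GROUP_CONCAT[","])(?name) over one group.
joined = comma_concat(["Alice", "Bob"])
```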

> == Defining GROUP_CONCAT
>
> As far as I can see, in the current proposed design, groups of no
> elements can't occur (there would be an error and that
> particular group would be skipped).
>
> But if we do have a design that allows empty groups, the value of an  
> empty GROUP_CONCAT should be "".  This is fixed.

Agreed. This is implicit in the definition (fn:string-join({}, " ") is  
the empty string), but it's worth spelling out.
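
The same behaviour shows up in a rough Python analogue of fn:string-join, sketched here only to spell out the empty case. string_join is a hypothetical helper name, not the XPath function itself.

```python
def string_join(values, separator):
    """Rough Python analogue of fn:string-join (an assumption, not the spec)."""
    return separator.join(values)

empty = string_join([], " ")      # no elements: no separators either
single = string_join(["a"], " ")  # one element: separator never used
```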

> See next message for a proposal where groups of no elements can occur.
>
> Making the syntax special means it can't be written as a custom
> aggregate - that might seem minor but I think it's a reflection that
> we haven't fully understood the mechanism yet, especially if the
> custom aggregate mechanism is used to provide variations of the
> built-in aggregator semantics.

Yes, probably.

> 4/ Semantics and errors/unbound
>
> unbound and error are the same in SPARQL.
>
> Suppose we group a citations database by paper, then GROUP_CONCAT  
> the authors of each paper, what happens if one paper has no  
> authors?  As currently suggested, the group row is dropped  
> (execution error preparing the expressions multiset) but returning  
> nothing or "" is more useful.

Yes, definitely. There are other cases that are not so clear-cut, but
I've not read the other mail yet.
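
A small Python sketch contrasting the two behaviours for the citations example. It assumes an unbound author is modelled as None and that a paper with no authors contributes a single row with an unbound value; both function names are made up for illustration.

```python
def concat_strict(groups, separator=","):
    """Drop any group that contains an unbound (None) value."""
    out = {}
    for key, values in groups.items():
        if any(v is None for v in values):
            continue  # error while preparing the multiset: group row dropped
        out[key] = separator.join(values)
    return out

def concat_lenient(groups, separator=","):
    """Skip unbound values, so a group with no usable values yields ""."""
    out = {}
    for key, values in groups.items():
        out[key] = separator.join(v for v in values if v is not None)
    return out

groups = {"paper1": ["Alice", "Bob"], "paper2": [None]}
strict = concat_strict(groups)
lenient = concat_lenient(groups)
```

Under the strict reading paper2 disappears from the results entirely, while the lenient reading keeps it with an empty string.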

- Steve

> On 05/03/2010 11:40 AM, Steve Harris wrote:
>> Hi all,
>>
>> Problem:
>>
>> There's no way to specify a separator string in the draft GROUP_CONCAT
>> aggregate. I have a vague memory that we'd discussed this briefly
>> somewhere, F2F2, or on a call maybe, but it's pretty hazy. This was
>> brought up in Rob Vesse's recent comment.
>>
>> Proposal 1:
>>
>> Leave it as it is. Users cannot specify the separator character, it's
>> fixed in the spec.
>>
>> Upside, very simple. Downside, might limit usefulness.
>>
>> Probably should make sure there's an escaping function in SPARQL 1.1
>> that's compatible with the character.
>>
>> Proposal 2:
>>
>> If the GROUP_CONCAT expression list has more than one element, then the
>> lexically last one is removed and used as the separator before being
>> passed to the Aggregation() algebra function, e.g.
>> GROUP_CONCAT(?x, ?y, "|")
>>
>> Upside, keeps the grammar simple. Downside makes the algebra around
>> GROUP_CONCAT weird, might be surprising as the multi-expression
>> behaviour will be different to other aggregates.
>>
>> e.g. in GROUP_CONCAT(?x, ?y) ?y will be an argument to the underlying
>> function, not an expression. Would probably have to pick a value of ?y
>> at random, a la SAMPLE(), as we don't require that "arguments" to
>> aggregates are scalar.
>>
>> Proposal 3:
>>
>> Use MySQL syntax to specify it, i.e. GROUP_CONCAT(?x, ?y SEPARATOR "|").
>>
>> Upside, the same as MySQL (where GROUP_CONCAT comes from), avoids
>> weirding algebra. Downside, makes the grammar more complex.
>>
>> Proposal 4:
>>
>> Like 3, but with some other explicit syntax. e.g. GROUP_CONCAT(?x,
>> ?y)[SEPARATOR "|"]
>>
>> Upside, avoids weirding algebra. Downside, we have to think of our own
>> syntax, no familiarity for MySQL users and probably makes the grammar
>> more complex.
>>
>> ---
>>
>> My opinion:
>>
>> I'd take 3, or 1 happily, but I think 4 is a bit arbitrary, and 2 is
>> really nasty.
>>
>> There's also other useful syntax around GROUP_CONCAT, e.g. ORDER BY, so
>> I expect a future SPARQL will end up with something like 3 or 4 anyway.
>>
>> - Steve
>>

-- 
Steve Harris, Garlik Limited
2 Sheen Road, Richmond, TW9 1AE, UK
+44 20 8973 2465  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
Received on Sunday, 7 March 2010 19:49:18 UTC