- From: Andy Seaborne <andy.seaborne@talis.com>
- Date: Sun, 07 Mar 2010 17:34:42 +0000
- To: Steve Harris <steve.harris@garlik.com>
- CC: "public-rdf-dawg@w3.org Group" <public-rdf-dawg@w3.org>
GROUP_CONCAT brings up a number of issues of how aggregates actually work. I've separated out an email about aggregates in general which includes a proposal and some test cases.

Some GROUP_CONCAT specific comments:

== Name:

I prefer something like "stragg" or "stringAgg" (c.f. xmlagg in PostgreSQL) because it only works on strings.

== Choice of separator

A single comma seems to me to be a less than ideal choice. Names are sometimes written "Family, Given". Often you want ", " (two chars) to get presentable lists.

A better choice might be one of the separator characters from Unicode that is unusual; then the application can easily do s/separator/myChoice/g on the result string to get the form it wants.

Some possible characters are:

INVISIBLE SEPARATOR     2063
group separator         001D
record separator        001E
unit separator          001F
sequence concatenation  2040

x1D/x1E/x1F all look like sensible choices.

Alternatively, don't define it and let the application environment make that choice; tests assume a particular setting (I don't like this but I mention it for completeness).

== Syntax for the separator

This is the second case of aggregate specific arguments, DISTINCT being the other.

The "aggregate modifiers" define a particular case of the aggregator operation to be applied over the elements of the group. It's like a second-order function - the URI identifies something that, when given some argument, produces the operation to be applied.

Proposal 4 captures this.

** Proposal 4a

GROUP_CONCAT[","](?name)

as a general aggregate syntax: all aggregates can take an [] argument list, including custom aggregates. We keep the DISTINCT special syntax but it is really another way to write aggregate modifiers.

I put it in this order because you can think of it as

(GROUP_CONCAT[","])(?name)

where (GROUP_CONCAT[","]) yields the specific aggregator for GROUP_CONCAT using ",".
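To make the second-order reading concrete, here is a minimal Python sketch, not SPARQL and not any engine's API. The names `group_concat` and `UNIT_SEPARATOR` and the sample data are assumptions made up for illustration. `group_concat(sep)` plays the role of GROUP_CONCAT[sep]: it takes the modifier and returns the aggregator, which is then applied to the group. The default separator is U+001F, rewritten afterwards by the application.

```python
from typing import Callable, Iterable

# U+001F (unit separator), one of the candidate default separators.
UNIT_SEPARATOR = "\u001f"


def group_concat(separator: str = UNIT_SEPARATOR) -> Callable[[Iterable[str]], str]:
    """Hypothetical stand-in for GROUP_CONCAT[separator]: returns the aggregator."""

    def aggregate(values: Iterable[str]) -> str:
        # join() over no values gives "", the value suggested for an empty group.
        return separator.join(values)

    return aggregate


# Made-up names in "Family, Given" form.
names = ["Smith, Alice", "Jones, Bob"]

# (GROUP_CONCAT[","])(?name): the commas inside the names make the result ambiguous.
with_comma = group_concat(",")(names)

# Default separator, then the application's own s/separator/myChoice/g step.
raw = group_concat()(names)
presentable = raw.replace(UNIT_SEPARATOR, "; ")

# With an unusual separator the original values can still be recovered.
recovered = raw.split(UNIT_SEPARATOR)

empty_group = group_concat()([])
```

With the comma separator the result cannot be split back into the two names, whereas the U+001F form can, which is the argument for an unusual default.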
== Defining GROUP_CONCAT

As far as I can see, in the current proposed design, groups of no elements can't occur (there would be an error and that particular group skipped). But if we do have a design that allows empty groups, the value of an empty GROUP_CONCAT should be "". This is fixed. See next message for a proposal where groups of no elements can occur.

Making the syntax special means it can't be written as a custom aggregate - that might seem minor but I think it's a reflection that we haven't fully understood the mechanism yet, especially if the custom aggregate mechanism is used to provide variations of the built-in aggregator semantics.

4/ Semantics and errors/unbound

unbound and error are the same in SPARQL.

Suppose we group a citations database by paper, then GROUP_CONCAT the authors of each paper: what happens if one paper has no authors? As currently suggested, the group row is dropped (execution error preparing the expressions multiset) but returning nothing or "" is more useful. This is covered in the general email and proposal for aggregation.

Andy

On 05/03/2010 11:40 AM, Steve Harris wrote:
> Hi all,
>
> Problem:
>
> There's no way to specify a separator string in the draft GROUP_CONCAT
> aggregate. I have a vague memory that we'd discussed this briefly
> somewhere, F2F2, or on a call maybe, but it's pretty hazy. This was
> brought up in Rob Vesse's recent comment.
>
> Proposal 1:
>
> Leave it as it is. Users cannot specify the separator character, it's
> fixed in the spec.
>
> Upside, very simple. Downside, might limit usefulness.
>
> Probably should make sure there's an escaping function in SPARQL 1.1
> that's compatible with the character.
>
> Proposal 2:
>
> If the GROUP_CONCAT expression list has more than one element, then the
> lexically last one is removed and used as the separator before being
> passed to the Aggregation() algebra function. e.g. GROUP_CONCAT(?x, ?y,
> "|")
>
> Upside, keeps the grammar simple.
> Downside makes the algebra around
> GROUP_CONCAT weird, might be surprising as the multi-expression
> behaviour will be different to other aggregates.
>
> e.g. in GROUP_CONCAT(?x, ?y) ?y will be an argument to the underlying
> function, not an expression. Would probably have to pick a value of ?y
> at random, a la SAMPLE(), as we don't require that "arguments" to
> aggregates are scalar.
>
> Proposal 3:
>
> Use MySQL syntax to specify it, i.e. GROUP_CONCAT(?x, ?y SEPARATOR "|").
>
> Upside, the same as MySQL (where GROUP_CONCAT comes from), avoids
> weirding algebra. Downside, makes the grammar more complex.
>
> Proposal 4:
>
> Like 3, but with some other explicit syntax. e.g. GROUP_CONCAT(?x,
> ?y)[SEPARATOR "|"]
>
> Upside, avoids weirding algebra. Downside, we have to think of our own
> syntax, no familiarity for MySQL users and probably makes the grammar
> more complex.
>
> ---
>
> My opinion:
>
> I'd take 3, or 1 happily, but I think 4 is a bit arbitrary, and 2 is
> really nasty.
>
> There's also other useful syntax around GROUP_CONCAT, e.g. ORDER BY, so
> I expect a future SPARQL will end up with something like 3 or 4 anyway.
>
> - Steve
>
Received on Sunday, 7 March 2010 17:35:18 UTC