- From: Steve Harris <steve.harris@garlik.com>
- Date: Sun, 7 Mar 2010 19:48:48 +0000
- To: Andy Seaborne <andy.seaborne@talis.com>
- Cc: "public-rdf-dawg@w3.org Group" <public-rdf-dawg@w3.org>
On 7 Mar 2010, at 17:34, Andy Seaborne wrote:

> GROUP_CONCAT brings up a number of issues of how aggregates actually
> work. I've separated out an email about aggregates in general which
> includes a proposal and some test cases.
>
> Some GROUP_CONCAT specific comments:
>
> == Name: I prefer something like "stragg" or "stringAgg" (c.f.
> xmlagg in PostgreSQL) because it only works on strings.
>
> == Choice of separator
>
> A single comma seems to me to be a less than ideal choice.
>   Names are sometimes written "Family, Given"
>   Often you want ", " (two chars) to get presentable lists.
>
> A better choice might be one of the separator characters from
> Unicode that is unusual, then the application can easily do
>
>   s/separator/myChoice/g
>
> on the result string to get the form it wants.
>
> Some possible characters are:
>
>   INVISIBLE SEPARATOR     2063
>   group separator         001D
>   record separator        001E
>   unit separator          001F
>   sequence concatenation  2040
>
> x1D/x1E/x1F all look sensible possible choices.

Agreed, 0x1d-0x1f are the ASCII-inherited control characters of
course. 0x1d appeals, being the group separator :)

> Alternatively, don't define it and let the application environment
> make that choice; tests assume a particular setting (I don't like
> this but I mention it for completeness).
>
> == Syntax for the separator
>
> This is the second case of aggregate specific arguments, DISTINCT
> being the other. The "aggregate modifiers" define a particular case
> of the aggregator operation to be applied over the elements of the
> group. It's like a second-order function - the URI identifies
> something that when given some argument produces the operation to be
> applied.
>
> Proposal 4 captures this.
>
> ** Proposal 4a
>
>   GROUP_CONCAT[","](?name)
>
> as a general aggregate syntax, all aggregates can take an []
> argument list, including custom aggregates.

I find this syntax very appealing, might be because it's reminiscent
of TeX though!
The ability to apply it to custom aggregates as well is good.

I can't quite imagine a clean ORDER BY syntax using this though, could
be quoted as a string I suppose?

  GROUP_CONCAT[",", "DESC(strlen(?x))"](?x, ?y)

Would the [] arguments take only constants? If not there are issues
around grouping and values, or the [] arguments have to be passed by
reference too.

A minor variation would be to use named arguments:

  GROUP_CONCAT[separator="|", orderby="DESC(strlen(?x))", limit=10](?x)

A bit verbose though.

> We keep the DISTINCT special syntax but it is really another way to
> write aggregate modifiers.

Yes, it could be equivalent to [distinct=true] in some way, if we went
down that route, though DISTINCT does apply to the expression list in
a consistent way, so I don't think there's any real benefit.

> I put it in this order because you can think of it as
>
>   (GROUP_CONCAT[","])(?name)
>
> where
>
>   (GROUP_CONCAT[","])
>
> yields the specific aggregator for GROUP_CONCAT using ",".

As a sort of currying?

> == Defining GROUP_CONCAT
>
> As far as I can see, in the current proposed design, groups of no
> elements can't occur (there would be an error and that
> particular group skipped).
>
> But if we do have a design that allows empty groups, the value of an
> empty GROUP_CONCAT should be "". This is fixed.

Agreed. This is implicit in the definition (fn:string-join({}, " ") is
the empty string), but it's worth spelling out.

> See next message for a proposal where groups of no elements can occur.
>
> Making the syntax special, means it can't be written as custom
> aggregate - that might seem minor but I think it's a reflection that
> we haven't fully understood the mechanism yet especially if the
> custom aggregate mechanism is used to provide variations of the
> built-in aggregator semantics.

Yes, probably.

> 4/ Semantics and errors/unbound
>
> unbound and error are the same in SPARQL.
>
> Suppose we group a citations database by paper, then GROUP_CONCAT
> the authors of each paper, what happens if one paper has no
> authors? As currently suggested, the group row is dropped
> (execution error preparing the expressions multiset) but returning
> nothing or "" is more useful.

Yes, definitely. There are other cases that are not so clear-cut, but
I've not read the other mail yet.

- Steve

> On 05/03/2010 11:40 AM, Steve Harris wrote:
>> Hi all,
>>
>> Problem:
>>
>> There's no way to specify a separator string in the draft
>> GROUP_CONCAT aggregate. I have a vague memory that we'd discussed
>> this briefly somewhere, F2F2, or on a call maybe, but it's pretty
>> hazy. This was brought up in Rob Vesse's recent comment.
>>
>> Proposal 1:
>>
>> Leave it as it is. Users cannot specify the separator character, it's
>> fixed in the spec.
>>
>> Upside, very simple. Downside, might limit usefulness.
>>
>> Probably should make sure there's an escaping function in SPARQL 1.1
>> that's compatible with the character.
>>
>> Proposal 2:
>>
>> If the GROUP_CONCAT expression list has more than one element, then
>> the lexically last one is removed and used as the separator before
>> being passed to the Aggregation() algebra function. e.g.
>> GROUP_CONCAT(?x, ?y, "|")
>>
>> Upside, keeps the grammar simple. Downside makes the algebra around
>> GROUP_CONCAT weird, might be surprising as the multi-expression
>> behaviour will be different to other aggregates.
>>
>> e.g. in GROUP_CONCAT(?x, ?y) ?y will be an argument to the underlying
>> function, not an expression. Would probably have to pick a value
>> of ?y at random, a la SAMPLE(), as we don't require that "arguments"
>> to aggregates are scalar.
>>
>> Proposal 3:
>>
>> Use MySQL syntax to specify it, i.e. GROUP_CONCAT(?x, ?y SEPARATOR
>> "|").
>>
>> Upside, the same as MySQL (where GROUP_CONCAT comes from), avoids
>> weirding algebra. Downside, makes the grammar more complex.
>>
>> Proposal 4:
>>
>> Like 3, but with some other explicit syntax. e.g. GROUP_CONCAT(?x,
>> ?y)[SEPARATOR "|"]
>>
>> Upside, avoids weirding algebra. Downside, we have to think of our
>> own syntax, no familiarity for MySQL users and probably makes the
>> grammar more complex.
>>
>> ---
>>
>> My opinion:
>>
>> I'd take 3, or 1 happily, but I think 4 is a bit arbitrary, and 2 is
>> really nasty.
>>
>> There's also other useful syntax around GROUP_CONCAT, e.g. ORDER BY,
>> so I expect a future SPARQL will end up with something like 3 or 4
>> anyway.
>>
>> - Steve

-- 
Steve Harris, Garlik Limited
2 Sheen Road, Richmond, TW9 1AE, UK
+44 20 8973 2465  http://www.garlik.com/
Registered in England and Wales 535 7233  VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
Received on Sunday, 7 March 2010 19:49:18 UTC
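
For readers who want something concrete, the "sort of currying" reading of
(GROUP_CONCAT[","])(?name) can be sketched in Python. This is only an
illustration of the idea discussed in the thread, not the semantics of any
SPARQL draft: the function name group_concat, the sample data, and the
choice to skip unbound values (None here) instead of dropping the group are
all assumptions made for the example. The default separator is U+001D, the
group separator favoured above, which the application then swaps for its
own choice, as in s/separator/myChoice/g.

```python
from typing import Callable, Iterable, Optional

# U+001D, the ASCII group separator suggested in the thread.
GROUP_SEPARATOR = "\x1d"


def group_concat(
    separator: str = GROUP_SEPARATOR,
) -> Callable[[Iterable[Optional[str]]], str]:
    """Return an aggregator specialised for one separator (hypothetical helper)."""

    def aggregate(values: Iterable[Optional[str]]) -> str:
        # Assumption: unbound values (None) are skipped, so a group with
        # nothing bound yields "" instead of being dropped.
        return separator.join(v for v in values if v is not None)

    return aggregate


# Made-up citation data: one paper has no authors at all.
authors_by_paper = {
    "paper-1": ["Smith, Ann", "Jones, Bob"],
    "paper-2": [],
}

concat = group_concat()  # the specialised aggregator
raw = {paper: concat(authors) for paper, authors in authors_by_paper.items()}

# The application picks its own presentation separator afterwards.
pretty = {paper: s.replace(GROUP_SEPARATOR, "; ") for paper, s in raw.items()}
```

Because the names contain commas ("Family, Given"), joining on an unusual
control character and substituting later keeps the list unambiguous, and the
paper with no authors still produces a row whose value is the empty string.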