Re: Separator string in GROUP_CONCAT() from Steve Harris on 2010-03-08 (public-rdf-dawg@w3.org from January to March 2010)

From: Steve Harris <steve.harris@garlik.com>
Date: Mon, 8 Mar 2010 14:44:41 +0000
To: Andy Seaborne <andy.seaborne@talis.com>
Cc: "public-rdf-dawg@w3.org Group" <public-rdf-dawg@w3.org>
Message-Id: <081A8A61-C5E4-4F5C-81F5-36DC5612E453@garlik.com>
On 8 Mar 2010, at 10:46, Andy Seaborne wrote:
>
> On 07/03/2010 7:48 PM, Steve Harris wrote:
>> On 7 Mar 2010, at 17:34, Andy Seaborne wrote:
>>>
>
>>> Some possible characters are:
>>>
>>> INVISIBLE SEPARATOR 2063
>>> group separator 001D
>>> record separator 001E
>>> unit separator 001F
>>> sequence concatenation 2040
>>>
>>> x1D/x1E/x1F all look sensible possible choices.
>>
>> Agreed, 0x1d-0x1f are the ASCII-inherited control characters of  
>> course.
>> 0x1d appeals, being the group separator :)
>
> If we go this way, then we will need to check the detailed  
> documentation of each.

Yup. I believe that they're hierarchical, 0x1f being the deepest/ 
finest level.
http://en.wikipedia.org/wiki/C0_and_C1_control_codes appears to back  
this up.

I have a nagging concern that people will just assume that the  
separator character won't appear in text, and won't bother to escape,  
which of course RDF doesn't guarantee.

>>> Proposal 4 captures this.
>>>
>>> ** Proposal 4a
>>>
>>> GROUP_CONCAT[","](?name)
>>>
>>> as a general aggregate syntax, all aggregates can take an []  
>>> argument
>>> list, including custom aggregates.
>>
>> I find this syntax very appealing, might be because it's  
>> reminiscent of
>> TeX though!
>>
>> The ability to apply it to custom aggregates as well is good. I can't
>> quite imagine a clean ORDER BY syntax using this though, could be  
>> quoted
>> as a string I suppose?
>>
>> GROUP_CONCAT[",", "DESC(strlen(?x))"](?x, ?y)
>
> PostgreSQL recommends a subquery for the nearest equivalent:
>
> SELECT xmlagg(x) FROM (SELECT * FROM test ORDER BY y DESC) AS tab;
>
> so GROUP_CONCAT over a pre setup intemediate result gives the full  
> capability of ORDER BY LIMIt etc without
>
> SELECT GROUP_CONCAT(?x)
> {
>   SELECT ?x
>   {
>      ....
>   }
>   ORDER BY
>   LIMIT
> }
>
> We would need to tighten up the spec text aroudn subqueries to make  
> this perfect
>
> ** Suggestion: This exactly one subSELECT query pattern of an outer  
> SELECT preserves order.

I'll have to ruminate on that, it seems a little strange at first  
glance, but is a pretty likely behaviour of many systems anyway.  
Probably worth a separate discussion.

>> Would the [] arguments take only constants? If not there are issues
>> around grouping and values, or the [] arguments have to be passed by
>> reference too.
>
> I was assuming they would be expressions with variables limited to  
> group key variables or outer variables (i.e. the ones that are the  
> same for the whole of each partition) and can be eval'ed before the  
> partitioning starts.

Or, presumably the result of an aggregate, e.g. SAMPLE()? I.e. the  
same rules as projected variables.

>> A minor variation would be to use named arguments:
>>
>> GROUP_CONCAT[separator="|", orderby="DESC(strlen(?x))", limit=10](?x)
>>
>> A bit verbose though.
>
> Have you been Ruby programming recently? :-)

Heh, no, I haven't gone over to the light side. People do this in Perl  
too you know :)

>>> We keep the DISTINCT special syntax but it is really another way to
>>> write aggregate modifiers.
>>
>> Yes, it could be equivalent to [distinct=true] in some way, if we  
>> went
>> down that route, though DISTINCT does apply to the expression list  
>> in a
>> consistent way, so I don't think there's any real benefit.
>
> Yes - the DISTINCT case is so well know we ought to at least copy  
> the syntax as a special case.

Right.

>>> I put it in this order because you can think of it as
>>>
>>> (GROUP_CONCAT[","])(?name)
>>>
>>> where
>>>
>>> (GROUP_CONCAT[","])
>>>
>>> yields the specific aggregator for GROUP_CONCAT using ",".
>>
>> As a sort of currying?
>
> Or Schönfinkelisation [1]
>
> It's like it, with the restriction that all the GROUP_CONCAT  
> aggregate argument must be defined.  Can't have:
>
> ((GROUP_CONCAT[","])[limit=10)(?name)
>
> to apply the first, get a new, partially ground aggregator and then  
> ground it further.

Fair enough.

For the record, my (marginal) preference for named arguments is for  
forward compatibility. If we allow/encourage [] as an extension  
mechanism to built in functions (eg. limit in GROUP_CONCAT), then  
relying on argument position is a bit risky, as different  
implementations will put the arguments in different orders.

It's still potentially ambiguous because "limit" could be a char limit  
(as in MySQL) or a result limit, or a set size limit.  
[<uri>=<value>, ...] just isn't that tempting though :)

- Steve

-- 
Steve Harris, Garlik Limited
2 Sheen Road, Richmond, TW9 1AE, UK
+44 20 8973 2465  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10  
9AD
Received on Monday, 8 March 2010 14:45:11 UTC