Re: SUM aggregate operator and non-numeric literals from Steve Harris on 2011-10-19 (public-rdf-dawg-comments@w3.org from October 2011)

From: Steve Harris <steve.harris@garlik.com>
Date: Wed, 19 Oct 2011 15:38:17 +0100
To: public-rdf-dawg-comments@w3.org
Cc: Jeen Broekstra <jeen.broekstra@gmail.com>
Message-Id: <750B3AED-7B5D-4CE0-91B3-EB189C0275C8@garlik.com>

Jeen Broekstra:

 > Hi DAWG,
 >
 > The current definition of SUM (section 18.4) is as follows :
 >
 > ==begin quote==
 > Definition: Sum
 > numeric Sum(multiset M)
 >
 > The Sum set function is used by the SUM aggregate in the syntax.
 >
 > Sum(M) = Sum(ToList(Flatten(M))).
 >
 > Sum(S) = op:numeric-add(S1, Sum(S2..n)) when card[S] > 1
 > Sum(S) = op:numeric-add(S1, 0) when card[S] = 1
 > Sum(S) = 0 when card[S] = 0
 >
 > In this way, Sum({1, 2, 3}) = op:numeric-add(1, op:numeric-add(2, op:numeric-add(3, 0))).
 > ==end quote==
 >
 > Given that the definition of SUM is directly in terms of the op:numeric-add
 > XPath function, it follows that it can only be applied on numeric literals.
 > Therefore, any SUM that aggregates over a set of values that contains a
 > non-numeric type will result in a type error. Not even an extension of the
 > SPARQL operator table in section 17.3 will help, as SUM is not defined in terms
 > of those operators.

That is correct. However, the Sum() set function may be extended in the same way as the + operator, since any argument combination that an extension might cover currently results in a type error.

We have added some text to the document (section 17.6 of the current draft) making it clearer that any function or operator which returns a type error may be extended by implementations. This has become less clear since the 1.0 version.
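
As a rough illustration only (a Python sketch, not part of the specification), the recursive Sum definition quoted above behaves roughly as follows. The names SparqlTypeError, numeric_add and sparql_sum are made up for this sketch, and Python ints and floats stand in for numeric literals while strings stand in for plain literals:

```python
class SparqlTypeError(Exception):
    """Stands in for a SPARQL type error (hypothetical name)."""


def numeric_add(a, b):
    # Rough model of op:numeric-add: only numeric operands are accepted.
    for v in (a, b):
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            raise SparqlTypeError(f"non-numeric operand: {v!r}")
    return a + b


def sparql_sum(values):
    # Sum(S) = 0 when card[S] = 0,
    # otherwise op:numeric-add(S1, Sum(S2..n)).
    if not values:
        return 0
    return numeric_add(values[0], sparql_sum(values[1:]))


print(sparql_sum([1, 2, 3]))      # all numeric, so the sum is defined

try:
    sparql_sum(["1", 2])          # the values of :a in the example data
except SparqlTypeError as exc:
    print("type error:", exc)
```

An extension of Sum() in the sense described above would only change what happens in the branch that currently raises the error.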

 > In other words, if we have the following data:
 >
 > :a rdf:value "1" .
 > :a rdf:value "2"^^xsd:integer .
 > :b rdf:value "3"^^xsd:integer .
 >
 > And the following query:
 >
 > SELECT (SUM(?val) as ?value)
 > WHERE {
 >   ?a rdf:value ?val .
 > } GROUP BY ?a
 >
 > The result will be always a type error.

Correct. That was felt to be the most generally helpful option by the Working Group. There's a tradeoff between returning an incorrect, and potentially dangerous, result and trying to handle poor-quality data. It was felt that the safest option, given that people may be using SPARQL to process financial data, was to be conservative in what values the aggregates accept.

If you wish all values to be treated as integers, then it's possible to use

SUM(xsd:integer(?val))

and if you wish to ignore things which can't be interpreted as integers then

SUM(COALESCE(xsd:integer(?val), 0))

will ignore values which can't be cast. In the most forgiving scenario, an expression like:

SUM(IF(isNumeric(?val), ?val, COALESCE(xsd:double(?val), 0)))

will accept anything that is, or can be cast to a numeric value, and ignore things that can't be.
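
To make the difference between the first two expressions concrete, here is a simplified Python model of them. It is a sketch under assumptions, not SPARQL semantics in full: CastError, xsd_integer and integer_or_zero are hypothetical names, the cast only handles ints and digit strings, and the "n/a" value is invented for the example.

```python
import re


class CastError(Exception):
    """Stands in for the error raised by a failed cast (hypothetical name)."""


def xsd_integer(val):
    # Simplified model of xsd:integer(?val): ints pass through, and
    # strings must be an optional sign followed by ASCII digits.
    if isinstance(val, int) and not isinstance(val, bool):
        return val
    if isinstance(val, str) and re.fullmatch(r"[+-]?[0-9]+", val):
        return int(val)
    raise CastError(repr(val))


def integer_or_zero(val):
    # Models COALESCE(xsd:integer(?val), 0): fall back to 0 when the cast errors.
    try:
        return xsd_integer(val)
    except CastError:
        return 0


group_a = ["1", 2]          # the values of :a in the example data
messy = ["1", 2, "n/a"]     # hypothetical extra value that cannot be cast

# SUM(xsd:integer(?val)): every value must be castable.
print(sum(xsd_integer(v) for v in group_a))

# SUM(COALESCE(xsd:integer(?val), 0)): uncastable values contribute 0.
print(sum(integer_or_zero(v) for v in messy))
```

In this model the first form still fails on the "n/a" value, while the second form lets it contribute 0 to the total.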

 > I would argue that having the same extensibility mechanisms available for SUM
 > as we have for, for example, the + operator would be preferable. That way,
 > implementations wanting to offer a more forgiving version of the SUM operator
 > (one which silently ignores the non-numerics, for example), could do so while
 > staying spec-compliant.

There is a potential conflict with scalar extensions to the + operator, which may not be transitive, e.g.
http://www.w3.org/TR/xmlschema-2/#adding-durations-to-dateTimes
where xsd:dateTime + xsd:duration → xsd:dateTime

So, on balance we feel that it's probably better to keep extension possibilities for Sum() and + separate.
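
A loose analogy in Python (its datetime and timedelta types, not the XSD ones) shows the kind of mismatch involved. fold_sum is a hypothetical helper that mirrors the shape of the Sum definition, including the final addition of 0:

```python
from datetime import datetime, timedelta

start = datetime(2011, 10, 19, 15, 38, 17)
hour = timedelta(hours=1)

# As a scalar operation this is fine: a date-time plus a duration
# gives another date-time.
print(start + hour)


def fold_sum(values):
    # Same shape as the Sum definition: add(S1, add(S2, ... add(Sn, 0))).
    result = 0
    for v in reversed(values):
        result = v + result
    return result


print(fold_sum([1, 2, 3]))    # plain numbers fold without trouble

try:
    fold_sum([start, hour])   # the innermost step is hour + 0
except TypeError as exc:
    print("fold fails:", exc)
```

A + extension that makes sense for a pair of scalars does not automatically give a sensible aggregate once it is folded over a list with a numeric zero at the end.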

Please respond indicating whether you feel this response has answered your query.

Regards,
   Steve, on behalf of the SPARQL Working Group.

-- 
Steve Harris, CTO, Garlik Limited
1-3 Halford Road, Richmond, TW10 6AW, UK
+44 20 8439 8203  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD

Received on Wednesday, 19 October 2011 14:38:43 UTC