Re: Semantics of SUM from Steve Harris on 2009-11-13 (public-rdf-dawg@w3.org from October to December 2009)

From: Steve Harris <steve.harris@garlik.com>
Date: Fri, 13 Nov 2009 15:47:06 +0000
To: Andy Seaborne <andy.seaborne@talis.com>
Cc: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Message-Id: <1CA51700-66AE-4A19-9750-976A2C9F11D1@garlik.com>
On 13 Nov 2009, at 11:47, Andy Seaborne wrote:
>>>>> but
>>>>>
>>>>> SUM(COALESCE(xsd:double(?x), 0) | ?x{1,2,3} ) => "6"^^xsd:double
>>>
>>> Using
>>> SUM(COALESCE(xsd:double(?x), 0))
>>> for group members ?x=1,2,3 (xsd:integers) results in xsd:double, not
>>> xsd:integer as it would for 1+2+3
>>
>> No, sure, but COALESCE(?x, 0) would in this case.
>
> But then we'd loose the strings->integers.

Sure, there are may options depending on exactly what behaviour you  
want.

> In SQL, the standard aggregates ignore nulls so my assumption has  
> been that SPARQL would have been similar which I was assuming was  
> ignore unbound.  Because evaluating ?x, when ?x is unbound, is an  
> error, I assumed that meant ignore error was the consistent paradigm.

There is a difference. NULLs are not passed down to the aggregate  
operator (except in the case of COUNT(*)), whereas type errors can  
occur during evaluation.

Obviously SPARQL does not have to follow the SQL model, but we  
probably should unless there's a good reason.

>>>>> Using xsd:integer does not work e.g. 1, 1.5, 3 !=>  
>>>>> "5.5"^^xsd:decimal
>>>
>>> Encoding the full XSD hierarchy to minimise the promotion as + used
>>> natively does, and cope with errors/non-numbers might be:
>>>
>>> COALESCE(xsd:integer(?x), xsd:decimal(?x), xsd:double(?x), 0)
>>>
>>> which I don't think is a serious contender on practical grounds.
>>
>> No, but I don't imagine there are too many situations where users  
>> would
>> want the exact semantics, do you have good examples?
>
> Maybe not perfect semantics but adding integers should produce  
> integers.  My experience is that users don't always see that it's  
> the value that matter.

Sure, but there are also usecases where you don't know what types will  
be used (if any), and you want to know if there were any type errors.

> Promoting to doubles loose precision eventually.  Finance  
> applications may care.

Sure, we have decimal for financial apps.

>>> I think ending up with xsd:double, not xsd:integer, in the potential
>>> present of errors, is significant so maybe a way to indicate that
>>> errors be excluded might be a better. However, a plethora of options
>>> is also bad design. Hmm.
>>
>> I think you could do that with COALESCE(?x), if none of the values
>> passed to COALESCE are not type errors or bound, you get an unbound
>> value returned, so it will be dropped. I think.
>
> Dropped? But the semantics are to use + and "unbound+anything" is an  
> error, which then invalidates the whole SUM.

Depending on whether we say that the AGG() function removes unbound  
values before passing to the expression in it's argument.

> Unbound and error need to be treated the same way because  
> eval(unbound variable) is an error.  (BOUND(?x) is special - it only  
> applies to variables).

Or, they need to be treated differently, because people might care  
about the difference?

> If errors are just dropped in SUM, it all works out.  nulls are  
> dropped in SQL SUM.

In SQL, NULLs are dropped everywhere except COUNT(*), before the  
aggregate function is invoked.

[examples use MySQL]

 > SELECT * FROM C;
+------+
| lex  |
+------+
| 5    |
| 23   |
| NULL |
+------+

 > SELECT COUNT(*) FROM C;
+----------+
| COUNT(*) |
+----------+
|        3 |
+----------+

 > SELECT COUNT(lex) FROM C;
+------------+
| COUNT(lex) |
+------------+
|          2 |
+------------+

 > SELECT SUM(CAST(lex AS unsigned)) FROM C;
+----------------------------+
| SUM(CAST(lex AS unsigned)) |
+----------------------------+
|                         28 |
+----------------------------+
[note: no warnings]

- Steve
Received on Friday, 13 November 2009 15:47:35 UTC