ACTION-24: aggregate functions with multiple answers from Seaborne, Andy on 2009-05-11 (public-rdf-dawg@w3.org from April to June 2009)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Mon, 11 May 2009 14:36:00 +0000
To: SPARQL Working Group <public-rdf-dawg@w3.org>
Message-ID: <B6CF1054FDC8B845BF93A6645D19BEA362D1E54359@GVW1118EXC.americas.hpqcorp.net>

ACTION-24: Explain potential design regarding aggregate functions with multiple answers for mixed datatypes re ISSUE-16

Aggregate functions like MIN, MAX and SUM - in fact, most except COUNT - operate on values: the value of a variable binding or the value of an expression. Even COUNT of terms (COUNT(?x)) needs to deal with unbound variables (by skipping?).

SUM(?x) requires that ?x is numeric, presumably according to the type promotion rules for XSD arithmetic operations.

http://www.w3.org/TR/xpath-functions/#op.numeric

What happens if SUM encounters a non-numeric value, such as a string or date or unbound? Because SUM works on a single value space, simply ignoring nonsensical values is a possible design.
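As a rough illustration of the "ignore nonsensical values" design for SUM, here is a minimal Python sketch. It is not a proposal for the spec text: the function name lenient_sum is made up, Python int/float stand in for XSD numeric literals, str stands in for string literals, and None stands in for an unbound variable.

```python
def lenient_sum(bindings):
    """Sum the numeric bindings and silently skip everything else."""
    total = 0
    for value in bindings:
        # bool is a subclass of int in Python; exclude it so that a
        # boolean stand-in is not summed as 0 or 1.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            total += value
        # Strings, dates, None (unbound) and anything else are ignored.
    return total


# Mixed input: only 1, 2.5 and 4 contribute to the result.
result = lenient_sum([1, 2.5, "three", None, 4])
```

Because every contributing value lives in the one numeric value space, the answer does not depend on the order in which the bindings are seen.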

But MIN and MAX are different in that they have answers in different value spaces. MIN over numbers gives a number, MIN over dateTimes gives a dateTime, etc.

Data in RDF can be of mixed datatypes: experience with data, especially combined from different sources, shows that representations can vary. In one place a dc:date property might be an XSD date, but elsewhere it might be a string (all too common).

If the type for the MIN operation is known by the application, then it can explicitly cast: e.g. MIN(xsd:date(?x)). But we also need to consider what happens when the application does not force the datatype. MIN and MAX need to deal with incompatible data.
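The explicit-cast case, MIN(xsd:date(?x)), can be pictured with the following Python sketch. The helper names cast_to_date and min_date are hypothetical, and the sketch assumes that a failed cast is skipped rather than aborting the query, which is one possible behaviour and not a settled one.

```python
from datetime import date


def cast_to_date(value):
    """Rough analogue of xsd:date(?x): return a date, or None if the cast fails."""
    if isinstance(value, date):
        return value
    if isinstance(value, str):
        try:
            return date.fromisoformat(value)
        except ValueError:
            return None
    return None  # unbound (None) or any other kind of value


def min_date(bindings):
    """MIN over the values that survive the cast; None if nothing does."""
    dates = [d for d in map(cast_to_date, bindings) if d is not None]
    return min(dates) if dates else None


# A dc:date-like column mixing real dates with strings, as in merged data.
earliest = min_date([date(2009, 5, 11), "2009-04-01", "not a date", None])
```

Here the cast pulls the string "2009-04-01" into the date value space, so it can be compared with the typed date.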

Choices for dealing with this include:

1/ The value space for MIN is the value space of the first encountered datatype and everything incompatible is ignored.

2/ The value space has to be given - there is no single "MIN" operation:
e.g. MIN(xsd:dateTime, ?x)

3/ There is one answer per group for each datatype encountered in the group. This means multiple rows per group.

4/ Error. No query results at all.

Even in (3), literals of unknown (not understood by this process) datatype, and unbounds, would be ignored. Warnings are up to the implementation, but the results are the same for all processors.
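Choice (3) could look something like the Python sketch below for a single group. The names value_space and min_per_value_space are invented for illustration, and the mapping from Python types to value spaces is only a stand-in for real datatype handling.

```python
from collections import defaultdict
from datetime import date, datetime


def value_space(value):
    """Name the value space a value belongs to, or None if it is to be ignored."""
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return "numeric"
    if isinstance(value, datetime):  # check before date: datetime subclasses date
        return "dateTime"
    if isinstance(value, date):
        return "date"
    if isinstance(value, str):
        return "string"
    return None  # unbound, or a datatype this process does not understand


def min_per_value_space(bindings):
    """One MIN answer per value space encountered in the group."""
    buckets = defaultdict(list)
    for value in bindings:
        space = value_space(value)
        if space is not None:
            buckets[space].append(value)
    return {space: min(values) for space, values in buckets.items()}


rows = min_per_value_space(
    [3, "b", 1.5, date(2009, 5, 11), None, "a", date(2009, 1, 1)]
)
```

Each key in the result corresponds to one row for the group, so a group with three datatypes yields three rows.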

(1) has the unfortunate effect that the answer can change depending on the order data is encountered, so isn't fixed even for a single query processor.
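The order dependence of (1) shows up in even a tiny Python sketch. The function min_first_seen is hypothetical, and only two value spaces (numbers and strings) are modelled to keep it short.

```python
def value_space(value):
    """Name the value space of a value, or None if it is to be ignored."""
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return "numeric"
    if isinstance(value, str):
        return "string"
    return None


def min_first_seen(bindings):
    """MIN in the value space of the first usable value; the rest is ignored."""
    chosen_space = None
    best = None
    for value in bindings:
        space = value_space(value)
        if space is None:
            continue
        if chosen_space is None:
            chosen_space, best = space, value
        elif space == chosen_space and value < best:
            best = value
    return best


# Same data, different encounter order, different answer.
numbers_first = min_first_seen([5, "apple", 2])
strings_first = min_first_seen(["apple", 5, 2])
```

With the number seen first the answer is 2; with the string seen first the answer is "apple", although the data is identical.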

(4) is hard for scaling - the error may be encountered at the end of the data when some results were ready much earlier but can't be sent until the query is known to be successful. This is an effect of HTTP requiring the return code first - "200 OK" is seen as promising results, not an error halfway through.
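The buffering problem with (4) can be sketched in Python as follows. Everything here is illustrative: strict_min_rows and respond are made-up names, groups are assumed to be non-empty, and the status codes are only a model of "decide success before sending anything".

```python
def kind(value):
    """Crude value-space label: ints and floats are both 'numeric'."""
    if isinstance(value, (int, float)):
        return "numeric"
    return type(value).__name__


def strict_min_rows(groups):
    """Yield (group key, MIN) per group; raise on a group with mixed datatypes."""
    for key, values in groups:
        if len({kind(v) for v in values}) > 1:
            raise TypeError(f"mixed datatypes in group {key!r}")
        yield key, min(values)


def respond(groups):
    """Model of an HTTP response: the status must be known before any row is sent."""
    try:
        # Every row has to be buffered, because a late group may still fail.
        rows = list(strict_min_rows(groups))
    except TypeError:
        return 500, []
    return 200, rows


ok = respond([("g1", [3, 1]), ("g2", [5, 4])])
failed = respond([("g1", [3, 1]), ("g2", [2, "x"])])
```

In the failing case the row for g1 was computable immediately, but it cannot be streamed out because the error in g2 only appears later.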

Andy

--------------------------------------------
Hewlett-Packard Limited
Registered Office: Cain Road, Bracknell, Berks RG12 1HN
Registered No: 690597 England

Received on Monday, 11 May 2009 14:36:53 UTC