Re: ACTION-24: aggregate functions with multiple answers from Lee Feigenbaum on 2009-05-11 (public-rdf-dawg@w3.org from April to June 2009)

From: Lee Feigenbaum <lee@thefigtrees.net>
Date: Mon, 11 May 2009 11:27:52 -0400
To: "Seaborne, Andy" <andy.seaborne@hp.com>
CC: SPARQL Working Group <public-rdf-dawg@w3.org>
Message-ID: <4A0843F8.8020309@thefigtrees.net>

Seaborne, Andy wrote:
> ACTION-24: Explain potential design regarding aggregate functions with multiple answers for mixed datatypes re ISSUE-16
> 
> 
> Aggregate functions like MIN, MAX, SUM - in fact, most except COUNT - operate on the value space of a variable binding or the value of an expression, which is a value. Even COUNT of terms (COUNT(?x)) needs to deal with unbound variables (by skipping?).
> 
> SUM(?x) requires that ?x is numeric, presumably according to the type promotion rules for XSD arithmetic operations.
> 
> http://www.w3.org/TR/xpath-functions/#op.numeric
> 
> What happens if SUM encounters a non-numeric value, such as a string, a date, or an unbound variable?  Because SUM works on a single value space, simply ignoring nonsensical values is a possible design.
> 
> But MIN and MAX are different in that they can have answers in different value spaces.  MIN over numbers gives a number, MIN over dateTimes gives a dateTime, and so on.
> 
> Data in RDF can be of mixed datatypes: experience with data, especially combined from different sources, shows that representations can vary.  In one place a dc:date property might be an XSD date, but elsewhere it might be a string (all too common).
> 
> If the type for the MIN operation is known by the application, then it can explicitly cast: e.g. MIN(xsd:date(?x)).  But we also need to consider what happens when the application does not force the datatype.  MIN and MAX need to deal with incompatible data.
> 
> Choices for dealing with this include:
> 
> 1/ The value space for MIN is the value space of the first encountered datatype and everything incompatible is ignored.
> 
> 2/ The value space has to be given - there is no single "MIN" operation:
> e.g. MIN(xsd:dateTime, ?x)
> 
> 3/ There is one answer per group for each datatype encountered in the group.  This means multiple rows per group.
> 
> 4/ Error.  No query results at all.

Thanks, Andy. I'd like to suggest that there's a 5th option as well 
(which is what Glitter currently does):

5/ MIN and MAX are defined as per ORDER BY in the existing spec ( 
http://www.w3.org/TR/rdf-sparql-query/#modOrderBy ) - for ORDER BY, the 
spec augments the '<' operator with a relative ordering of types of RDF 
terms. This does not provide a total ordering, and the spec. explicitly 
says that orderings in the unspecified cases are undefined.

Effectively, (5) is saying to define MIN(?x) as the value of ?x in the 
solution given by processing the group of solutions via ORDER BY ASC(?x) 
LIMIT 1.
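
To make (5) concrete, here is a rough sketch in Python. It is not 
Glitter's code, only an illustration. The Term type, the coarse 
"datatype" labels and the function names are made up for the example. 
The relative order of unbound, blank nodes, IRIs and literals follows the 
ORDER BY section of the spec; ordering literals of different datatypes 
by datatype name is purely an assumption of this sketch, since the spec 
leaves that case undefined.

```python
from typing import List, NamedTuple, Optional, Union


class Term(NamedTuple):
    """Hypothetical, simplified RDF term used only for this sketch."""
    kind: str                      # "bnode", "iri", or "literal"
    value: Union[str, int, float]  # bnode label, IRI string, or literal value
    datatype: str = ""             # coarse label such as "numeric"; literals only


# Relative order of term kinds as given for ORDER BY: unbound is lowest,
# then blank nodes, then IRIs, then RDF literals.
KIND_RANK = {"unbound": 0, "bnode": 1, "iri": 2, "literal": 3}


def order_key(term: Optional[Term]):
    """Sort key approximating ORDER BY ASC(?x); None stands for unbound."""
    if term is None:
        return (KIND_RANK["unbound"], "", "")
    # ASSUMPTION: literals of different datatypes are ordered by datatype
    # label so that the result is deterministic. The spec leaves this
    # ordering undefined.
    return (KIND_RANK[term.kind], term.datatype, term.value)


def sparql_min(group: List[Optional[Term]]) -> Optional[Term]:
    """MIN(?x) as ORDER BY ASC(?x) LIMIT 1 over one group of bindings."""
    ordered = sorted(group, key=order_key)
    return ordered[0] if ordered else None


if __name__ == "__main__":
    group = [
        Term("literal", 7, "numeric"),
        Term("literal", "2009-05-11", "string"),
        Term("literal", 2.5, "numeric"),
    ]
    print(sparql_min(group))
```

Taken literally, this definition means that a group containing an unbound 
?x yields unbound for MIN(?x), because unbound sorts lowest; whether 
unbounds should instead be skipped is a separate decision.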

Lee


>  
> Even in (3), literals of unknown (not understood by this process) datatype, and unbounds, would be ignored.  Warnings are up to the implementation, but the results are the same for all processors.
> 
> (1) has the unfortunate effect that the answer can change depending on the order data is encountered, so isn't fixed even for a single query processor.
> 
> (4) is hard for scaling - the error may be encountered at the end of the data when some results were ready much earlier but can't be sent until the query is known to be successful.  This is an effect of HTTP requiring the return code first - "200 OK" is seen as promising results, not an error halfway through.
> 
>  Andy
> 
> 
> --------------------------------------------
>   Hewlett-Packard Limited
>   Registered Office: Cain Road, Bracknell, Berks RG12 1HN
>   Registered No: 690597 England
>

Received on Monday, 11 May 2009 15:28:38 UTC