Re: ACTION-24: aggregate functions with multiple answers from Steve Harris on 2009-05-12 (public-rdf-dawg@w3.org from April to June 2009)

From: Steve Harris <steve.harris@garlik.com>
Date: Tue, 12 May 2009 10:44:08 +0100
To: "Seaborne, Andy" <andy.seaborne@hp.com>
Cc: Lee Feigenbaum <lee@thefigtrees.net>, SPARQL Working Group <public-rdf-dawg@w3.org>
Message-Id: <A5255BC2-5CBF-4A11-B516-BA95AAF9D599@garlik.com>

On 11 May 2009, at 16:47, Seaborne, Andy wrote:
>
>
>> -----Original Message-----
>> From: Lee Feigenbaum [mailto:figtree@gmail.com] On Behalf Of Lee
>> Feigenbaum
>> Sent: 11 May 2009 16:28
>> To: Seaborne, Andy
>> Cc: SPARQL Working Group
>> Subject: Re: ACTION-24: aggregate functions with multiple answers
>>
>> Seaborne, Andy wrote:
>>> ACTION-24: Explain potential design regarding aggregate functions with
>>> multiple answers for mixed datatypes re ISSUE-16
>>>
>>>
>>> Aggregate functions like MIN, MAX, SUM - in fact, most except COUNT -
>>> operate on the value space of a variable binding or the value of an
>>> expression, which is a value. Even COUNT of terms (COUNT(?x)) needs to
>>> deal with unbound variables (by skipping?).
>>>
>>> SUM(?x) requires that ?x is numeric, presumably according to the type
>>> promotion rules for XSD arithmetic operations.
>>>
>>> http://www.w3.org/TR/xpath-functions/#op.numeric
>>>
>>> What happens if SUM encounters a non-numeric value, such as a string or
>>> date or unbound?  Because SUM works on a single value space, simply
>>> ignoring nonsensical values is a possible design.
>>>
>>> But MIN and MAX are different in that they have answers in different
>>> value spaces.  MIN over numbers gives a number, MIN over dateTimes
>>> gives a dateTime, etc.
>>>
>>> Data in RDF can be of mixed datatypes: experience with data,
>>> especially combined from different sources, shows that representations
>>> can vary.  In one place a dc:date property might be an XSD date, but
>>> elsewhere it might be a string (all too common).
>>>
>>> If the type for the MIN operation is known by the application, then it
>>> can explicitly cast: e.g. MIN(xsd:date(?x)).  But we also need to
>>> consider what happens when the application does not force the datatype.
>>> MIN and MAX need to deal with incompatible data.
>>>
>>> Choices for dealing with this include:
>>>
>>> 1/ The value space for MIN is the value space of the first encountered
>>> datatype and everything incompatible is ignored.
>>>
>>> 2/ The value space has to be given - there is no single "MIN" operation:
>>> e.g. MIN(xsd:dateTime, ?x)
>>>
>>> 3/ There is one answer per group for each datatype encountered in the
>>> group.  This means multiple rows per group.
>>>
>>> 4/ Error.  No query results at all.
>>
>> Thanks, Andy. I'd like to suggest that there's a 5th option as well
>> (which is what Glitter currently does):
>>
>> 5/ MIN and MAX are defined as per ORDER BY in the existing spec (
>> http://www.w3.org/TR/rdf-sparql-query/#modOrderBy ) - for ORDER BY, the
>> spec augments the '<' operator with a relative ordering of types of RDF
>> terms. This does not provide a total ordering, and the spec explicitly
>> says that orderings in the unspecified cases are undefined.
>>
>> Effectively, (5) is saying to define MIN(?x) as the value of ?x in the
>> solution given by processing the group of solutions via ORDER BY ASC(?x)
>> LIMIT 1.
>>
>> Lee
>
> Hi Lee,
>
> That can be added for URI vs literal etc.  My examples are all between
> literals in different value spaces where "<" can't sensibly be extended,
> like string and xsd:dateTime.
>
> What does Glitter do in this case?  Does it give an answer?  An error?
>
> ARQ's ORDER BY does in fact always impose a total ordering (considering
> the spelling of datatype IRIs and lexical forms if necessary).  I don't
> think that is helpful here (aggregates) because if a stray type occurs
> it can mask the expected answer.

My systems give a total ordering too. I suspect many do.
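
To make the masking concern concrete, here is a rough Python sketch of option 5 on top of a total ordering. It is not how ARQ or any other system actually orders terms: the Literal type, the sort key (datatype IRI spelling first, then lexical form) and the sample data are all assumptions made up for illustration.

```python
from typing import NamedTuple, Optional, Sequence

XSD = "http://www.w3.org/2001/XMLSchema#"


class Literal(NamedTuple):
    """Hypothetical stand-in for a typed RDF literal."""
    lexical: str
    datatype: str  # datatype IRI


def total_order_key(lit: Literal):
    # Assumed tie-break: datatype IRI spelling, then lexical form.
    # Plain string comparison only; no value-space comparison is attempted.
    return (lit.datatype, lit.lexical)


def min_by_order(group: Sequence[Literal]) -> Optional[Literal]:
    # Option 5: MIN(?x) is what ORDER BY ASC(?x) LIMIT 1 would return.
    ordered = sorted(group, key=total_order_key)
    return ordered[0] if ordered else None


def max_by_order(group: Sequence[Literal]) -> Optional[Literal]:
    # MAX(?x) is the last value under the same ordering.
    ordered = sorted(group, key=total_order_key)
    return ordered[-1] if ordered else None


group = [
    Literal("2009-05-12", XSD + "date"),
    Literal("2009-01-01", XSD + "date"),
    Literal("12 May 2009", XSD + "string"),  # stray type in the data
]

print(min_by_order(group))
print(max_by_order(group))
```

Under this particular key the MIN is still a date, but the MAX is the stray string rather than the latest date, which is the masking effect Andy describes. A different but equally valid total ordering could mask the MIN instead.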

If you want to limit the value space, you can always explicitly cast.
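
As an illustration of what casting buys you, here is a small Python analogue of MIN(xsd:date(?x)). It assumes that a value which fails the cast is simply skipped by the aggregate, which is one possible design and not something the group has decided; the helper names are hypothetical.

```python
from datetime import date
from typing import Iterable, Optional


def cast_to_date(lexical: str) -> Optional[date]:
    """Return a date for an ISO 8601 lexical form, or None if the cast fails."""
    try:
        return date.fromisoformat(lexical)
    except ValueError:
        return None


def min_date(lexicals: Iterable[str]) -> Optional[date]:
    # Analogue of MIN(xsd:date(?x)); failed casts are skipped (an assumption).
    dates = [d for d in map(cast_to_date, lexicals) if d is not None]
    return min(dates) if dates else None


values = ["2009-05-12", "12 May 2009", "2009-01-01"]
print(min_date(values))  # 2009-01-01
```

Once everything is forced into one value space, the stray string can no longer compete with the dates, whatever ordering the store uses between datatypes.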

My preferences are 5 or 4, plus encouraging people to cast, just as they
should when using < and co. on data found in the wild.

- Steve

-- 
Steve Harris
Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
+44(0)20 8973 2465  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD

Received on Tuesday, 12 May 2009 09:44:47 UTC