RE: ACTION-24: aggregate functions with multiple answers from Seaborne, Andy on 2009-05-11 (public-rdf-dawg@w3.org from April to June 2009)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Mon, 11 May 2009 15:47:19 +0000
To: Lee Feigenbaum <lee@thefigtrees.net>
CC: SPARQL Working Group <public-rdf-dawg@w3.org>
Message-ID: <B6CF1054FDC8B845BF93A6645D19BEA362D1E543C1@GVW1118EXC.americas.hpqcorp.net>


> -----Original Message-----
> From: Lee Feigenbaum [mailto:figtree@gmail.com] On Behalf Of Lee
> Feigenbaum
> Sent: 11 May 2009 16:28
> To: Seaborne, Andy
> Cc: SPARQL Working Group
> Subject: Re: ACTION-24: aggregate functions with multiple answers
> 
> Seaborne, Andy wrote:
> > ACTION-24: Explain potential design regarding aggregate functions with
> multiple answers for mixed datatypes re ISSUE-16
> >
> >
> > Aggregate functions like MIN, MAX, SUM - in fact, most except COUNT -
> operate on the value space of a variable binding or the value of an
> expression, which is a value. Even COUNT of terms (COUNT(?x)) needs to
> deal with unbound variables (by skipping?).
> >
> > SUM(?x) requires that ?x is numeric, presumably according to the type
> promotion rules for XSD arithmetic operations.
> >
> > http://www.w3.org/TR/xpath-functions/#op.numeric

> >
> > What happens if SUM encounters a non-numeric value, such as a string or
> date or unbound?  Because SUM works on a single value space, simply
> ignoring nonsensical values is a possible design.
> >
> > But MIN and MAX are different in that they have answers in different
> value spaces.  MIN over numbers gives a number, MIN over dateTimes
> gives a dateTime, etc.
> >
> > Data in RDF can be of mixed datatypes: experience with data,
> especially combined from different sources, shows that representations
> can vary.  In one place a dc:date property might be an XSD date, but
> elsewhere it might be a string (all too common).
> >
> > If the type for the MIN operation is known by the application, then it
> can explicitly cast: e.g. MIN(xsd:date(?x)).  But we also need to
> consider what happens when the application does not force the datatype.
> MIN and MAX need to deal with incompatible data.
> >
> > Choices for dealing with this include:
> >
> > 1/ The value space for MIN is the value space of the first encountered
> datatype and everything incompatible is ignored.
> >
> > 2/ The value space has to be given - there is no single "MIN"
> operation:
> > e.g. MIN(xsd:dateTime, ?x)
> >
> > 3/ There is one answer per group for each datatype encountered in the
> group.  This means multiple rows per group.
> >
> > 4/ Error.  No query results at all.
> 
> Thanks, Andy. I'd like to suggest that there's a 5th option as well
> (which is what Glitter currently does):
> 
> 5/ MIN and MAX are defined as per ORDER BY in the existing spec (
> http://www.w3.org/TR/rdf-sparql-query/#modOrderBy ) - for ORDER BY, the
> spec augments the '<' operator with a relative ordering of types of RDF
> terms. This does not provide a total ordering, and the spec. explicitly
> says that orderings in the unspecified cases are undefined.
> 
> Effectively, (5) is saying to define MIN(?x) as the value of ?x in the
> solution given by processing the group of solutions via ORDER BY ASC(?x)
> LIMIT 1.
> 
> Lee

Hi Lee,

That can be added for URI vs literal etc.  My examples are all between literals in different value spaces where "<" can't sensibly be extended, such as string and xsd:dateTime.

What does Glitter do in this case?  Does it give an answer?  An error?

ARQ's ORDER BY does in fact always impose a total ordering (considering the spelling of datatype IRIs and lexical forms if necessary).  I don't think that is helpful here (aggregates) because if a stray type occurs it can mask the expected answer.
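
A minimal sketch of the masking concern, in Python rather than SPARQL.  It assumes literals are modelled as (datatype, lexical form) pairs and that the total order simply compares the datatype name and then the lexical form; the helper names are hypothetical and this is not ARQ's actual code.

```python
# Each literal is modelled as a (datatype, lexical form) pair.
# Assumed total order: compare datatype names first, then lexical forms.
def total_order_key(literal):
    datatype, lexical = literal
    return (datatype, lexical)


def min_by_total_order(literals):
    """MIN defined as the first item under the assumed total order."""
    return min(literals, key=total_order_key)


dates = [
    ("xsd:dateTime", "2009-03-01T00:00:00Z"),
    ("xsd:dateTime", "2009-01-15T00:00:00Z"),
    ("xsd:date", "2009-12-31"),  # stray value of a different datatype
]

# The stray xsd:date sorts before every xsd:dateTime, so it wins.
masked = min_by_total_order(dates)

# What the application probably wanted: the smallest xsd:dateTime.
expected = min_by_total_order([d for d in dates if d[0] == "xsd:dateTime"])

print(masked)
print(expected)
```

Under these assumptions the one stray xsd:date is returned as the minimum, although it is later than both dateTimes.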

 Andy

> 
> 
> >
> > Even in (3), literals of unknown (not understood by this process)
> datatype, and unbounds, would be ignored.  Warnings are up to the
> implementation, but the results are the same for all processors.
> >
> > (1) has the unfortunate effect that the answer can change depending on
> the order in which data is encountered, so isn't fixed even for a single
> query processor.
> >
> > (4) is hard for scaling - the error may be encountered at the end of
> the data when some results were ready much earlier but can't be sent
> until the query is known to be successful.  This is an effect of HTTP
> requiring the return code first - "200 OK" is seen as promising results,
> not an error halfway through.
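
To make the difference between (1) and (3) concrete, here is a rough Python sketch.  It assumes a group is a list of (datatype, value) pairs and that values of the same datatype can be compared directly; the function names are hypothetical, and unbound or unknown-datatype values are left out for brevity.

```python
def min_first_datatype(literals):
    """Option (1): the first datatype seen fixes the value space; others are skipped."""
    chosen_type = None
    best = None
    for datatype, value in literals:
        if chosen_type is None:
            chosen_type = datatype
        if datatype != chosen_type:
            continue  # incompatible with the chosen value space
        if best is None or value < best:
            best = value
    return None if chosen_type is None else (chosen_type, best)


def min_per_datatype(literals):
    """Option (3): one minimum for each datatype encountered in the group."""
    result = {}
    for datatype, value in literals:
        if datatype not in result or value < result[datatype]:
            result[datatype] = value
    return result


group = [("xsd:integer", 7), ("xsd:string", "abc"), ("xsd:integer", 3)]
reordered = [group[1], group[0], group[2]]  # same data, different order

first_a = min_first_datatype(group)
first_b = min_first_datatype(reordered)
per_type_a = min_per_datatype(group)
per_type_b = min_per_datatype(reordered)

print(first_a, first_b)          # option (1) depends on encounter order
print(per_type_a == per_type_b)  # option (3) does not
```

In this sketch (1) gives a different answer for the same data in a different order, while (3) gives the same set of per-datatype minima, at the cost of more than one row per group.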
> >
> >  Andy
> >
> >
> > --------------------------------------------
> >   Hewlett-Packard Limited
> >   Registered Office: Cain Road, Bracknell, Berks RG12 1HN
> >   Registered No: 690597 England
> >
Received on Monday, 11 May 2009 15:48:30 UTC