Order of applying solution sequence modifiers from David Booth on 2011-09-14 (public-rdf-dawg-comments@w3.org from September 2011)

From: David Booth <david@dbooth.org>
Date: Wed, 14 Sep 2011 17:22:50 -0400
To: public-rdf-dawg-comments <public-rdf-dawg-comments@w3.org>
Message-ID: <1316035370.2882.80595.camel@dbooth-laptop>

Sec 15 Solution Sequences and Modifiers
http://www.w3.org/TR/sparql11-query/#solutionModifiers
says that the ORDER modifier is applied before the DISTINCT modifier.  I
suggest that this be changed to say that ORDER is applied after REDUCED.
This would be a very simple change to the spec -- moving one line in sec
15.  AFAICT it would not affect the semantics, but I think it would help
implementers.

The reason I suggest this is that an implementation performing the ORDER
before DISTINCT must generate *all* (non-distinct) solutions in order to
sort them, in spite of the fact that the ordering of the solutions can
never affect the elimination of those that are non-distinct.  AFAICT it
would be much more efficient to perform the ORDER *after* performing
DISTINCT (or REDUCED).

For example, there may be 100 million solutions but only 20 distinct
solutions.  If SORT is done first, then all 100 million solutions must
be held in memory at once and sorted, before later reducing the result
set to 20.  But if DISTINCT is done first, then duplicate solutions can
be discarded *as* they are generated, so only 20 need to be held in
memory.

Granted, since an implementation is free to optimize things however it
wishes anyway -- provided that it still produces the correct results --
it could perform the ORDER modifier after the DISTINCT and REDUCED
modifiers even if the spec says to do them the other way around.  So in
that sense there is no absolute need to change the order of these
modifiers in the SPARQL specification.  However, since most
implementations will first want to ensure correctness and later
optimize, I think it would be helpful to change the SPARQL specification
to say that ORDER is done after REDUCED.   Indeed, the reason I ran into
this is that the implementation that I'm currently using (Parliament
2.7.1) appears to be doing the SORT first (per the spec), thus causing a
query that I thought should have run quite efficiently to instead run
the system out of memory.

Thanks!


-- 
David Booth, Ph.D.
http://dbooth.org/

Opinions expressed herein are those of the author and do not necessarily
reflect those of his employer.

Received on Wednesday, 14 September 2011 21:23:15 UTC