RE: Order of applying solution sequence modifiers from David Booth on 2011-09-15 (public-rdf-dawg-comments@w3.org from September 2011)

From: David Booth <david@dbooth.org>
Date: Wed, 14 Sep 2011 21:01:25 -0400
To: Stephen Allen <sallen@bbn.com>
Cc: 'public-rdf-dawg-comments' <public-rdf-dawg-comments@w3.org>
Message-ID: <1316048485.2882.83239.camel@dbooth-laptop>
Ah yes, I see what you mean.  It is fairly common to want to ORDER by
variables that are not projected, so that would constrain ORDER to be
done before projection.

I withdraw my suggestion.  Thanks for your help!

David


On Wed, 2011-09-14 at 18:18 -0400, Stephen Allen wrote:
> Hi David,
> 
> Unfortunately, solution sequence modifiers cannot be rearranged as you
> suggest.  The Order modifier must appear before Projection in order to
> have access to all variables in the where clause.  Distinct and
> Reduced must appear after Projection since that can quite easily
> introduce duplicates.
> 
> As you note, the implementation is free to reorder these operations if
> it can produce the same solution sequence.  Implementations typically
> do this for query optimization purposes.
> 
> In the particular instance you cite, if the optimizer can determine a
> priori that there are likely to be a large number of duplicates AND
> there is no projection, it can perform the distinct before the order.
> However, it is not clear to me that an optimizer could easily
> determine the number of duplicates ahead of time.  Indeed, if the
> number of duplicates were a small proportion of the total number of
> bindings, this optimization could be more expensive, since it could
> otherwise optimize distinct(order(...)) as reduced(order(...)).
> 
> Speaking more directly about Parliament, it uses Jena's ARQ to perform
> query execution.  That is slated to receive an update at some point in
> the future to perform non-memory bound sorting [1].  If you are
> interested in query optimization in ARQ and Parliament, I recommend
> you browse [2], [3], [4] and [5].
> 
> -Stephen
> 
> [1] https://issues.apache.org/jira/browse/JENA-44
> [2] https://issues.apache.org/jira/browse/JENA-89
> [3] https://issues.apache.org/jira/browse/JENA-90
> [4] https://issues.apache.org/jira/browse/JENA-109
> [5] https://issues.apache.org/jira/browse/JENA-111
> 
> 
>  
> 
> > -----Original Message-----
> > From: public-rdf-dawg-comments-request@w3.org [mailto:public-rdf-dawg-
> > comments-request@w3.org] On Behalf Of David Booth
> > Sent: Wednesday, September 14, 2011 5:23 PM
> > To: public-rdf-dawg-comments
> > Subject: Order of applying solution sequence modifiers
> > 
> > Sec 15 Solution Sequences and Modifiers
> > http://www.w3.org/TR/sparql11-query/#solutionModifiers
> > says that the ORDER modifier is applied before the DISTINCT modifier.
> > I
> > suggest that this be changed to say that ORDER is applied after
> > REDUCED.
> > This would be a very simple change to the spec -- moving one line in
> > sec
> > 15.  AFAICT it would not affect the semantics, but I think it would
> > help
> > implementers.
> > 
> > The reason I suggest this is that an implementation performing the
> > ORDER
> > before DISTINCT must generate *all* (non-distinct) solutions in order
> > to
> > sort them, in spite of the fact that the ordering of the solutions can
> > never affect the elimination of those that are non-distinct.  AFAICT it
> > would be much more efficient to perform the ORDER *after* performing
> > DISTINCT (or REDUCED).
> > 
> > For example, there may be 100 million solutions but only 20 distinct
> > solutions.  If SORT is done first, then all 100 million solutions must
> > be held in memory at once and sorted, before later reducing the result
> > set to 20.  But if DISTINCT is done first, then duplicate solutions can
> > be discarded *as* they are generated, so only 20 need to be held in
> > memory.
> > 
> > Granted, since an implementation is free to optimize things however it
> > wishes anyway -- provided that it still produces the correct results --
> > it could perform the ORDER modifier after the DISTINCT and REDUCED
> > modifiers even if the spec says to do them the other way around.  So in
> > that sense there is no absolute need to change the order of these
> > modifiers in the SPARQL specification.  However, since most
> > implementations will first want to ensure correctness and later
> > optimize, I think it would be helpful to change the SPARQL
> > specification
> > to say that ORDER is done after REDUCED.   Indeed, the reason I ran
> > into
> > this is that the implementation that I'm currently using (Parliament
> > 2.7.1) appears to be doing the SORT first (per the spec), thus causing
> > a
> > query that I thought should have run quite efficiently to instead run
> > the system out of memory.
> > 
> > Thanks!
> > 
> > 
> > --
> > David Booth, Ph.D.
> > http://dbooth.org/
> > 
> > Opinions expressed herein are those of the author and do not
> > necessarily
> > reflect those of his employer.
> 
> 
> 
> 

-- 
David Booth, Ph.D.
http://dbooth.org/

Opinions expressed herein are those of the author and do not necessarily
reflect those of his employer.
Received on Thursday, 15 September 2011 01:02:01 UTC