RE: Order of applying solution sequence modifiers from Stephen Allen on 2011-09-14 (public-rdf-dawg-comments@w3.org from September 2011)

From: Stephen Allen <sallen@bbn.com>
Date: Wed, 14 Sep 2011 18:18:19 -0400
To: "'public-rdf-dawg-comments'" <public-rdf-dawg-comments@w3.org>
Cc: "'David Booth'" <david@dbooth.org>
Message-ID: <010e01cc732c$35916ca0$a0b445e0$@com>
Hi David,

Unfortunately, solution sequence modifiers cannot be rearranged as you suggest.  The Order modifier must appear before Projection in order to have access to all variables in the where clause.  Distinct and Reduced must appear after Projection since that can quite easily introduce duplicates.

As you note, the implementation is free to reorder these operations if it can produce the same solution sequence.  Implementations typically do this for query optimization purposes.

In the particular instance you cite, if the optimizer can determine a priori that there are likely to be a large number of duplicates AND there is no projection, it can perform the distinct before the order.  However, it is not clear to me that an optimizer could easily determine the number of duplicates ahead of time.  Indeed, if the number of duplicates were a small proportion of the total number of bindings, this optimization could be more expensive, since it could otherwise optimize distinct(order(...)) as reduced(order(...)).

Speaking more directly about Parliament, it uses Jena's ARQ to perform query execution.  That is slated to receive an update at some point in the future to perform non-memory bound sorting [1].  If you are interested in query optimization in ARQ and Parliament, I recommend you browse [2], [3], [4] and [5].

-Stephen

[1] https://issues.apache.org/jira/browse/JENA-44
[2] https://issues.apache.org/jira/browse/JENA-89
[3] https://issues.apache.org/jira/browse/JENA-90
[4] https://issues.apache.org/jira/browse/JENA-109
[5] https://issues.apache.org/jira/browse/JENA-111


 

> -----Original Message-----
> From: public-rdf-dawg-comments-request@w3.org [mailto:public-rdf-dawg-
> comments-request@w3.org] On Behalf Of David Booth
> Sent: Wednesday, September 14, 2011 5:23 PM
> To: public-rdf-dawg-comments
> Subject: Order of applying solution sequence modifiers
> 
> Sec 15 Solution Sequences and Modifiers
> http://www.w3.org/TR/sparql11-query/#solutionModifiers
> says that the ORDER modifier is applied before the DISTINCT modifier.
> I
> suggest that this be changed to say that ORDER is applied after
> REDUCED.
> This would be a very simple change to the spec -- moving one line in
> sec
> 15.  AFAICT it would not affect the semantics, but I think it would
> help
> implementers.
> 
> The reason I suggest this is that an implementation performing the
> ORDER
> before DISTINCT must generate *all* (non-distinct) solutions in order
> to
> sort them, in spite of the fact that the ordering of the solutions can
> never affect the elimination of those that are non-distinct.  AFAICT it
> would be much more efficient to perform the ORDER *after* performing
> DISTINCT (or REDUCED).
> 
> For example, there may be 100 million solutions but only 20 distinct
> solutions.  If SORT is done first, then all 100 million solutions must
> be held in memory at once and sorted, before later reducing the result
> set to 20.  But if DISTINCT is done first, then duplicate solutions can
> be discarded *as* they are generated, so only 20 need to be held in
> memory.
> 
> Granted, since an implementation is free to optimize things however it
> wishes anyway -- provided that it still produces the correct results --
> it could perform the ORDER modifier after the DISTINCT and REDUCED
> modifiers even if the spec says to do them the other way around.  So in
> that sense there is no absolute need to change the order of these
> modifiers in the SPARQL specification.  However, since most
> implementations will first want to ensure correctness and later
> optimize, I think it would be helpful to change the SPARQL
> specification
> to say that ORDER is done after REDUCED.   Indeed, the reason I ran
> into
> this is that the implementation that I'm currently using (Parliament
> 2.7.1) appears to be doing the SORT first (per the spec), thus causing
> a
> query that I thought should have run quite efficiently to instead run
> the system out of memory.
> 
> Thanks!
> 
> 
> --
> David Booth, Ph.D.
> http://dbooth.org/
> 
> Opinions expressed herein are those of the author and do not
> necessarily
> reflect those of his employer.
Received on Wednesday, 14 September 2011 22:18:58 UTC