RE: LIMITs from Seaborne, Andy on 2004-06-11 (public-rdf-dawg@w3.org from April to June 2004)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Fri, 11 Jun 2004 12:03:55 +0100
To: Kevin Wilkinson <wilkinson@hpl.hp.com>, RDF Data Access Working Group <public-rdf-dawg@w3.org>
Cc: "Eric Prud'hommeaux" <eric@w3.org>
Message-ID: <E864E95CB35C1C46B72FEA0626A2E808036155D3@0-mail-br1.hpl.hp.com>
I had been thinking of LIMIT as network bandwidth reducing.  A server could
well implement it by sendign the N results then aborting the query
internally.  I hadn't though of the impliciations in multi-tier systems but
with a more external view of the effect of LIMIT, could we regard it as more
of a hint internally?  Some systems may be able to make use of it (e.g and
RDF graph stored in DB, no intermediate tiers).

More below:

-------- Original Message --------
> From: Kevin Wilkinson <>
> Date: 10 June 2004 00:12
> 
> re: requirement 3.10, i'll restate my concern about making
> the LIMIT an exact bound on the number of "results" returned.
> i think an exact limit may have a negative impact on performance
> for a common query processing scenario. i think it should be
> an upper bound. here's why.
> 
> assumption: in a common scenario, the query processing is distributed
>    among a client module (providing a query API to an application
>    program) and one or more servers (data sources), likely accessed
>    across a network. the client module may do additional processing
>    over the results from the servers (e.g., for inferencing, union,
>    whatever) and this may add or remove results.

Good point.

> 
> assumption: the primary motivation of the LIMIT clause is to
>    reduce query latency for queries where the result size is unknown
>    or known to be undesirably large.

I saw it can bandwidth limitign and the client protecting itself.

In the absense of SORT() or some aggregate like COUNT(), the LIMIT on the
DAWG request may well help some systems avoiding creating the entire result
set.  It may be no help but it has not done harm.

> 
> to support an exact limit, the client module first decomposes the query
> and dispatches subqueries to the servers. to retrieve the results,
> the client module must open a stream connection and keep that stream
> open until the LIMIT is reached by the client module.
> 
> the problems with this approach are (1) creating a stream connection
> is more expensive than returning a single batch of results (2) in the
> common case of JDBC access to a database, the server is required to
> materialize the entire result set of the subquery. so, the LIMIT clause
> cannot be extended to the servers. consequently, the application may see
> little or no reduction in query latency (which is a primary motivation
> for the LIMIT clause).

If the web is slow, the client would see the end-of-request sooner.  I agree
that it does not necessrily help the server in reducing work but it does
reduce the amount of bandwidth if the results are truncated at the limit.

> 
> an alternative approach would allow the client module to pass a LIMIT
> clause on to the servers. the servers would then process their
> subqueries
> with that LIMIT (which could be larger than the original query LIMIT)
> and return that result as one batch to the client module.
> 
> the problem with the alternative is that, if the client module does
> a lot of filtering on the partial results, then the LIMIT is not
> reached, despite the fact that are more matching results available on
> the servers.
> 
> the alternative has much better behavior with respect to system
> resources and latency because it can pass the LIMIT clause to the
> servers. since this is the primary motivation for the LIMIT clause
> (faster response time), i prefer the alternative. in the case that
> the client module filters out too many results (which, i expect
> would be rare), i propose that an error condition or indicator
> be raised to the application which can then increase the LIMIT.
> 
> as an aside, we chose not to support aggregates. yet, making
> LIMIT an exact bound makes it a type of aggregate query (it's
> really doing a count query).
> 
> kevin
> 
> 
> 
> Eric Prud'hommeaux wrote:
> > 
> > On Wed, Jun 09, 2004 at 02:07:21PM +0100, Seaborne, Andy wrote:
> > > 
> > > > > -- 3.10 Result Limits
> > > > > 
> > > > > straw poll: who's convinced this is a requirements? are we close
> > > > > to a decision here?
> > > > AndyS: RDQL does not have a limit. Joseki provides limits.
> > > > PatH: it could mean 2 things: I don't want more than 10 answers
> > > >   and the next ones after that or *just* 10
> > > 
> > > It occurred to me that one way of differentiating between exactly
> > > 10 and more than 10 (and an unknown number) when we have "LIMIT 10"
> > > is to ask for 
> > > 11.  If 11 results return up, it is more than 10.  Otherwise, it is
> > > exact and 10 or less.  This is making the (trailing) flag we
> > > discussed a matter of whether the 11th slot gets filled at all.
> > > 
> > > Hence, no requirement for a special flag and the client can decide
> > > whether to ask for 11 if it needes to know the difference.  It
> > > still might be a better design overall to have a flag.
> > 
> > This came up during the telecon. I think the outcome was that it
> > depended on what extra bit you had in the respones. It could be
> > a boolean say there were more results; it could be a count of
> > those results:
> > 
> > query   response        server  network
> > limit   extra bit       effort  burden
> > -----   ---------       ------  ------
> > none    none            full    full
> > 10      none            light1  light
> > 10      boolean         light2  light
> > 11      none            light2  light
> > 10      count           full    light
> > 
> > where "light" meant the effort was shortcutted by the limit.
> > "light2" means the server had to compute one more tuple (or
> > the existance of) than "light1".
> > 
> > In the above, I assume tuples as the unit of measure. Rob raised
> > the issue that (correct me here) the client could be doing some
> > inferencing which would give more [or fewer, i note] results than
> > specified in the limit. I think that's fine, that it does't reduce
> > the utility of specifying a limit.
> > 
> > > > AndyS: seen this to prevent error conditions
> > > > EricP: SQL has it
> > > > RobS: there are other things in SQL that we don't have
> > > > Steve: can affect latency
> > > > AndyS: can be either a query or a protocol issue
> > > > EricP: arguing from the point of view of a straightforward
> > > >   implementation (for both client and server)
> > > > Yoshio: but then we need a way to specify the snapshot of the
> > > > database RobS: what is understood as a "result"
> > > > Kevin: do we want an exact upper bound? could cause more impl
> > > > burden on the server DanC: does "upper bound" appeal to somebody?
> > > > Kevin: yes
> > > > FarrukhNajmi: +1
> > > > PatH: DQL could, by request, give a count of results
> > > > DanC: motivation to spend time on this is in
> > > > discussing/bounderaries with other groups
> > 
> > --
> > -eric
> > 
> > office: +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
> > cell:   +1.857.222.5741
> > 
> > (eric@w3.org)
> > Feel free to forward this message to any list for any purpose other
> > than email address distribution.
Received on Friday, 11 June 2004 07:09:15 UTC