Re: Minutes of RDF DAWG telecon Tuesday 2004-06-08 for review

re: requirement 3.10, i'll restate my concern about making
the LIMIT an exact bound on the number of "results" returned.
i think an exact limit may have a negative impact on performance
for a common query processing scenario. i think it should be
an upper bound. here's why.

assumption: in a common scenario, the query processing is distributed
   among a client module (providing a query API to an application
   program) and one or more servers (data sources), likely accessed
   across a network. the client module may do additional processing
   over the results from the servers (e.g., for inferencing, union,
   whatever) and this may add or remove results.

assumption: the primary motivation of the LIMIT clause is to
   reduce query latency for queries where the result size is unknown
   or known to be undesirably large.

to support an exact limit, the client module first decomposes the query
and dispatches subqueries to the servers. to retrieve the results,
the client module must open a stream connection and keep that stream
open until the LIMIT is reached by the client module.

the problems with this approach are (1) creating a stream connection
is more expensive than returning a single batch of results (2) in the
common case of JDBC access to a database, the server is required to
materialize the entire result set of the subquery. so, the LIMIT clause
cannot be extended to the servers. consequently, the application may see
little or no reduction in query latency (which is a primary motivation
for the LIMIT clause).

an alternative approach would allow the client module to pass a LIMIT
clause on to the servers. the servers would then process their
subqueries
with that LIMIT (which could be larger than the original query LIMIT)
and return that result as one batch to the client module.

the problem with the alternative is that, if the client module does
a lot of filtering on the partial results, then the LIMIT is not
reached, despite the fact that are more matching results available on
the servers.

the alternative has much better behavior with respect to system
resources and latency because it can pass the LIMIT clause to the
servers. since this is the primary motivation for the LIMIT clause
(faster response time), i prefer the alternative. in the case that
the client module filters out too many results (which, i expect
would be rare), i propose that an error condition or indicator
be raised to the application which can then increase the LIMIT.

as an aside, we chose not to support aggregates. yet, making
LIMIT an exact bound makes it a type of aggregate query (it's
really doing a count query).

kevin

 

Eric Prud'hommeaux wrote:
> 
> On Wed, Jun 09, 2004 at 02:07:21PM +0100, Seaborne, Andy wrote:
> >
> > > > -- 3.10 Result Limits
> > > >
> > > > straw poll: who's convinced this is a requirements? are we close
> > > > to a decision here?
> > > AndyS: RDQL does not have a limit. Joseki provides limits.
> > > PatH: it could mean 2 things: I don't want more than 10 answers
> > >   and the next ones after that or *just* 10
> >
> > It occurred to me that one way of differentiating between exactly 10 and
> > more than 10 (and an unknown number) when we have "LIMIT 10" is to ask for
> > 11.  If 11 results return up, it is more than 10.  Otherwise, it is exact
> > and 10 or less.  This is making the (trailing) flag we discussed a matter of
> > whether the 11th slot gets filled at all.
> >
> > Hence, no requirement for a special flag and the client can decide whether
> > to ask for 11 if it needes to know the difference.  It still might be a
> > better design overall to have a flag.
> 
> This came up during the telecon. I think the outcome was that it
> depended on what extra bit you had in the respones. It could be
> a boolean say there were more results; it could be a count of
> those results:
> 
> query   response        server  network
> limit   extra bit       effort  burden
> -----   ---------       ------  ------
> none    none            full    full
> 10      none            light1  light
> 10      boolean         light2  light
> 11      none            light2  light
> 10      count           full    light
> 
> where "light" meant the effort was shortcutted by the limit.
> "light2" means the server had to compute one more tuple (or
> the existance of) than "light1".
> 
> In the above, I assume tuples as the unit of measure. Rob raised
> the issue that (correct me here) the client could be doing some
> inferencing which would give more [or fewer, i note] results than
> specified in the limit. I think that's fine, that it does't reduce
> the utility of specifying a limit.
> 
> > > AndyS: seen this to prevent error conditions
> > > EricP: SQL has it
> > > RobS: there are other things in SQL that we don't have
> > > Steve: can affect latency
> > > AndyS: can be either a query or a protocol issue
> > > EricP: arguing from the point of view of a straightforward
> > >   implementation (for both client and server)
> > > Yoshio: but then we need a way to specify the snapshot of the database
> > > RobS: what is understood as a "result"
> > > Kevin: do we want an exact upper bound? could cause more impl burden on
> > > the server
> > > DanC: does "upper bound" appeal to somebody?
> > > Kevin: yes
> > > FarrukhNajmi: +1
> > > PatH: DQL could, by request, give a count of results
> > > DanC: motivation to spend time on this is in discussing/bounderaries
> > > with
> > > other groups
> 
> --
> -eric
> 
> office: +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
> cell:   +1.857.222.5741
> 
> (eric@w3.org)
> Feel free to forward this message to any list for any purpose other than
> email address distribution.

Received on Wednesday, 9 June 2004 19:11:08 UTC