- From: Kevin Wilkinson <wilkinson@hpl.hp.com>
- Date: Wed, 09 Jun 2004 16:12:28 -0700
- To: RDF Data Access Working Group <public-rdf-dawg@w3.org>
- Cc: "Eric Prud'hommeaux" <eric@w3.org>, "Seaborne, Andy" <andy.seaborne@hp.com>
re: requirement 3.10, i'll restate my concern about making the LIMIT an exact bound on the number of "results" returned. i think an exact limit may have a negative impact on performance in a common query processing scenario, and that it should be an upper bound instead. here's why.

assumption: in a common scenario, query processing is distributed between a client module (providing a query API to an application program) and one or more servers (data sources), likely accessed across a network. the client module may do additional processing over the results from the servers (e.g., for inferencing, union, whatever), and this may add or remove results.

assumption: the primary motivation of the LIMIT clause is to reduce query latency for queries where the result size is unknown or known to be undesirably large.

to support an exact limit, the client module first decomposes the query and dispatches subqueries to the servers. to retrieve the results, the client module must open a stream connection to each server and keep that stream open until it has accumulated LIMIT results. the problems with this approach are (1) creating a stream connection is more expensive than returning a single batch of results, and (2) in the common case of JDBC access to a database, the server is required to materialize the entire result set of the subquery. so the LIMIT clause cannot be passed on to the servers, and consequently the application may see little or no reduction in query latency, which is the primary motivation for the LIMIT clause in the first place.

an alternative approach would allow the client module to pass a LIMIT clause on to the servers. the servers would then process their subqueries with that LIMIT (which could be larger than the original query LIMIT) and return the result as one batch to the client module. the problem with the alternative is that, if the client module filters out many of the partial results, the LIMIT may not be reached even though more matching results are available on the servers.

still, the alternative has much better behavior with respect to system resources and latency, precisely because it can pass the LIMIT clause to the servers. since faster response time is the primary motivation for the LIMIT clause, i prefer the alternative. in the case where the client module filters out too many results (which i expect would be rare), i propose that an error condition or indicator be raised to the application, which can then increase the LIMIT and retry.

as an aside: we chose not to support aggregates, yet making LIMIT an exact bound makes it a kind of aggregate query (it's really doing a count query).
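to make the two approaches concrete, here's a rough sketch of the client module. all of the names (Server, Row, openStream, runWithLimit) are made up for illustration, not any particular API. approach (1) holds a stream open to each server until the exact LIMIT is met after client-side filtering; approach (2) pushes the LIMIT down to the servers, takes one batch from each, and accepts that filtering may leave fewer rows than asked for:

  import java.util.ArrayList;
  import java.util.Iterator;
  import java.util.List;

  class Row { }   // one query solution (a set of variable bindings)

  interface Server {
      // exact-limit case: the server keeps a cursor open for the client
      Iterator<Row> openStream(String subQuery);

      // pushdown case: the server evaluates the subquery with its own LIMIT,
      // returns one batch, and releases its resources
      List<Row> runWithLimit(String subQuery, int limit);
  }

  class ClientModule {
      // placeholder for client-side filtering/inferencing over partial results
      boolean keep(Row r) { return true; }

      // (1) exact bound: hold server streams open until exactly `limit`
      //     results survive client-side filtering
      List<Row> exactLimit(List<Server> servers, String subQuery, int limit) {
          List<Row> out = new ArrayList<>();
          for (Server s : servers) {
              Iterator<Row> it = s.openStream(subQuery);   // connection stays open
              while (it.hasNext() && out.size() < limit) {
                  Row r = it.next();
                  if (keep(r)) out.add(r);
              }
              if (out.size() >= limit) break;
          }
          return out;   // exactly `limit` results, if that many matches exist
      }

      // (2) upper bound: push the LIMIT down; each server answers with one batch
      List<Row> upperBoundLimit(List<Server> servers, String subQuery, int limit) {
          List<Row> out = new ArrayList<>();
          for (Server s : servers) {
              for (Row r : s.runWithLimit(subQuery, limit)) {  // one round trip, no cursor
                  if (out.size() < limit && keep(r)) out.add(r);
              }
          }
          // fewer than `limit` rows here may mean "no more data" or "filtered away";
          // an indicator could tell the application to retry with a larger LIMIT
          return out;
      }
  }

the point of (2) is that each server can stop after LIMIT matches and release its resources immediately, which is where the latency win comes from.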
kevin

Eric Prud'hommeaux wrote:
>
> On Wed, Jun 09, 2004 at 02:07:21PM +0100, Seaborne, Andy wrote:
> >
> > > -- 3.10 Result Limits
> > > >
> > > > straw poll: who's convinced this is a requirement? are we close
> > > > to a decision here?
> > > AndyS: RDQL does not have a limit. Joseki provides limits.
> > > PatH: it could mean 2 things: I don't want more than 10 answers
> > > and the next ones after that, or *just* 10
> >
> > It occurred to me that one way of differentiating between exactly 10
> > and more than 10 (and an unknown number) when we have "LIMIT 10" is to
> > ask for 11. If 11 results come back, it is more than 10. Otherwise, it
> > is exact and 10 or fewer. This is making the (trailing) flag we
> > discussed a matter of whether the 11th slot gets filled at all.
> >
> > Hence, no requirement for a special flag, and the client can decide
> > whether to ask for 11 if it needs to know the difference. It still
> > might be a better design overall to have a flag.
>
> This came up during the telecon. I think the outcome was that it
> depended on what extra bit you had in the response. It could be
> a boolean saying there were more results; it could be a count of
> those results:
>
>   query   response    server   network
>   limit   extra bit   effort   burden
>   -----   ---------   ------   -------
>   none    none        full     full
>   10      none        light1   light
>   10      boolean     light2   light
>   11      none        light2   light
>   10      count       full     light
>
> where "light" meant the effort was cut short by the limit, and
> "light2" means the server had to compute one more tuple (or the
> existence of one) than "light1".
>
> In the above, I assume tuples as the unit of measure. Rob raised
> the issue that (correct me here) the client could be doing some
> inferencing which would give more [or fewer, I note] results than
> specified in the limit. I think that's fine; it doesn't reduce
> the utility of specifying a limit.
>
> > > AndyS: seen this to prevent error conditions
> > > EricP: SQL has it
> > > RobS: there are other things in SQL that we don't have
> > > Steve: can affect latency
> > > AndyS: can be either a query or a protocol issue
> > > EricP: arguing from the point of view of a straightforward
> > > implementation (for both client and server)
> > > Yoshio: but then we need a way to specify the snapshot of the database
> > > RobS: what is understood as a "result"
> > > Kevin: do we want an exact upper bound? could cause more impl burden on
> > > the server
> > > DanC: does "upper bound" appeal to somebody?
> > > Kevin: yes
> > > FarrukhNajmi: +1
> > > PatH: DQL could, by request, give a count of results
> > > DanC: motivation to spend time on this is in discussing boundaries with
> > > other groups
>
> --
> -eric
>
> office: +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
> cell:   +1.857.222.5741
>
> (eric@w3.org)
> Feel free to forward this message to any list for any purpose other than
> email address distribution.
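p.s. the "ask for 11" idea quoted above amounts to something like the following on the client side (again just a sketch; QueryService and Page are made-up names, not a real API):

  import java.util.List;

  class LimitProbe {
      // stand-in for whatever query interface the client module exposes
      interface QueryService {
          List<Object> select(String query, int limit);
      }

      static final class Page {
          final List<Object> rows;   // at most `limit` rows
          final boolean more;        // true if the (limit+1)-th slot was filled
          Page(List<Object> rows, boolean more) { this.rows = rows; this.more = more; }
      }

      // ask the server for limit+1 rows; the extra row acts as the "more results" flag
      static Page fetch(QueryService svc, String query, int limit) {
          List<Object> rows = svc.select(query, limit + 1);
          boolean more = rows.size() > limit;
          return new Page(more ? rows.subList(0, limit) : rows, more);
      }
  }

in terms of the table above, this is the "limit 11, no extra bit" row: the server does the work for one extra tuple ("light2") and no flag is needed in the response.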
Received on Wednesday, 9 June 2004 19:11:08 UTC