Re: requirement 3.10 (was Re: Minutes of RDF DAWG telecon Tuesday 2004-06-08 for review) from Eric Prud'hommeaux on 2004-06-11 (public-rdf-dawg@w3.org from April to June 2004)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Fri, 11 Jun 2004 13:33:08 +0900
To: Kevin Wilkinson <wilkinson@hpl.hp.com>
Cc: RDF Data Access Working Group <public-rdf-dawg@w3.org>, "Seaborne, Andy" <andy.seaborne@hp.com>
Message-ID: <20040611043308.GD12275@w3.org>
On Thu, Jun 10, 2004 at 04:36:18PM -0700, Kevin Wilkinson wrote:
> 
> on specifying an upper bound ...
> 
> note that federation is just the general case. the simple
> case has the same problem, where the simple case is a 3-tier
> architecture consisting of an RDF application talking
> to an RDF query processor talking to a database store
> containing RDF triples.
> 
> you suggested a work-around in which the query processor can
> pass the LIMIT to the database and then process the results
> (filter/infer/whatever). if the result set size is less than
> the LIMIT, the query processor should discard the result and
> resend the query to the database with a higher LIMIT.
> 
> this requires the query processor to completely buffer the
> result set before returning any partial result to the
> application (to ensure that the result set size reaches the
> LIMIT). is this acceptable, i.e., having to buffer the result?
> i thought not. it also forces a more complicated architecture
> on the query processor.

But how can you promise query satisfaction if you take advantage of
this upper bound? It doesn't seem sufficient to spit the query, ask
the for clause 1 LIMIT 10, ask for clause 2 LIMIT 10, and see if any
of them happen to line up.

I think that, if you don't assume the more complicated architecture,
you provide the client with only a statistical promise of success.
Even if you ask for clause 1 LIMIT 10 and then ask for clause 2
iteratively, using the bindings from the clause 1 results, you run
a risk that those particular clause 1 results don't happen to unify
with the KB answering clause 2.

> Eric Prud'hommeaux wrote:
> > 
> > On Thu, Jun 10, 2004 at 07:06:29AM +0900, Eric Prud'hommeaux wrote:
> > >
> > > In the above, I assume tuples as the unit of measure. Rob raised
> > > the issue that (correct me here) the client could be doing some
> > 
> > oops, sorry, make that Kevin.
> > 
> > On Wed, Jun 09, 2004 at 04:12:28PM -0700, Kevin Wilkinson wrote:
> > >
> > > re: requirement 3.10, i'll restate my concern about making
> > > the LIMIT an exact bound on the number of "results" returned.
> > > i think an exact limit may have a negative impact on performance
> > > for a common query processing scenario. i think it should be
> > > an upper bound. here's why.
> > 
> > I fear that this encourages returning 0 results when the results to your
> > federated queries didn't happen to line up. Specifics down 100 lines.
> > 
> > > assumption: in a common scenario, the query processing is distributed
> > >    among a client module (providing a query API to an application
> > >    program) and one or more servers (data sources), likely accessed
> > >    across a network. the client module may do additional processing
> > >    over the results from the servers (e.g., for inferencing, union,
> > >    whatever) and this may add or remove results.
> > >
> > > assumption: the primary motivation of the LIMIT clause is to
> > >    reduce query latency for queries where the result size is unknown
> > >    or known to be undesirably large.
> > >
> > > to support an exact limit, the client module first decomposes the query
> > > and dispatches subqueries to the servers. to retrieve the results,
> > > the client module must open a stream connection and keep that stream
> > > open until the LIMIT is reached by the client module.
> > >
> > > the problems with this approach are (1) creating a stream connection
> > > is more expensive than returning a single batch of results (2) in the
> > > common case of JDBC access to a database, the server is required to
> > > materialize the entire result set of the subquery. so, the LIMIT clause
> > > cannot be extended to the servers. consequently, the application may see
> > > little or no reduction in query latency (which is a primary motivation
> > > for the LIMIT clause).
> > 
> > I believe the spottiness of JDBC setMaxRows(int) implementation comes
> > from the lack of LIMIT in SQL92 entry or intermediate specifications.
> > JDBC drivers have to use server-dependent hacks to provide this
> > functionality.
> > 
> > This argues *for* putting such a function into DAWG-QL.
> > 
> > The MySQL JDBC driver could use their "LIMIT x" extension to SQL, but
> > it doesn't actually parse the SQL query and therefor doesn't know how
> > to stick the "LIMIT x" clause to implement setMaxRows or scrollable
> > result sets.
> > 
> > This argues that even if we put it in the QL, folks might be too lame
> > to use it.
> > 
> > > an alternative approach would allow the client module to pass a LIMIT
> > > clause on to the servers. the servers would then process their
> > > subqueries
> > > with that LIMIT (which could be larger than the original query LIMIT)
> > > and return that result as one batch to the client module.
> > >
> > > the problem with the alternative is that, if the client module does
> > > a lot of filtering on the partial results, then the LIMIT is not
> > > reached, despite the fact that are more matching results available on
> > > the servers.
> > >
> > > the alternative has much better behavior with respect to system
> > > resources and latency because it can pass the LIMIT clause to the
> > > servers. since this is the primary motivation for the LIMIT clause
> > > (faster response time), i prefer the alternative. in the case that
> > > the client module filters out too many results (which, i expect
> > > would be rare), i propose that an error condition or indicator
> > > be raised to the application which can then increase the LIMIT.
> > >
> > > as an aside, we chose not to support aggregates. yet, making
> > > LIMIT an exact bound makes it a type of aggregate query (it's
> > > really doing a count query).
> > 
> > I'm still fuzzy on aggregate queries. Can nth queries reference
> > variables used in the 1st query?
> > 
> > Suppose I want some people at the New York site with chronic conditions:
> > 
> > SELECT (who, dept)
> > AT <http://example.com/personel>
> > FROM (who p:department dept)
> >      (who p:site "New York")
> > USING p FOR <http://example.com/p-schema>
> > 
> > [ aggregate with ]
> > 
> > SELECT (illness)
> > AT <http://example.com/medRecords>
> > FROM (who m:record x)
> >      (x m:chronic illness)
> > USING m FOR <http://example.com/m-schema>
> > 
> > is the second "who" defined by the results of the first query?  It
> > seems that, regardless of whether users will know how to use them,
> > LIMITs on clause 1, clause 2, or the whole thing could have easily-
> > defined semantics. eg. A LIMIT on clause 1 controls how many whos
> > we are looking at. A LIMIT on clause 2 controls how many illnesses
> > we look FOR EACH who. A LIMIT on the whole thing controls how many
> > results we get back at the end of the day.
> > 
> > There is no way to say how a LIMIT on the whole thing would translate
> > to LIMIT 1 without knowing something about the data. Perhaps each
> > person is allowed no more than two chronic illnesses, but the very
> > nature of being in New Your guarantees at least one illness.
> > 
> > I think that the complexities of query federation should not prevent
> > our protocol/ql from having otherwise critical features. Whether LIMIT
> > is critical? Up to the jury.
> > 
> > SPECIFYING AN UPPER BOUND
> > 
> > I think the only way a federater can take advantage of the upper bound
> > approach is to carry the aggregate LIMIT to each of the sub clauses.
> > Suppose the user imposes an (aggregate) LIMIT of 10. We then search
> > for 10 employees in New York. We then impose the LIMIT on clause 2.
> > We are asking "of the 10 people I happened to bind in the first query,
> > give me up to 10 of their illnesses*." If illnesses are scarce (maybe
> > 1 in 21? not so good at stats), we are likely to return 0 answers.
> > This is not what the person asked for.
> > 
> > Instead I think we have to do what Kevin said, keep a stream open to
> > the first server and keep asking the second server for matches on each
> > batch of results from the first server until we reach the limit or the
> > end of the solutions from the first server. But this implies a stream,
> > and that makes me think of cursors and they are mentioned in the
> > charter as being out of scope.
> > 
> > A stateless way to do this is to passs the aggregate LIMIT to the
> > first (and second) server. If, at the end, we get too few results,
> > multiply by 10 and retry. Then multiply by 100...
> > 
> > Federation isn't easy. I don't think we have to write query
> > specifications that make it trivial to implement federaters.
> > 
> > * Names only, please; I don't want to gave the illnesses myself.
> > 
> > > Eric Prud'hommeaux wrote:
> > > >
> > > > On Wed, Jun 09, 2004 at 02:07:21PM +0100, Seaborne, Andy wrote:
> > > > >
> > > > > > > -- 3.10 Result Limits
> > > > > > >
> > > > > > > straw poll: who's convinced this is a requirements? are we close
> > > > > > > to a decision here?
> > > > > > AndyS: RDQL does not have a limit. Joseki provides limits.
> > > > > > PatH: it could mean 2 things: I don't want more than 10 answers
> > > > > >   and the next ones after that or *just* 10
> > > > >
> > > > > It occurred to me that one way of differentiating between exactly 10 and
> > > > > more than 10 (and an unknown number) when we have "LIMIT 10" is to ask for
> > > > > 11.  If 11 results return up, it is more than 10.  Otherwise, it is exact
> > > > > and 10 or less.  This is making the (trailing) flag we discussed a matter of
> > > > > whether the 11th slot gets filled at all.
> > > > >
> > > > > Hence, no requirement for a special flag and the client can decide whether
> > > > > to ask for 11 if it needes to know the difference.  It still might be a
> > > > > better design overall to have a flag.
> > > >
> > > > This came up during the telecon. I think the outcome was that it
> > > > depended on what extra bit you had in the respones. It could be
> > > > a boolean say there were more results; it could be a count of
> > > > those results:
> > > >
> > > > query   response        server  network
> > > > limit   extra bit       effort  burden
> > > > -----   ---------       ------  ------
> > > > none    none            full    full
> > > > 10      none            light1  light
> > > > 10      boolean         light2  light
> > > > 11      none            light2  light
> > > > 10      count           full    light
> > > >
> > > > where "light" meant the effort was shortcutted by the limit.
> > > > "light2" means the server had to compute one more tuple (or
> > > > the existance of) than "light1".
> > > >
> > > > In the above, I assume tuples as the unit of measure. Rob raised
> > > > the issue that (correct me here) the client could be doing some
> > > > inferencing which would give more [or fewer, i note] results than
> > > > specified in the limit. I think that's fine, that it does't reduce
> > > > the utility of specifying a limit.
> > > >
> > > > > > AndyS: seen this to prevent error conditions
> > > > > > EricP: SQL has it
> > > > > > RobS: there are other things in SQL that we don't have
> > > > > > Steve: can affect latency
> > > > > > AndyS: can be either a query or a protocol issue
> > > > > > EricP: arguing from the point of view of a straightforward
> > > > > >   implementation (for both client and server)
> > > > > > Yoshio: but then we need a way to specify the snapshot of the database
> > > > > > RobS: what is understood as a "result"
> > > > > > Kevin: do we want an exact upper bound? could cause more impl burden on
> > > > > > the server
> > > > > > DanC: does "upper bound" appeal to somebody?
> > > > > > Kevin: yes
> > > > > > FarrukhNajmi: +1
> > > > > > PatH: DQL could, by request, give a count of results
> > > > > > DanC: motivation to spend time on this is in discussing/bounderaries
> > > > > > with
> > > > > > other groups
> > > >
> > > > --
> > > > -eric
> > > >
> > > > office: +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
> > > > cell:   +1.857.222.5741
> > > >
> > > > (eric@w3.org)
> > > > Feel free to forward this message to any list for any purpose other than
> > > > email address distribution.
> > 
> > --
> > -eric
> > 
> > office: +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
> > cell:   +1.857.222.5741
> > 
> > (eric@w3.org)
> > Feel free to forward this message to any list for any purpose other than
> > email address distribution.

-- 
-eric

office: +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
cell:   +1.857.222.5741

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.
Received on Friday, 11 June 2004 00:33:00 UTC