Re: Proposed reply to Chris Wilper: Real-world use case for 3.10 from Seaborne, Andy on 2004-08-23 (public-rdf-dawg@w3.org from July to September 2004)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Mon, 23 Aug 2004 11:52:22 +0100
To: Tom Adams <tom@tucanatech.com>
CC: public-rdf-dawg@w3.org
Message-ID: <4129CC66.9050801@hp.com>

Tom Adams wrote:
> 
> Hi all,
> 
> A new response to Chris, based on my earlier comments, and Andy's 
> suggestions, to chew on - see copious complete text below.
> 
> Andy: I'd appreciate your comments on this in light of your previous 
> remarks, I think you understand some of the deeper issues better than 
> me, especially anything that I failed to address in my questions.

It would be worth asking whether streaming (3.12) helps the use case. 
Streaming allows the client to process some of the results, without 
having to receive all of them first.

Streaming can be used to get the effect of working over the entire 
result set without needing to receive it all first.   It gives a stable 
result set, and queries are not reissued.  Streaming is similar (to the 
client) to LIMIT/OFFSET when the blocks are asked for in order.
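
A minimal sketch of that comparison, as an illustration only. The
in-memory `store`, the `run_query` helper and the block size of two are
all assumptions made for the example; they are not part of any DAWG
draft or of a real server. The streaming client reads blocks from one
evaluation of the query, while the paging client reissues the query for
every block and so can see a changed store.

```python
from itertools import islice


def run_query(store):
    """Hypothetical query: snapshot the matches in a fixed (sorted) order."""
    return sorted(store)


def stream(store):
    """Streaming: the query is evaluated once, results are read in order."""
    yield from run_query(store)


def page(store, limit, offset):
    """LIMIT/OFFSET: each block reissues the query against the current store."""
    return run_query(store)[offset:offset + limit]


# Streaming client: two blocks taken from a single, stable result set.
stream_store = {"a", "b", "c", "d"}
results = stream(stream_store)
first_block = list(islice(results, 2))
stream_store.add("aa")  # the data changes between blocks
second_block = list(islice(results, 2))

# Paging client: the same change happens between the two requests.
page_store = {"a", "b", "c", "d"}
page_one = page(page_store, limit=2, offset=0)
page_store.add("aa")
page_two = page(page_store, limit=2, offset=2)

print(first_block, second_block)  # ['a', 'b'] ['c', 'd']
print(page_one, page_two)         # ['a', 'b'] ['b', 'c']
```

In this sketch the streamed blocks cover the original results exactly
once, whereas the reissued query repeats "b" and never returns "d".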

It still isn't easy to have servers that do streaming well in all cases, 
but it does mean that some flow control is addressed at the lower level, 
and some is just server-internal implementation; it's not a matter of 
the query protocol, and hence not a requirement on all servers.
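
A rough sketch of the flow-control point, assuming a pull-style producer.
The `server_results` generator and the `produced` log are hypothetical
names used only for illustration; in a real deployment the back-pressure
would presumably come from the transport rather than from a Python
generator. The idea shown is that rows are produced only as fast as the
client consumes them, without anything in the query itself saying so.

```python
from itertools import islice

produced = []  # records which rows the hypothetical server actually computed


def server_results(total):
    """Hypothetical server-side producer: a row is built only when pulled."""
    for i in range(total):
        produced.append(i)
        yield f"row-{i}"


rows = server_results(1_000_000)

# The client processes the first three results and then pauses.
first_rows = list(islice(rows, 3))

print(first_rows)     # ['row-0', 'row-1', 'row-2']
print(len(produced))  # 3
```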

	Andy

> 
> ----
> 
> Hi Chris,
> 
> Apologies for the late reply, but thanks for your posting to the DAWG 
> comments list. The DAWG is always happy to receive comments and use 
> cases on its proposed requirements.
> 
> The requirement you noted was moved from PENDING to APPROVED at the DAWG 
> face to face on the 15th July. You can view the details at:
> 
> http://www.w3.org/2001/sw/DataAccess/ftf2#req

It is worth noting that what was approved is LIMIT, not LIMIT & OFFSET.

> 
> I would however like to ask you for more information regarding the use 
> case you suggest.
> 
>> We have a large collection of metadata in a triplestore that we want 
>> to make available to people through a set of queries.  Someone 
>> typically asks, "Give me the metadata that has changed since last week 
>> and is in XYZ collection", or simply, "Give me all the metadata".
>>
>> It is a requirement for us that the responses can come in chunks: XML 
>> is sent over the wire, and rather than require all of our clients (and 
>> our server) to be able to handle arbitrarily large chunks of xml in 
>> one stream, our server can be configured to give only, say 1,000 
>> responses, along with a
>> "resumption token" that can be used for subsequent requests.
> 
> 
> Firstly, some questions: do you require sorting on the results being 
> returned? Or do you simply render the results as they come back from the 
> store? Would you be happy sorting on the client? Or is that too much of 
> an overhead?

Good point. Information on the whole area of sorting would be good, 
separately from this comment.  e.g. sorting strings against URIs and 
numbers.

> 
> Are the chunks of data that come across something that you see the 
> client requesting - as in LIMIT/OFFSET - and only displaying that small 
> chunk, or is it something that can be handled by the protocol - as in 
> the server sending back a small page of results at a time as needed? I 
> guess I'm asking if you want to be able to hold the entire results in 
> some data structure on the client, or whether it is OK to only get a 
> limited subset, which makes it easier for the client to handle (i.e. 
> less memory usage)?
> 
> You mention the use of a "resumption token", this implies some state 
> keeping on the server, whereby the client queries, and gets a subset of 
> results. When they want more, they issue some token (possibly implicitly 
> stored in a session) that allows the server to resume sending the 
> results where the client left off. In your use case, does the client 
> need to have a stable result set returned? i.e. Is it OK for the client 
> to actually re-query, such that when doing a "LIMIT 50 OFFSET 0", and 
> subsequently a "LIMIT 50 OFFSET 50", they may get different results 
> because the underlying data has changed?
> 
> I think your idea of a resumption token implies that you'd like a stable 
> result set, but I wanted to be sure. Keeping state around is probably a 
> large overhead for the server.
> 
>> Without the ability to specify LIMITS/OFFSETS with the triplestore 
>> query, we would need to stream everything to disk and manage much more 
>> state within our application.
> 
> 
> I believe that you are currently using Kowari as your triplestore, is 
> the LIMIT and OFFSET functionality offered by Kowari what you envisage 
> here? Does Kowari push enough of the details off onto the server? Or are 
> you still managing too much state on the client?
> 
> The DAWG charter makes mention of cursors on a result set:
> 
> http://www.w3.org/2003/12/swa/dawg-charter#protocol
> 
> I'd be interested in hearing your take on this, and whether this applies 
> to your situation.
> 
> I think more generally, the WG would like more information on your use 
> case, including anything important that you feel I may have missed in my 
> questions above.
> 
> Thanks Chris, keep the comments coming!
> 
> Cheers,
> Tom

Received on Monday, 23 August 2004 10:53:05 UTC