
Re: Proposed reply to Chris Wilper: Real-world use case for 3.10

From: Tom Adams <tom@tucanatech.com>
Date: Tue, 17 Aug 2004 10:18:10 -0400
Message-Id: <446928E6-F058-11D8-AB7A-000A95C9112A@tucanatech.com>
To: public-rdf-dawg@w3.org

Hi all,

Here's a new response to Chris for the group to chew on, based on my 
earlier comments and Andy's suggestions - the complete text is below.

Andy: I'd appreciate your comments on this in light of your previous 
remarks - I think you understand some of the deeper issues better than 
I do - especially on anything that I failed to address in my questions.

----

Hi Chris,

Apologies for the late reply, but thanks for your posting to the DAWG 
comments list. The DAWG is always happy to receive comments and use 
cases on its proposed requirements.

The requirement you noted was moved from PENDING to APPROVED at the 
DAWG face to face on the 15th July. You can view the details at:

http://www.w3.org/2001/sw/DataAccess/ftf2#req

I would, however, like to ask you for more information regarding the 
use case you suggest.

> We have a large collection of metadata in a triplestore that we want 
> to make available to people through a set of queries.  Someone 
> typically asks, "Give me the metadata that has changed since last week 
> and is in XYZ collection", or simply, "Give me all the metadata".
>
> It is a requirement for us that the responses can come in chunks: XML 
> is sent over the wire, and rather than require all of our clients (and 
> our server) to be able to handle arbitrarily large chunks of xml in 
> one stream, our server can be configured to give only, say 1,000 
> responses, along with a
> "resumption token" that can be used for subsequent requests.

Firstly, some questions: do you require sorting on the results being 
returned, or do you simply render the results as they come back from 
the store? Would you be happy sorting on the client, or is that too 
much of an overhead?

Are the chunks of data that come across something that you see the 
client requesting - as in LIMIT/OFFSET - and only displaying that 
small chunk, or something that can be handled by the protocol - as in 
the server sending back a small page of results at a time as needed? I 
guess I'm asking whether you want to be able to hold the entire result 
set in some data structure on the client, or whether it is OK to get 
only a limited subset, which makes it easier for the client to handle 
(i.e. less memory usage).
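To make sure we're talking about the same thing: by client-driven 
LIMIT/OFFSET paging I mean something like the following Python sketch, 
where `run_query` and the query syntax are invented for illustration:

```python
def fetch_page(run_query, base_query, page, page_size=50):
    """Fetch one slice of a result set, LIMIT/OFFSET style.

    `run_query` is a hypothetical callable that sends a query string
    to the store and returns a list of result rows.
    """
    offset = page * page_size
    return run_query(f"{base_query} LIMIT {page_size} OFFSET {offset}")


def fetch_all(run_query, base_query, page_size=50):
    """Walk the whole result set one page at a time."""
    page = 0
    while True:
        rows = fetch_page(run_query, base_query, page, page_size)
        yield from rows
        if len(rows) < page_size:  # short page means we've hit the end
            break
        page += 1
```

Note that with no state kept on the server, a concurrent update 
between two page requests can cause rows to be skipped or duplicated - 
which is exactly the stability question below.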

You mention the use of a "resumption token"; this implies some state 
keeping on the server, whereby the client queries and gets a subset of 
the results. When they want more, they issue some token (possibly 
implicitly stored in a session) that allows the server to resume 
sending the results where the client left off. In your use case, does 
the client need to have a stable result set returned? That is, is it 
OK for the client to actually re-query, such that when doing a "LIMIT 
50 OFFSET 0" and subsequently a "LIMIT 50 OFFSET 50" they may get 
different results because the underlying data has changed?

I think your idea of a resumption token implies that you'd like a 
stable result set, but I wanted to be sure. Keeping state around is 
probably a large overhead for the server.
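For the record, the sort of server-side state keeping I have in mind 
looks roughly like this (a Python sketch, all names invented; a real 
server would also need to expire abandoned sessions):

```python
import uuid


class PagedQueryServer:
    """Hypothetical server that snapshots a result set on the first
    request, so later pages are stable even if the data changes."""

    def __init__(self, page_size=1000):
        self.page_size = page_size
        self._sessions = {}  # token -> (snapshot, next offset)

    def query(self, results):
        """Start a session over a fully materialised result list."""
        token = uuid.uuid4().hex
        self._sessions[token] = (list(results), 0)
        return self.resume(token)

    def resume(self, token):
        """Return the next page plus a token if more pages remain."""
        results, offset = self._sessions.pop(token)
        page = results[offset:offset + self.page_size]
        if offset + self.page_size < len(results):
            self._sessions[token] = (results, offset + self.page_size)
            return page, token   # more pages to come
        return page, None        # result set exhausted
```

The cost is plain to see: every open session pins a full copy of the 
result set in server memory until the client finishes or times out.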

> Without the ability to specify LIMITS/OFFSETS with the triplestore 
> query, we would need to stream everything to disk and manage much more 
> state within our application.

I believe that you are currently using Kowari as your triplestore; is 
the LIMIT and OFFSET functionality offered by Kowari what you envisage 
here? Does Kowari push enough of the details off onto the server, or 
are you still managing too much state on the client?

The DAWG charter makes mention of cursors on a result set:

http://www.w3.org/2003/12/swa/dawg-charter#protocol

I'd be interested in hearing your take on this, and whether this 
applies to your situation.
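As I read it, a cursor in the charter's sense is a lazily consumed 
result stream - roughly the following Python sketch, with the store's 
row source left abstract:

```python
from itertools import islice


def cursor(rows, fetch_size=100):
    """Wrap an iterator of result rows so the client consumes them in
    small batches, never holding the whole result set in memory.

    `rows` stands in for whatever the store's streaming interface
    yields; rows are pulled only as each batch is requested.
    """
    rows = iter(rows)
    while True:
        batch = list(islice(rows, fetch_size))
        if not batch:
            return
        yield batch
```

Unlike LIMIT/OFFSET re-querying, the cursor holds one query's results 
open on the server side for as long as the client keeps reading.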

More generally, I think the WG would like more information on your use 
case, including anything important that you feel I may have missed in 
my questions above.

Thanks Chris, keep the comments coming!

Cheers,
Tom

>> Chris asks for LIMIT and OFFSET in order to do client-side control of 
>> the flow of results.
>>
>> "3.10 Result Limits" is approved.
>> "3.12 Streaming Results" is approved.
>>
>> We also noted the relationship to sorting matters.  But this isn't 
>> LIMIT and OFFSET where the client asks for just a slice of the 
>> results, and then comes back for another slice later.  The slices 
>> asked for need not be in order, so result set stability across calls 
>> might be expected (transactions?).
>
> I don't think transactions are needed, but some kind of session-based 
> state keeping would be required.
>
>> It may be in Chris's use case that the client will ask for chunks in 
>> order, in which case streaming using a suitable XML encoding (that 
>> is, the whole document does not need to be stored before further 
>> processing) and LIMIT may be sufficient because the client can 
>> influence the results sufficiently, but it isn't what he is asking 
>> for.
>>
>> Illustration: Google lists for first 10 results, then you can jump 
>> around the "result set" using the page links at the bottom.
>
> I think that this may be what he's looking for.
>
>> Example: One style of facetted browser shows the first N results 
>> when the user has a lot of items in a category.  The client UI never 
>> retrieves the whole result set, so LIMIT alone is a win.
>>
>> The limitations on JDBC drivers noted in the F2F minutes apply in 
>> the default configuration.  Having streamed results has consequences 
>> - for MySQL that means locking over the length of time that the 
>> results are active, with possibly adverse effects on the overall 
>> system performance.
>
> I'll defer to Simon on how Kowari handles this internally, perhaps 
> this can shed some light on the discussion, though perhaps he's 
> already covered it anecdotally.
>
>> I would like to understand Chris's use case better.  The use case has 
>> the client and server quite tightly designed together and possibly 
>> deployed.  It does not sound like a general browser-ish UI applied to 
>> some unknown RDF store.  It may be that LIMIT+Streaming is 
>> sufficient (not ideal, but tolerable)?  Alternatively, it may be 
>> that we need a different level in the protocol, with a simple, 
>> general web-wide one-query, one-response mode and then a more 
>> complex one for closer associations of client and server.
>
> I think Chris is after a combination of LIMIT and OFFSET. I know that 
> he's discussed this issue in the past on the Kowari list, and has just 
> posted a contribution (KModel), so I imagine this is what he's using.
>
> But yes, we need to find out more information on what he is doing. 
> I'll add asking for more information to my email.
>
>> We should also note charter item "2.3 Cursors and proofs" (I don't 
>> understand why cursors and proofs are lumped together).
>
> You're on the ball as ever :)
>
> Cheers,
> Tom
>
>
>> Tom Adams wrote:
>>
>>> Below is an outline of my proposed reply to Chris Wilper on his use 
>>> case for requirement 3.10, posted to 
>>> public-rdf-dawg-comments@w3.org.
>>> ----
>>> Hi Chris,
>>> Thanks for your posting to the DAWG comments list. The DAWG is 
>>> always happy to receive comments and use cases on its proposed 
>>> requirements.
>>> The requirement you noted was moved from PENDING to APPROVED at the 
>>> DAWG face to face on the 15th July. You can view the details at:
>>> http://www.w3.org/2001/sw/DataAccess/ftf2#req
>>> Keep the comments coming!
>>> Cheers,
>>> Tom
>>> On 06/07/2004, at 1:25 PM, Chris Wilper wrote:
>>>> Hi,
>>>>
>>>> Looking at the Requirements/Use Cases document, I noticed that 3.10 
>>>> and 3.10a
>>>> had "Pending" status.  We[1] plan on using an rdf triplestore to 
>>>> back a large
>>>> metadata repository, exposed to other systems via the OAI-PMH[2].  
>>>> While not
>>>> being too domain and protocol-specific here, I'll describe our case:
>>>>
>>>> We have a large collection of metadata in a triplestore that we 
>>>> want to make available to people through a set of queries.  
>>>> Someone typically asks, "Give me the metadata that has changed 
>>>> since last week and is in XYZ collection", or simply, "Give me 
>>>> all the metadata".
>>>>
>>>> It is a requirement for us that the responses can come in chunks: 
>>>> XML is sent over the wire, and rather than require all of our 
>>>> clients (and our server) to be able to handle arbitrarily large 
>>>> chunks of xml in one stream, our server can be configured to give 
>>>> only, say 1,000 responses, along with a "resumption token" that 
>>>> can be used for subsequent requests.
>>>>
>>>> Without the ability to specify LIMITS/OFFSETS with the triplestore 
>>>> query, we would need to stream everything to disk and manage much 
>>>> more state within our application.
>>>>
>>>> [1] http://www.fedora.info/ and http://www.nsdl.org/
>>>> [2] OAI-PMH is a protocol for exposing xml metadata in a repository.
>>>>    See http://www.openarchives.org/OAI/openarchivesprotocol.html
>>>>
>>>> ___________________________________________
>>>> Chris Wilper
>>>> Cornell Digital Library Research Group
>>>> http://www.cs.cornell.edu/~cwilper/
Received on Tuesday, 17 August 2004 14:18:13 GMT
