- From: Tom Adams <tom@tucanatech.com>
- Date: Tue, 07 Sep 2004 14:00:38 +0000
- To: "Chris Wilper" <cwilper@cs.cornell.edu>
- Cc: public-rdf-dawg-comments@w3.org
Hi Chris,

Apologies for the late reply, but thanks for your posting to the DAWG comments list. The DAWG is always happy to receive comments and use cases on its proposed requirements.

The LIMIT requirement you noted was moved from PENDING to APPROVED at the DAWG face-to-face on the 15th of July. You can view the details at:

http://www.w3.org/2001/sw/DataAccess/ftf2#req

I would however like to ask you for more information regarding the use case you suggest.

> We have a large collection of metadata in a triplestore that we want
> to make available to people through a set of queries. Someone
> typically asks, "Give me the metadata that has changed since last week
> and is in XYZ collection", or simply, "Give me all the metadata".
>
> It is a requirement for us that the responses can come in chunks: XML
> is sent over the wire, and rather than require all of our clients (and
> our server) to be able to handle arbitrarily large chunks of xml in
> one stream, our server can be configured to give only, say 1,000
> responses, along with a
> "resumption token" that can be used for subsequent requests.

Firstly, some questions: do you require sorting on the results being returned, or do you simply render the results as they come back from the store? Would you be happy sorting on the client, or is that too much of an overhead?

Are the chunks of data that come across something that you see the client requesting (as in LIMIT/OFFSET) and only displaying that small chunk, or is it something that can be handled by the protocol (as in the server sending back a small page of results at a time as needed)? I guess I'm asking whether you want to be able to hold the entire result set in some data structure on the client, or whether it is OK to only get a limited subset, which makes it easier for the client to handle (i.e. less memory usage).

You mention the use of a "resumption token"; this implies some state-keeping on the server, whereby the client queries and gets a subset of results. When the client wants more, it issues some token (possibly implicitly stored in a session) that allows the server to resume sending the results where the client left off.

In your use case, does the client need to have a stable result set returned? That is, is it OK for the client to actually re-query, such that when doing a "LIMIT 50 OFFSET 0" and subsequently a "LIMIT 50 OFFSET 50", it may get different results because the underlying data has changed? I think your idea of a resumption token implies that you'd like a stable result set, but I wanted to be sure. Keeping state around is probably a large overhead for the server.

> Without the ability to specify LIMITS/OFFSETS with the triplestore
> query, we would need to stream everything to disk and manage much more
> state within our application.

I believe that you are currently using Kowari as your triplestore; is the LIMIT and OFFSET functionality offered by Kowari what you envisage here? Does Kowari push enough of the details off onto the server, or are you still managing too much state on the client?

Also, does streaming (3.12) help your use case? Streaming would allow your client to process some of the results without having to receive all of them first, giving you the effect of working over the entire result set without holding it all in memory at once. It gives a stable result set, and queries are not reissued.
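To make the client-driven (LIMIT/OFFSET) option a bit more concrete, here is a rough Python sketch of the paging loop I have in mind. It is illustrative only: run_query() below slices a fake in-memory result set the way LIMIT/OFFSET would, standing in for whatever call your client would actually make against the store.

    PAGE_SIZE = 50

    # Fake store: 230 rows standing in for query results, so the sketch
    # runs on its own without a real triplestore behind it.
    FAKE_RESULTS = [("record-%03d" % i, "2004-08-31") for i in range(230)]

    def run_query(limit, offset):
        """Stand-in for issuing 'LIMIT <limit> OFFSET <offset>' to the store;
        returns one page of rows, or an empty list when the results run out."""
        return FAKE_RESULTS[offset:offset + limit]

    def fetch_pages():
        offset = 0
        while True:
            # Each page is an independent query; the server keeps no state
            # between requests. If the underlying data changes, consecutive
            # pages may overlap or skip rows; that is the "stable result set"
            # question above. Sorting the results at least makes the paging
            # deterministic against an unchanged store.
            page = run_query(PAGE_SIZE, offset)
            if not page:
                break
            yield page               # hand back one page at a time, so the
            offset += PAGE_SIZE      # client's memory use stays bounded

    for page in fetch_pages():
        print("got %d rows starting at %s" % (len(page), page[0][0]))

A resumption token, by contrast, effectively keeps that offset (or a cursor over a materialised result set) on the server side, which is where the state-keeping overhead I mentioned comes in.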
The DAWG charter makes mention of cursors on a result set:

http://www.w3.org/2003/12/swa/dawg-charter#protocol

I'd be interested in hearing your take on this, and whether this applies to your situation.

I think more generally, the WG would like more information on your use case, including anything important that you feel I may have missed in my questions above.

Thanks Chris, keep the comments coming!

Cheers,
Tom

On 06/07/2004, at 1:25 PM, Chris Wilper wrote:

> Hi,
>
> Looking at the Requirements/Use Cases document, I noticed that 3.10 and 3.10a
> had "Pending" status. We[1] plan on using an rdf triplestore to back a large
> metadata repository, exposed to other systems via the OAI-PMH[2]. While not
> being too domain and protocol-specific here, I'll describe our case:
>
> We have a large collection of metadata in a triplestore that we want to
> make available to people through a set of queries. Someone typically asks,
> "Give me the metadata that has changed since last week and is in XYZ
> collection", or simply, "Give me all the metadata".
>
> It is a requirement for us that the responses can come in chunks: XML is
> sent over the wire, and rather than require all of our clients (and our
> server) to be able to handle arbitrarily large chunks of xml in one stream,
> our server can be configured to give only, say 1,000 responses, along with a
> "resumption token" that can be used for subsequent requests.
>
> Without the ability to specify LIMITS/OFFSETS with the triplestore query, we
> would need to stream everything to disk and manage much more state within
> our application.
>
> [1] http://www.fedora.info/ and http://www.nsdl.org/
> [2] OAI-PMH is a protocol for exposing xml metadata in a repository.
>     See http://www.openarchives.org/OAI/openarchivesprotocol.html
>
> ___________________________________________
> Chris Wilper
> Cornell Digital Library Research Group
> http://www.cs.cornell.edu/~cwilper/
>

--
Tom Adams                 | Tucana Technologies, Inc.
Support Engineer          | Office: +1 703 871 5312
tom@tucanatech.com        | Cell:   +1 571 594 0847
http://www.tucanatech.com | Fax:    +1 877 290 6687
------------------------------------------------------
Received on Tuesday, 7 September 2004 14:45:18 UTC