- From: Tom Adams <tom@tucanatech.com>
- Date: Tue, 07 Sep 2004 14:00:38 +0000
- To: "Chris Wilper" <cwilper@cs.cornell.edu>
- Cc: public-rdf-dawg-comments@w3.org
Hi Chris,

Apologies for the late reply, but thanks for your posting to the DAWG comments list. The DAWG is always happy to receive comments and use cases on its proposed requirements.

The LIMIT requirement you noted was moved from PENDING to APPROVED at the DAWG face-to-face on the 15th of July. You can view the details at:

http://www.w3.org/2001/sw/DataAccess/ftf2#req

I would however like to ask you for more information regarding the use case you suggest.

> We have a large collection of metadata in a triplestore that we want
> to make available to people through a set of queries. Someone
> typically asks, "Give me the metadata that has changed since last week
> and is in XYZ collection", or simply, "Give me all the metadata".
>
> It is a requirement for us that the responses can come in chunks: XML
> is sent over the wire, and rather than require all of our clients (and
> our server) to be able to handle arbitrarily large chunks of xml in
> one stream, our server can be configured to give only, say 1,000
> responses, along with a
> "resumption token" that can be used for subsequent requests.

Firstly, some questions: do you require sorting on the results being returned, or do you simply render the results as they come back from the store? Would you be happy sorting on the client, or is that too much of an overhead?

Are the chunks of data that come across something that you see the client requesting (as in LIMIT/OFFSET) and only displaying that small chunk, or is it something that can be handled by the protocol (as in the server sending back a small page of results at a time as needed)? I guess I'm asking whether you want to be able to hold the entire result set in some data structure on the client, or whether it is OK to only get a limited subset, which makes it easier for the client to handle (i.e. less memory usage).

You mention the use of a "resumption token"; this implies some state-keeping on the server, whereby the client queries and gets a subset of results. When the client wants more, it issues some token (possibly implicitly stored in a session) that allows the server to resume sending the results where the client left off.

In your use case, does the client need to have a stable result set returned? That is, is it OK for the client to actually re-query, such that when doing a "LIMIT 50 OFFSET 0" and subsequently a "LIMIT 50 OFFSET 50", it may get different results because the underlying data has changed? I think your idea of a resumption token implies that you'd like a stable result set, but I wanted to be sure. Keeping state around is probably a large overhead for the server.

> Without the ability to specify LIMITS/OFFSETS with the triplestore
> query, we would need to stream everything to disk and manage much more
> state within our application.

I believe that you are currently using Kowari as your triplestore; is the LIMIT and OFFSET functionality offered by Kowari what you envisage here? Does Kowari push enough of the details off onto the server, or are you still managing too much state on the client?

Also, does streaming (3.12) help your use case? Streaming would allow your client to process some of the results without having to receive all of them first, giving you the effect of working over the entire result set without holding it all in memory at once. It gives a stable result set, and queries are not reissued.
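To make the client-driven (LIMIT/OFFSET) option a bit more concrete, here is a rough Python sketch of the paging loop I have in mind. It is illustrative only: run_query() below slices a fake in-memory result set the way LIMIT/OFFSET would, standing in for whatever call your client would actually make against the store.

    PAGE_SIZE = 50

    # Fake store: 230 rows standing in for query results, so the sketch
    # runs on its own without a real triplestore behind it.
    FAKE_RESULTS = [("record-%03d" % i, "2004-08-31") for i in range(230)]

    def run_query(limit, offset):
        """Stand-in for issuing 'LIMIT <limit> OFFSET <offset>' to the store;
        returns one page of rows, or an empty list when the results run out."""
        return FAKE_RESULTS[offset:offset + limit]

    def fetch_pages():
        offset = 0
        while True:
            # Each page is an independent query; the server keeps no state
            # between requests. If the underlying data changes, consecutive
            # pages may overlap or skip rows; that is the "stable result set"
            # question above. Sorting the results at least makes the paging
            # deterministic against an unchanged store.
            page = run_query(PAGE_SIZE, offset)
            if not page:
                break
            yield page               # hand back one page at a time, so the
            offset += PAGE_SIZE      # client's memory use stays bounded

    for page in fetch_pages():
        print("got %d rows starting at %s" % (len(page), page[0][0]))

A resumption token, by contrast, effectively keeps that offset (or a cursor over a materialised result set) on the server side, which is where the state-keeping overhead I mentioned comes in.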
The DAWG charter makes mention of cursors on a result set:

http://www.w3.org/2003/12/swa/dawg-charter#protocol

I'd be interested in hearing your take on this, and whether this applies to your situation.

I think more generally, the WG would like more information on your use case, including anything important that you feel I may have missed in my questions above.

Thanks Chris, keep the comments coming!

Cheers,
Tom

On 06/07/2004, at 1:25 PM, Chris Wilper wrote:

> Hi,
>
> Looking at the Requirements/Use Cases document, I noticed that 3.10 and 3.10a
> had "Pending" status. We[1] plan on using an rdf triplestore to back a large
> metadata repository, exposed to other systems via the OAI-PMH[2]. While not
> being too domain and protocol-specific here, I'll describe our case:
>
> We have a large collection of metadata in a triplestore that we want to
> make available to people through a set of queries. Someone typically asks,
> "Give me the metadata that has changed since last week and is in XYZ
> collection", or simply, "Give me all the metadata".
>
> It is a requirement for us that the responses can come in chunks: XML is
> sent over the wire, and rather than require all of our clients (and our
> server) to be able to handle arbitrarily large chunks of xml in one stream,
> our server can be configured to give only, say 1,000 responses, along with a
> "resumption token" that can be used for subsequent requests.
>
> Without the ability to specify LIMITS/OFFSETS with the triplestore query, we
> would need to stream everything to disk and manage much more state within
> our application.
>
> [1] http://www.fedora.info/ and http://www.nsdl.org/
> [2] OAI-PMH is a protocol for exposing xml metadata in a repository.
>     See http://www.openarchives.org/OAI/openarchivesprotocol.html
>
> ___________________________________________
> Chris Wilper
> Cornell Digital Library Research Group
> http://www.cs.cornell.edu/~cwilper/
>

--
Tom Adams                 | Tucana Technologies, Inc.
Support Engineer          | Office: +1 703 871 5312
tom@tucanatech.com        | Cell:   +1 571 594 0847
http://www.tucanatech.com | Fax:    +1 877 290 6687
------------------------------------------------------
Received on Tuesday, 7 September 2004 14:45:18 UTC