- From: Conal Tuohy <conal.tuohy@versi.edu.au>
- Date: Mon, 12 Nov 2012 14:17:40 +1000
- To: XProc Dev <xproc-dev@w3.org>
I've had a conceptual problem with an XProc pipeline I've written to perform OAI-PMH harvesting.

For those who don't know, OAI-PMH is an HTTP-based protocol for publishing XML metadata records from digital library systems. Each request can specify a date range (so as to retrieve only records updated since a particular date), and the response contains a number (server-defined, but typically tens or hundreds) of XML records, wrapped in some OAI-PMH XML. If the HTTP response would be "too large" (as defined by the server), the server returns just an initial page of records, along with a "resumption token" which allows the query to be resumed and another set of records retrieved (potentially also with a resumption token). In this way a large number of XML records can be transferred in batches of a reasonable size.

In my OAI-PMH implementation, I aimed to encapsulate the repeated querying within a (recursive) step, which simply produces a sequence of XML records (there might be tens of thousands of them). Then I have subsequent steps to transform the documents in that sequence, save them, etc. The "OAI-PMH harvest" step is a great abstraction to have.

This all works rather nicely with small datasets, but with larger datasets the memory consumption is atrocious. The entire sequence seems to be buffered, so large datasets can't actually be harvested.

Is this actually a reasonable approach? Is this just a limitation of Calabash? Might another XProc processor handle it OK? Is there some way I could work around it without giving up the advantages that come with sequences?

At the moment I'm restructuring the code so that the transformation and storage steps are performed within the harvesting step (which no longer needs to have any outputs). Initial tests look promising, but the result certainly lacks clarity and flexibility.

Conal

--
Conal Tuohy
eResearch Business Analyst
Victorian eResearch Strategic Initiative
+61-466324297
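For readers unfamiliar with the paging behaviour described in the message, here is a minimal sketch in Python (not XProc, and not the author's pipeline) of a harvester that follows resumption tokens while yielding records one at a time, so that only one page is held in memory at once. The names `harvest`, `fake_fetch`, `PAGES` and `calls` are hypothetical, and the HTTP request is replaced by an in-memory stand-in; a real harvester would issue `ListRecords` requests to a repository instead.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Optional

# Namespace of the OAI-PMH 2.0 response envelope.
OAI = "{http://www.openarchives.org/OAI/2.0/}"


def harvest(fetch: Callable[[Optional[str]], str]) -> Iterator[ET.Element]:
    """Yield records lazily, requesting the next page only when needed.

    `fetch` is a hypothetical callable: given a resumption token (or None
    for the initial request) it returns the response body as a string.
    """
    token: Optional[str] = None
    while True:
        root = ET.fromstring(fetch(token))
        for record in root.iter(OAI + "record"):
            yield record
        token_el = root.find(f".//{OAI}resumptionToken")
        # A missing or empty resumptionToken marks the end of the list.
        token = (token_el.text or "").strip() if token_el is not None else ""
        if not token:
            return


def _page(ids, token=None):
    """Build a tiny ListRecords response for the demonstration."""
    records = "".join(
        f"<record><header><identifier>{i}</identifier></header></record>"
        for i in ids
    )
    resumption = f"<resumptionToken>{token}</resumptionToken>" if token else ""
    return (
        '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
        f"<ListRecords>{records}{resumption}</ListRecords></OAI-PMH>"
    )


# In-memory stand-in for a repository: two pages linked by token "t1".
PAGES = {None: _page(["a", "b"], token="t1"), "t1": _page(["c"])}
calls = []  # records which tokens were requested, to show the laziness


def fake_fetch(token):
    calls.append(token)
    return PAGES[token]


if __name__ == "__main__":
    for rec in harvest(fake_fetch):
        print(rec.find(f"{OAI}header/{OAI}identifier").text)
```

Because `harvest` is a generator, a downstream consumer can transform and store each record as it arrives, and the second page is not requested until the first has been consumed. Whether an XProc processor can stream a sequence between steps in a comparable way depends on the implementation.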
Received on Monday, 12 November 2012 04:18:13 UTC