Memory problem with sequences in Calabash

I've run into a conceptual problem with an XProc pipeline I've written 
to perform OAI-PMH harvesting.

For those who don't know, OAI-PMH is an HTTP-based protocol for 
publishing XML metadata records from digital library systems. Each 
request can specify a date range (so as to retrieve only records updated 
since a particular date), and the response contains a number (server 
defined, but typically tens or hundreds) of XML records, wrapped in an 
OAI-PMH response envelope. If the HTTP response would be "too large" (as 
defined by 
the server), the server returns just an initial page of records, along 
with a "resumption token" which allows the query to be resumed and 
another set of records retrieved (potentially also with a resumption 
token). In this way a large number of XML records can be transferred in 
batches of a reasonable size.
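
To make the paging concrete, here is a minimal sketch of the resumption 
loop. It is in Python rather than XProc, purely as an illustration, and 
it is not my actual pipeline. The names harvest, fetch_page and 
make_fake_server are hypothetical, and the "server" is a stand-in that 
uses the next record offset as its token; a real harvester would issue 
HTTP requests and parse the token out of the response.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# A page is (records, resumption_token); the token is None on the last page.
Page = Tuple[List[str], Optional[str]]


def harvest(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[str]:
    """Yield records one at a time, following resumption tokens."""
    token: Optional[str] = None  # None means "initial request"
    while True:
        records, token = fetch_page(token)
        yield from records
        if not token:  # no token: the server has nothing more to send
            break


def make_fake_server(total: int, page_size: int) -> Callable[[Optional[str]], Page]:
    """Hypothetical stand-in for a server: serves `total` records in pages."""

    def fetch_page(token: Optional[str]) -> Page:
        start = int(token) if token else 0
        end = min(start + page_size, total)
        records = [f"<record id='{i}'/>" for i in range(start, end)]
        next_token = str(end) if end < total else None
        return records, next_token

    return fetch_page


if __name__ == "__main__":
    for record in harvest(make_fake_server(total=7, page_size=3)):
        print(record)
```

Because harvest is a generator, a consumer that handles each record as 
it arrives only ever needs one page in memory at a time.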

In my OAI-PMH harvester, I aimed to encapsulate the repeated 
querying within a (recursive) step, which simply produces a sequence of 
XML records (there might be tens of thousands of them).  Then I have 
subsequent steps to transform the documents in that sequence, save them, 
etc. The "OAI-PMH harvest" step is a great abstraction to have.

This all works rather nicely with small datasets, but with larger 
datasets the memory consumption is atrocious. The entire sequence seems 
to be buffered, such that large datasets can't actually be harvested.

Is this actually a reasonable approach? Is this just a limitation of 
Calabash? Might another XProc processor handle it OK? Is there some way 
I could work around it without giving up the advantages that come with 
sequences?

At the moment I'm restructuring the code so that the transformation and 
storage steps are performed within the harvesting step (which now no 
longer needs to have any outputs). Initial tests look promising, but the 
restructured pipeline certainly lacks the clarity and flexibility of the 
original design.
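
The shape of that restructuring, sketched in Python rather than XProc 
(an illustration only, not the real pipeline): the harvesting loop takes 
the per-record work as a parameter and returns nothing but a count. The 
names harvest_and_process, fake_fetch and FAKE_PAGES are hypothetical, 
and the canned pages stand in for real server responses.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A page is (records, resumption_token); the token is None on the last page.
Page = Tuple[List[str], Optional[str]]


def harvest_and_process(
    fetch_page: Callable[[Optional[str]], Page],
    process: Callable[[str], None],
) -> int:
    """Fetch pages until no token remains, processing each record at once."""
    token: Optional[str] = None
    count = 0
    while True:
        records, token = fetch_page(token)
        for record in records:
            process(record)  # transform and store here; nothing is accumulated
            count += 1
        if not token:
            return count


# Hypothetical canned responses keyed by token (None = first request).
FAKE_PAGES: Dict[Optional[str], Page] = {
    None: (["<r>a</r>", "<r>b</r>"], "t1"),
    "t1": (["<r>c</r>"], None),
}


def fake_fetch(token: Optional[str]) -> Page:
    return FAKE_PAGES[token]


if __name__ == "__main__":
    total = harvest_and_process(fake_fetch, print)
    print(f"processed {total} records")
```

The trade-off is the one described above: memory stays bounded by the 
page size, but the harvesting step is now coupled to whatever processing 
is passed into it.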

Conal

-- 
Conal Tuohy
eResearch Business Analyst
Victorian eResearch Strategic Initiative
+61-466324297

Received on Monday, 12 November 2012 04:18:13 UTC