Re: Memory problem with sequences in Calabash from Conal Tuohy on 2012-11-12 (xproc-dev@w3.org from November 2012)

From: Conal Tuohy <conal.tuohy@versi.edu.au>
Date: Mon, 12 Nov 2012 23:41:43 +1000
To: xproc-dev@w3.org
Message-ID: <50A0FC97.8020306@versi.edu.au>

Reading some more, I discovered this Q&A:
http://stackoverflow.com/questions/878591/xml-streaming-with-xproc

It looks like QuiXProc would be my best bet for a streaming solution.


On 12/11/12 14:17, Conal Tuohy wrote:
> I've run into a conceptual problem with an XProc pipeline I've written 
> to perform OAI-PMH harvesting.
>
> For those who don't know, OAI-PMH is an HTTP-based protocol for 
> publishing XML metadata records from digital library systems. Each 
> request can specify a date range (so as to retrieve only records 
> updated since a particular date), and the response contains a number 
> (server-defined, but typically tens or hundreds) of XML records, 
> wrapped in some OAI-PMH XML. If the HTTP response would be "too large" 
> (as defined by the server), the server returns just an initial page of 
> records, along with a "resumption token" which allows the query to be 
> resumed and another set of records retrieved (potentially also with a 
> resumption token). In this way a large number of XML records can be 
> transferred in batches of a reasonable size.
>
> In my OAI-PMH implementation, I aimed to encapsulate the repeated 
> querying within a (recursive) step, which simply produces a sequence 
> of XML records (there might be tens of thousands of them). Then I 
> have subsequent steps to transform the documents in that sequence, 
> save them, etc. The "OAI-PMH harvest" step is a great abstraction to 
> have.
>
> This all works rather nicely with small datasets, but with larger 
> datasets the memory consumption is atrocious. The entire sequence 
> seems to be buffered, such that large datasets can't actually be 
> harvested.
>
> Is this actually a reasonable approach? Is this just a limitation of 
> Calabash? Might another XProc processor handle it OK? Is there some 
> way I could work around it without giving up the advantages that come 
> with sequences?
>
> At the moment I'm restructuring the code so that the transformation 
> and storage steps are performed within the harvesting step (which now 
> no longer needs to have any outputs). Initial tests look promising, 
> but it certainly lacks clarity and flexibility.
>
> Conal
>
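
For illustration, the harvesting pattern described above can be sketched
outside XProc as a Python generator. This is only a sketch of the lazy
behaviour the "OAI-PMH harvest" step would ideally have, not a description
of how Calabash or QuiXProc work internally. The names harvest, fetch_page,
fake_fetch and PAGES are hypothetical, and fake_fetch is an in-memory
stand-in for a real HTTP request. The sketch assumes a ListRecords request
and follows resumption tokens until the server returns none (or an empty
one). Only one page of records is held in memory at a time.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterator

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def harvest(fetch_page: Callable[[Dict[str, str]], str],
            metadata_prefix: str = "oai_dc",
            from_date: str = "") -> Iterator[ET.Element]:
    """Yield OAI-PMH records one at a time, following resumption tokens.

    fetch_page is a hypothetical callable that takes the request
    parameters and returns the response body as XML text.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    while True:
        root = ET.fromstring(fetch_page(params))
        # Hand each record to the consumer before fetching the next page.
        for record in root.iter(OAI_NS + "record"):
            yield record
        token = root.find(".//" + OAI_NS + "resumptionToken")
        if token is None or not (token.text or "").strip():
            return  # no token, or an empty one: the list is complete
        # A resumed request carries only the verb and the token.
        params = {"verb": "ListRecords",
                  "resumptionToken": token.text.strip()}


# Hypothetical two-page repository, keyed by the token that requests it.
PAGES = {
    None: (["rec-1", "rec-2"], "page-2"),
    "page-2": (["rec-3"], ""),
}


def fake_fetch(params: Dict[str, str]) -> str:
    """In-memory stand-in for an HTTP GET against an OAI-PMH endpoint."""
    ids, token = PAGES[params.get("resumptionToken")]
    records = "".join(
        f"<record><header><identifier>{i}</identifier></header></record>"
        for i in ids
    )
    return (
        '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
        f"{records}<resumptionToken>{token}</resumptionToken>"
        "</ListRecords></OAI-PMH>"
    )
```

A consumer would simply iterate, e.g. "for record in harvest(fake_fetch):",
transforming and saving each record as it arrives. The second page is
requested only once the first has been consumed, so the harvest stays a
separate abstraction without the whole sequence being buffered.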


-- 
Conal Tuohy
eResearch Business Analyst
Victorian eResearch Strategic Initiative
+61-466324297

Received on Monday, 12 November 2012 13:42:17 UTC