- From: Conal Tuohy <conal.tuohy@versi.edu.au>
- Date: Mon, 12 Nov 2012 23:41:43 +1000
- To: xproc-dev@w3.org
Reading some more, I discovered this Q&A:
http://stackoverflow.com/questions/878591/xml-streaming-with-xproc

It looks like QuiXProc would be my best bet for a streaming solution.

On 12/11/12 14:17, Conal Tuohy wrote:
> I've had a conceptual problem with an XProc I've written to perform
> OAI-PMH harvesting.
>
> For those who don't know, OAI-PMH is an HTTP-based protocol for
> publishing XML metadata records from digital library systems. Each
> request can specify a date range (so as to retrieve only records
> updated since a particular date), and the response contains a number
> (server defined, but typically tens or hundreds) of XML records,
> wrapped in some OAI-PMH XML. If the HTTP response would be "too large"
> (as defined by the server), the server returns just an initial page of
> records, along with a "resumption token" which allows the query to be
> resumed and another set of records retrieved (potentially also with a
> resumption token). In this way a large number of XML records can be
> transferred in batches of a reasonable size.
>
> In my OAI-PMH implementation, I aimed to encapsulate the repeated
> querying within a (recursive) step, which simply produces a sequence
> of XML records (there might be tens of thousands of them). Then I
> have subsequent steps to transform the documents in that sequence,
> save them, etc. The "OAI-PMH harvest" step is a great abstraction to
> have.
>
> This all works rather nicely with small datasets, but with larger
> datasets the memory consumption is atrocious. The entire sequence
> seems to be buffered, such that large datasets can't actually be
> harvested.
>
> Is this actually a reasonable approach? Is this just a limitation of
> Calabash? Might another XProc processor handle it OK? Is there some
> way I could work around it without giving up the advantages that come
> with sequences?
>
> At the moment I'm restructuring the code so that the transformation
> and storage steps are performed within the harvesting step (which now
> no longer needs to have any outputs). Initial tests look promising,
> but it certainly lacks clarity and flexibility.
>
> Conal
>

--
Conal Tuohy
eResearch Business Analyst
Victorian eResearch Strategic Initiative
+61-466324297
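
For illustration only, here is a minimal Python sketch of the streaming behaviour the message is after: records are handed to the consumer as each page arrives, and the next page is requested only once the current one has been consumed. It is not XProc and not the pipeline described above. The names `harvest` and `fetch` are hypothetical; `fetch` is assumed to be any callable that takes the request parameters and returns the response XML as a string. The `ListRecords` verb, the `resumptionToken` element and the namespace come from the OAI-PMH 2.0 protocol.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterator, Optional

# Namespace of OAI-PMH 2.0 response elements, in ElementTree's {uri} form.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def harvest(fetch: Callable[[Dict[str, str]], str],
            metadata_prefix: str = "oai_dc",
            from_date: Optional[str] = None) -> Iterator[ET.Element]:
    """Lazily yield OAI-PMH <record> elements, one page at a time.

    `fetch` is a hypothetical callable: request parameters in, response
    XML (as a string) out. Only one page is held in memory at once.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    while True:
        root = ET.fromstring(fetch(params))
        # Hand this page's records to the consumer before fetching more.
        yield from root.iter(OAI_NS + "record")
        token = root.find(f".//{OAI_NS}resumptionToken")
        if token is None or not (token.text or "").strip():
            return  # no token, or an empty one, marks the final page
        # A resumption request carries only the verb and the token.
        params = {"verb": "ListRecords",
                  "resumptionToken": token.text.strip()}
```

A consumer would loop over `harvest(fetch)` and transform or save each record inside the loop, so the harvest keeps its role as a separate abstraction while memory use stays bounded by the page size rather than the size of the whole dataset.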
Received on Monday, 12 November 2012 13:42:17 UTC