- From: Alex Muir <alex.g.muir@gmail.com>
- Date: Wed, 14 Nov 2012 10:22:24 -0500
- To: Conal Tuohy <conal.tuohy@versi.edu.au>
- Cc: XProc Dev <xproc-dev@w3.org>
- Message-ID: <CAFtPEJbWXHt76HKHCmPip+Fazx0++dsqJNVZsWmYYJPzKsqg6w@mail.gmail.com>
Hi Conal,

Perhaps you could split the date range up into really small chunks and
create a script that uses GNU Parallel to run multiple HTTP queries in
parallel (a rough sketch follows at the end of this message). That would
at least resolve the memory issues, and perhaps speed up the retrieval of
the data, although I'm not sure how it would fit in with the
transformations being done on the data. I suppose you could toss the data
chunks into an XML database for further processing.

Calabash does have a memory issue. You can find more information in the
mailing list archives and in my last post.

Regards
Alex

On Sun, Nov 11, 2012 at 11:17 PM, Conal Tuohy <conal.tuohy@versi.edu.au> wrote:

> I've had a conceptual problem with an XProc pipeline I've written to
> perform OAI-PMH harvesting.
>
> For those who don't know, OAI-PMH is an HTTP-based protocol for
> publishing XML metadata records from digital library systems. Each
> request can specify a date range (so as to retrieve only records updated
> since a particular date), and the response contains a number (server
> defined, but typically tens or hundreds) of XML records, wrapped in some
> OAI-PMH XML. If the HTTP response would be "too large" (as defined by
> the server), the server returns just an initial page of records, along
> with a "resumption token" which allows the query to be resumed and
> another set of records retrieved (potentially also with a resumption
> token). In this way a large number of XML records can be transferred in
> batches of a reasonable size.
>
> In my OAI-PMH implementation, I aimed to encapsulate the repeated
> querying within a (recursive) step, which simply produces a sequence of
> XML records (there might be tens of thousands of them). Then I have
> subsequent steps to transform the documents in that sequence, save them,
> etc. The "OAI-PMH harvest" step is a great abstraction to have.
>
> This all works rather nicely with small datasets, but with larger
> datasets the memory consumption is atrocious. The entire sequence seems
> to be buffered, such that large datasets can't actually be harvested.
>
> Is this actually a reasonable approach? Is this just a limitation of
> Calabash? Might another XProc processor handle it OK? Is there some way
> I could work around it without giving up the advantages that come with
> sequences?
>
> At the moment I'm restructuring the code so that the transformation and
> storage steps are performed within the harvesting step (which now no
> longer needs to have any outputs). Initial tests look promising, but it
> certainly lacks clarity and flexibility.
>
> Conal
>
> --
> Conal Tuohy
> eResearch Business Analyst
> Victorian eResearch Strategic Initiative
> +61-466324297

--
Alex G. Muir
Software Engineering Consultant
Linkedin Profile: http://ca.linkedin.com/pub/alex-muir/36/ab7/125

Love African Kora Music? Take a moment to listen to Gambia's
Amadu Diabarte & Jali Bakary Konteh
www.bafila.bandcamp.com
Your support keeps Africa's griot tradition alive... Cheers!
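A minimal sketch of the chunked, parallel harvest suggested above.
Everything here is illustrative and untested: harvest.sh, the repository
endpoint, and date-ranges.txt are hypothetical stand-ins, and the
resumption-token handling assumes tokens that need no extra URL-encoding.

    #!/bin/sh
    # harvest.sh (hypothetical): fetch one OAI-PMH date slice, following
    # resumption tokens until the server reports no more pages.
    # Usage: harvest.sh 2012-01-01 2012-01-31
    BASE='http://example.org/oai'   # assumed repository endpoint
    URL="$BASE?verb=ListRecords&metadataPrefix=oai_dc&from=$1&until=$2"
    N=0
    while [ -n "$URL" ]; do
        PAGE="slice-$1-page-$N.xml"
        curl -s "$URL" > "$PAGE"
        # Pull out the resumptionToken, if any; an empty result means
        # this was the last page of the slice.
        TOKEN=$(xmllint --xpath \
            'string(//*[local-name()="resumptionToken"])' "$PAGE")
        if [ -n "$TOKEN" ]; then
            URL="$BASE?verb=ListRecords&resumptionToken=$TOKEN"
        else
            URL=""
        fi
        N=$((N + 1))
    done

GNU Parallel then drives one harvest per date slice, a few at a time:

    # date-ranges.txt holds lines like: 2012-01-01 2012-01-31
    # --colsep splits each line into {1} (from) and {2} (until);
    # -j4 keeps four slices in flight at once.
    parallel -j4 --colsep ' ' ./harvest.sh {1} {2} :::: date-ranges.txt

Because each slice runs in its own short-lived process and writes its
pages straight to disk, no single run ever has to buffer the whole
result sequence, which is what sidesteps the memory problem.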
Received on Wednesday, 14 November 2012 15:22:51 UTC