- From: Alex Muir <alex.g.muir@gmail.com>
- Date: Wed, 14 Nov 2012 10:22:24 -0500
- To: Conal Tuohy <conal.tuohy@versi.edu.au>
- Cc: XProc Dev <xproc-dev@w3.org>
- Message-ID: <CAFtPEJbWXHt76HKHCmPip+Fazx0++dsqJNVZsWmYYJPzKsqg6w@mail.gmail.com>
Hi Conal,

Perhaps you could split the date range up into really small chunks and
create a script that uses GNU Parallel to run multiple HTTP queries in
parallel (a rough sketch follows at the end of this message). That would
at least resolve the memory issues, and perhaps speed up the retrieval of
the data, although I'm not sure how it would fit in with the
transformations being done on the data. I suppose you could toss the data
chunks into an XML database for further processing.

Calabash does have a memory issue. You can find more information in the
mailing list archives and in my last post.

Regards
Alex

On Sun, Nov 11, 2012 at 11:17 PM, Conal Tuohy <conal.tuohy@versi.edu.au> wrote:

> I've had a conceptual problem with an XProc pipeline I've written to
> perform OAI-PMH harvesting.
>
> For those who don't know, OAI-PMH is an HTTP-based protocol for
> publishing XML metadata records from digital library systems. Each
> request can specify a date range (so as to retrieve only records updated
> since a particular date), and the response contains a number (server
> defined, but typically tens or hundreds) of XML records, wrapped in some
> OAI-PMH XML. If the HTTP response would be "too large" (as defined by
> the server), the server returns just an initial page of records, along
> with a "resumption token" which allows the query to be resumed and
> another set of records retrieved (potentially also with a resumption
> token). In this way a large number of XML records can be transferred in
> batches of a reasonable size.
>
> In my OAI-PMH implementation, I aimed to encapsulate the repeated
> querying within a (recursive) step, which simply produces a sequence of
> XML records (there might be tens of thousands of them). Then I have
> subsequent steps to transform the documents in that sequence, save them,
> etc. The "OAI-PMH harvest" step is a great abstraction to have.
>
> This all works rather nicely with small datasets, but with larger
> datasets the memory consumption is atrocious. The entire sequence seems
> to be buffered, such that large datasets can't actually be harvested.
>
> Is this actually a reasonable approach? Is this just a limitation of
> Calabash? Might another XProc processor handle it OK? Is there some way
> I could work around it without giving up the advantages that come with
> sequences?
>
> At the moment I'm restructuring the code so that the transformation and
> storage steps are performed within the harvesting step (which now no
> longer needs to have any outputs). Initial tests look promising, but it
> certainly lacks clarity and flexibility.
>
> Conal
>
> --
> Conal Tuohy
> eResearch Business Analyst
> Victorian eResearch Strategic Initiative
> +61-466324297

--
Alex G. Muir
Software Engineering Consultant
Linkedin Profile: http://ca.linkedin.com/pub/alex-muir/36/ab7/125

Love African Kora Music? Take a moment to listen to Gambia's
Amadu Diabarte & Jali Bakary Konteh
www.bafila.bandcamp.com
Your support keeps Africa's griot tradition alive... Cheers!
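A minimal sketch of the chunked, parallel harvest suggested above.
Everything here is illustrative and untested: harvest.sh, the repository
endpoint, and date-ranges.txt are hypothetical stand-ins, and the
resumption-token handling assumes tokens that need no extra URL-encoding.

    #!/bin/sh
    # harvest.sh (hypothetical): fetch one OAI-PMH date slice, following
    # resumption tokens until the server reports no more pages.
    # Usage: harvest.sh 2012-01-01 2012-01-31
    BASE='http://example.org/oai'   # assumed repository endpoint
    URL="$BASE?verb=ListRecords&metadataPrefix=oai_dc&from=$1&until=$2"
    N=0
    while [ -n "$URL" ]; do
        PAGE="slice-$1-page-$N.xml"
        curl -s "$URL" > "$PAGE"
        # Pull out the resumptionToken, if any; an empty result means
        # this was the last page of the slice.
        TOKEN=$(xmllint --xpath \
            'string(//*[local-name()="resumptionToken"])' "$PAGE")
        if [ -n "$TOKEN" ]; then
            URL="$BASE?verb=ListRecords&resumptionToken=$TOKEN"
        else
            URL=""
        fi
        N=$((N + 1))
    done

GNU Parallel then drives one harvest per date slice, a few at a time:

    # date-ranges.txt holds lines like: 2012-01-01 2012-01-31
    # --colsep splits each line into {1} (from) and {2} (until);
    # -j4 keeps four slices in flight at once.
    parallel -j4 --colsep ' ' ./harvest.sh {1} {2} :::: date-ranges.txt

Because each slice runs in its own short-lived process and writes its
pages straight to disk, no single run ever has to buffer the whole
result sequence, which is what sidesteps the memory problem.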
Received on Wednesday, 14 November 2012 15:22:51 UTC