RE: Memory problem with sequences in Calabash from Geert Josten on 2012-11-14 (xproc-dev@w3.org from November 2012)

From: Geert Josten <geert.josten@dayon.nl>
Date: Wed, 14 Nov 2012 15:14:12 +0100
To: Conal Tuohy <conal.tuohy@versi.edu.au>, xproc-dev@w3.org
Message-ID: <ea94d0aeda2b8f0678ea86a0e0d0e2fd@mail.gmail.com>

Hi Conal,

Sounds like an interesting piece of code. Do you by any chance intend to
share it with the community?

I guess the main problem is that you are trying to do everything in a
single process. You are gathering many results, which are probably not
written to disk along the way. It might already help if you could invoke
a Calabash subprocess that handles a single download and stores its result
on disk. It might be possible to mimic something like that with p:exec,
but that is not very nice.
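
To illustrate the idea outside XProc, here is a rough Python sketch of
harvesting one page at a time and writing each record to disk before the
next page is fetched, so that at most one response is held in memory. It
is not your pipeline and not anything Calabash provides. The names
harvest, store and http_fetch, the oai_dc default and the file naming are
assumptions made up for the example; only the ListRecords verb, the
resumptionToken element and the OAI-PMH 2.0 namespace come from the
protocol itself.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import urlopen

# OAI-PMH 2.0 namespace in ElementTree's {uri}localname notation.
OAI = "{http://www.openarchives.org/OAI/2.0/}"


def http_fetch(url):
    """Default fetcher (hypothetical helper): return the response body as bytes."""
    with urlopen(url) as response:
        return response.read()


def harvest(base_url, metadata_prefix="oai_dc", from_date=None, fetch=http_fetch):
    """Yield <record> elements one at a time, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    while True:
        # Only the current page is parsed and held in memory.
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        for record in root.iter(OAI + "record"):
            yield record
        token = root.find(".//" + OAI + "resumptionToken")
        if token is None or not (token.text or "").strip():
            return  # no token, or an empty one: the list is complete
        # A resumption request carries only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}


def store(records, out_dir):
    """Write each record to its own file as soon as it arrives; return the count."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for count, record in enumerate(records, start=1):
        target = out / f"record-{count:06d}.xml"
        ET.ElementTree(record).write(str(target), encoding="utf-8")
    return count
```

Because harvest is a generator, something like
store(harvest("http://example.org/oai"), "out") (a placeholder URL) pulls
the next page only after the previous records have been written. A
transformation could sit between the two calls in the same way, which
keeps the "harvest" abstraction separate from the storage step.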

QuiXProc might work as well, but I am not sure how well it covers the
standard, or whether it has any of the nice extensions Calabash provides,
in case you are using any.

Cheers,
Geert

> -----Original message-----
> From: Conal Tuohy [mailto:conal.tuohy@gmail.com] On behalf of Conal Tuohy
> Sent: Monday 12 November 2012 14:42
> To: xproc-dev@w3.org
> Subject: Re: Memory problem with sequences in Calabash
>
> Reading some more, I discovered this Q&A:
> http://stackoverflow.com/questions/878591/xml-streaming-with-xproc
>
> It looks like QuiXProc would be my best bet for a streaming solution.
>
>
> On 12/11/12 14:17, Conal Tuohy wrote:
> > I've had a conceptual problem with an XProc I've written to perform
> > OAI-PMH harvesting.
> >
> > For those who don't know, OAI-PMH is an HTTP-based protocol for
> > publishing XML metadata records from digital library systems. Each
> > request can specify a date range (so as to retrieve only records
> > updated since a particular date), and the response contains a number
> > (server defined, but typically tens or hundreds) of XML records,
> > wrapped in some OAI-PMH XML. If the HTTP response would be "too large"
> > (as defined by the server), the server returns just an initial page of
> > records, along with a "resumption token" which allows the query to be
> > resumed and another set of records retrieved (potentially also with a
> > resumption token). In this way a large number of XML records can be
> > transferred in batches of a reasonable size.
> >
> > In my OAI-PMH implementation, I aimed to encapsulate the repeated
> > querying within a (recursive) step, which simply produces a sequence
> > of XML records (there might be tens of thousands of them).  Then I
> > have subsequent steps to transform the documents in that sequence,
> > save them, etc. The "OAI-PMH harvest" step is a great abstraction to
> > have.
> >
> > This all works rather nicely with small datasets, but with larger
> > datasets the memory consumption is atrocious. The entire sequence
> > seems to be buffered, such that large datasets can't actually be
> > harvested.
> >
> > Is this actually a reasonable approach? Is this just a limitation of
> > Calabash? Might another XProc processor handle it OK? Is there some
> > way I could work around it without giving up the advantages that come
> > with sequences?
> >
> > At the moment I'm restructuring the code so that the transformation
> > and storage steps are performed within the harvesting step (which now
> > no longer needs to have any outputs). Initial tests look promising,
> > but it certainly lacks clarity and flexibility.
> >
> > Conal
> >
>
>
> --
> Conal Tuohy
> eResearch Business Analyst
> Victorian eResearch Strategic Initiative
> +61-466324297
>
Received on Wednesday, 14 November 2012 14:14:37 UTC