
Re: Can one within a for-each loop wrap, output, sink a set of files and continue processing with remaining files?

From: Alex Muir <alex.g.muir@gmail.com>
Date: Fri, 4 Jun 2010 12:24:16 +0000
Message-ID: <AANLkTimiKxhnT6bFNp786PIyoYwjWU0K_QbAYFW6JvE2@mail.gmail.com>
To: Romain Deltour <rdeltour@gmail.com>
Cc: xproc-dev@w3.org
Thanks, looks good!

I admit the recursive solution was giving me pause to implement.
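
For reference, a minimal, untested sketch of the path-based variant outlined in the quoted reply below might look like this. It assumes the same completed/XML/ folder as my original pipeline. The step name, the wrapper name `pack`, and the grouping expression are illustrative choices of mine, not something taken from Romain's mail, and the ExtractContent/CreateHTML/store steps are left out. It relies on position() inside group-adjacent being the document's position in the input sequence, which is how I read the XProc 1.0 description of p:wrap-sequence.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:c="http://www.w3.org/ns/xproc-step"
    name="groupPathsThenLoad" version="1.0">

    <p:output port="result" sequence="true"/>

    <!-- List the source folder; this yields one c:directory document -->
    <p:directory-list>
        <p:with-option name="path" select="'completed/XML/'">
            <p:empty/>
        </p:with-option>
    </p:directory-list>

    <!-- First loop: one small c:file document per file, nothing is loaded yet -->
    <p:for-each>
        <p:iteration-source select="/c:directory/c:file"/>
        <p:identity/>
    </p:for-each>

    <!-- Pack the c:file documents into adjacent groups of 200 -->
    <p:wrap-sequence wrapper="pack"
        group-adjacent="floor((position() - 1) div 200)"/>

    <!-- Second loop: one iteration per pack -->
    <p:for-each>
        <!-- Load only the files that belong to the current pack -->
        <p:for-each>
            <p:iteration-source select="/pack/c:file"/>
            <p:load>
                <p:with-option name="href"
                    select="concat('completed/XML/', /c:file/@name)"/>
            </p:load>
        </p:for-each>

        <!-- Wrap the loaded documents of this pack into a single document;
             the XSLT and p:store steps would follow here -->
        <p:wrap-sequence wrapper="WRAP"/>
    </p:for-each>

</p:declare-step>
```

Whether this actually lowers peak memory depends on how the processor schedules and releases documents between iterations, so it is worth measuring rather than assuming.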

On Fri, Jun 4, 2010 at 11:47 AM, Romain Deltour <rdeltour@gmail.com> wrote:

> Would the solution, to have to read all input files in before processing
> the first set, be poor in terms of memory use?
>
>
> You can improve the pipeline depending on which step is the most
> resource-intensive. If you want to reduce the number of XML documents parsed
> in memory, an alternative is to work on the sequence of file paths returned
> by p:directory-list rather than on the sequence of documents. In other words,
> you would move the resource-intensive p:load from the first p:for-each to
> the second:
>
> p:for-each => to create a sequence of 1000 paths from the flat list returned
> by p:directory-list
> (note the result of this first p:for-each is a sequence of 1000 small path
> documents, not loaded files)
> p:wrap-sequence[@group-adjacent] => split the sequence of 1000 into 200-sets
> p:for-each => another iteration over the 5 packs of 200 files, to process
> each pack at a time, loading each document and then processing it
>
> Vojtech's idea of using recursion sounds good too.
>
> Romain.
>
> Le 4 juin 10 à 11:27, Alex Muir a écrit :
>
> Hi Romain,
>
> Your solution looks like a good one and you're not missing any points.
>
> Would the solution, to have to read all input files in before processing
> the first set, be poor in terms of memory use?
>
> Is there no way to read in the first 200 and process them, then read in the
> second 200 and process those, and so on?
>
> Thanks
> Alex
>
> On Thu, Jun 3, 2010 at 6:43 PM, Romain Deltour <rdeltour@gmail.com> wrote:
>
>> Hi Alex,
>>
>> If I understand your intent and your pipeline correctly, you should
>> instead use the @group-adjacent attribute of the p:wrap-sequence step to
>> pack 200 files at a time.
>>
>> Explanation:
>> In your pipeline, almost everything happens in one big p:for-each that
>> iterates over the 1000 files. The p:choose subpipeline is executed only
>> for every 200th file, and the wrapper's input is a sequence containing only
>> that one file (the one whose position is a multiple of 200).
>> So rather than grouping files into sets of 200, you ignore 199 files
>> and wrap only the 200th in an element before processing it.
>>
>> What I would do is:
>>
>> p:for-each => to iterate through the 1000 files and load the documents
>> (note the result of this first p:for-each is a sequence of 1000 documents)
>> p:wrap-sequence[@group-adjacent] => split the sequence of 1000 into
>> 200-sets
>> p:for-each => another iteration over the 5 packs of 200 files, to process
>> each pack at a time
>>
>> I hope this helps and I'm not missing your point...
>>
>> BR
>> Romain.
>>
>> Le 3 juin 10 à 18:32, Alex Muir a écrit :
>>
>> Hi,
>>
>> I'm trying to read ~10000 files within a for-each loop, wrap a selection
>> from each set of 200 files and process them to output 1 html file, sink the
>> processed files and continue with the remaining files processing 200 at a
>> time.
>>
>> Is that possible in xproc?
>>
>> I've got something like the following which I can't get to work. I think
>> that wrapper cannot be used within a for-each, is that the case?
>>
>> <p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
>>     xmlns:c="http://www.w3.org/ns/xproc-step"
>>     xmlns:cx="http://xmlcalabash.com/ns/extensions"
>>     name="wrapWithinForEach" version="1.0">
>>
>>     <p:input port="source">
>>         <p:inline>
>>             <xml/>
>>         </p:inline>
>>     </p:input>
>>
>>     <p:output port="result" sequence="true"/>
>>
>>     <p:declare-step type="cx:message" version="1.0">
>>         <p:input port="source"/>
>>         <p:output port="result"/>
>>         <p:option name="message" required="true"/>
>>     </p:declare-step>
>>
>>
>>     <!-- ***** Starting and Ending File Numbers ***** -->
>>     <p:variable name="startingFileNumber" select="'1'"/>
>>     <p:variable name="endingFileNumber" select="'10000'"/>
>>     <p:variable name="numberPerFile" select="'200'"/>
>>
>>     <!-- source and output folder variables -->
>>     <p:variable name="source-folder" select="'completed/XML/'"/>
>>     <p:variable name="output-folder" select="'MDNA/'"/>
>>     <p:variable name="error-folder" select="'MDNA/error/'"/>
>>     <p:variable name="exception-folder" select="'MDNA/exception/'"/>
>>
>>
>>     <p:directory-list>
>>         <p:with-option name="path" select="$source-folder">
>>             <p:empty/>
>>         </p:with-option>
>>     </p:directory-list>
>>
>>
>>     <p:for-each name="MDNA">
>>
>>
>>         <p:iteration-source
>>             select="//c:file[position() ge number($startingFileNumber) and
>> position() le number($endingFileNumber)]"/>
>>
>>         <p:variable name="fileName" select="c:file/@name"/>
>>         <p:variable name="startingIterationPosition"
>>             select="number(p:iteration-position()) +
>> number($startingFileNumber)-1"/>
>>
>>        <cx:message>
>>             <p:with-option name="message"
>>                 select="concat('-----------------------------',
>> 'Iteration-position:','  ', $startingIterationPosition, '  File: ',
>> $fileName,'-----------------------------')"
>>             />
>>         </cx:message>
>>
>>         <p:load>
>>             <p:with-option name="href"
>> select="concat($source-folder,$fileName)"/>
>>         </p:load>
>>
>>         <cx:message>
>>             <p:with-option name="message" select="'###### ExtractContent'"/>
>>         </cx:message>
>>         <p:xslt name="ExtractContent">
>>             <p:input port="source"/>
>>             <p:input port="stylesheet">
>>                 <p:document href="ExtractContent.xsl"/>
>>             </p:input>
>>             <p:input port="parameters">
>>                 <p:empty/>
>>             </p:input>
>>         </p:xslt>
>>
>>         <p:identity name="wrap"/>
>>
>>
>>         <p:choose>
>>             <p:when test="position() mod $numberPerFile eq 0">
>>                 <p:wrap-sequence wrapper="WRAP" name="wrapper">
>>                     <p:input port="source">
>>                         <p:pipe port="result" step="wrap"/>
>>                     </p:input>
>>                 </p:wrap-sequence>
>>
>>
>>                 <p:xslt name="CreateHTML">
>>                     <p:input port="source"/>
>>                     <p:input port="stylesheet">
>>                         <p:document href="CreateHTML.xsl"/>
>>                     </p:input>
>>                     <p:input port="parameters">
>>                         <p:empty/>
>>                     </p:input>
>>                 </p:xslt>
>>
>>
>>                 <p:identity name="out_file"/>
>>
>>                 <p:store name="OUT">
>>                     <p:with-option name="href"
>>                         select="concat($output-folder,
>> 'MDNASections','-',$startingFileNumber,'-' ,$endingFileNumber,'.html')">
>>                         <p:pipe step="out_file" port="result"/>
>>                     </p:with-option>
>>                 </p:store>
>>
>>                 <p:sink name="sinkIt"/>
>>
>>             </p:when>
>>         </p:choose>
>>
>>     </p:for-each>
>>
>>
>> </p:declare-step>
>>
>>
>>
>>
>> Regards
>>
>>
>> --
>> Alex
>>
>> An informal recording with one mic under a tree leads to some pretty sweet
>> acoustic sounds.
>> https://sites.google.com/site/greigconteh/albums/diabarte-and-sons
>>
>>
>>
>
>


-- 
Alex

An informal recording with one mic under a tree leads to some pretty sweet
acoustic sounds.
https://sites.google.com/site/greigconteh/albums/diabarte-and-sons
Received on Friday, 4 June 2010 12:24:49 GMT
