Re: Can one within a for-each loop wrap, output, sink a set of files and continue processing with remaining files?

From: Romain Deltour <rdeltour@gmail.com>
Date: Fri, 4 Jun 2010 22:24:44 +0200
Message-Id: <148ADAAE-0F4B-4408-853E-11C0F16437C5@gmail.com>
To: xproc-dev@w3.org
Hi again,

I tried a pure XProc equivalent of your XSLT (a little practice never  
hurts ;-), and here's the result (it's shorter too):

  <p:for-each>
     <p:iteration-source select="//c:file"/>
     <p:identity/>
  </p:for-each>
<p:wrap-sequence wrapper="c:group" group-adjacent="xs:integer((position()-1) div 2)"/>
<p:wrap-sequence wrapper="c:files"/>

1. The p:for-each splits the flat list into a sequence of c:file
documents.
2. The first p:wrap-sequence creates a sequence of 2-packs using the
group-adjacent feature.
3. The last p:wrap-sequence wraps the sequence of 2-packs in a single
document.
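
For anyone who wants to try it, here is a minimal, untested sketch of a
complete pipeline around the three steps above. The "in/" path is borrowed
from Alex's pipeline quoted below, the step name "group-files" is only a
placeholder, and the xs prefix has to be declared for the xs:integer cast
(which also assumes an XPath 2.0 processor such as Calabash).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:c="http://www.w3.org/ns/xproc-step"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    name="group-files" version="1.0">

    <!-- A single c:files document holding the c:group wrappers -->
    <p:output port="result"/>

    <!-- List the directory; the result is one c:directory document -->
    <p:directory-list path="in/"/>

    <!-- Split the listing into a sequence of c:file documents -->
    <p:for-each>
        <p:iteration-source select="//c:file"/>
        <p:identity/>
    </p:for-each>

    <!-- Grouping key: positions 1 and 2 give 0, positions 3 and 4 give 1, ... -->
    <p:wrap-sequence wrapper="c:group"
        group-adjacent="xs:integer((position()-1) div 2)"/>

    <!-- Wrap all the c:group documents into one c:files document -->
    <p:wrap-sequence wrapper="c:files"/>

</p:declare-step>
```

With six files in the directory this should give three c:group elements of
two c:file elements each. To change the pack size, replace the literal 2 in
the group-adjacent expression.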

Romain.

PS: there seems to be a bug in Calabash, which doesn't allow using
variables in the @group-adjacent expression.
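
A possible workaround, as an untested sketch: build the expression string
with p:with-option, so that the variable is expanded by the pipeline
processor before p:wrap-sequence ever evaluates the expression. The
variable name files-per-group is made up for the example and is assumed to
be declared earlier as a p:variable or p:option; I have not checked whether
this actually sidesteps the Calabash problem.

```xml
<p:wrap-sequence wrapper="c:group">
    <!-- $files-per-group is a hypothetical variable, e.g. with the value 2. -->
    <!-- concat() then yields the string: xs:integer((position()-1) div 2) -->
    <!-- The xs prefix must be in scope on this element. -->
    <p:with-option name="group-adjacent"
        select="concat('xs:integer((position()-1) div ', $files-per-group, ')')">
        <!-- No context document is needed; this also avoids binding the -->
        <!-- option to the document sequence on the default readable port. -->
        <p:empty/>
    </p:with-option>
</p:wrap-sequence>
```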


On 4 June 2010, at 18:25, Alex Muir wrote:

> Hi,
>
> Well, I ended up modifying the p:directory-list output with a p:xslt, given
> that I didn't know how to do it in XProc and it's easy in XSLT.
>
> So this XProc pipeline isn't processing the files yet (I'll get to that
> now), but it does the first step: it reads the directory list and groups n
> file names per group using the param filePerGroup. See the XProc, XSLT and
> output below.
>
> <?xml version="1.0" encoding="UTF-8"?>
> <p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
>     xmlns:cx="http://xmlcalabash.com/ns/extensions"
>     xmlns:c="http://www.w3.org/ns/xproc-step" name="chunk" version="1.0">
>
>     <p:input port="source">
>         <p:empty/>
>     </p:input>
>
>     <p:variable name="source-folder" select="'in/'"/>
>     <p:variable name="output-folder" select="'out/'"/>
>
>     <p:directory-list>
>         <p:with-option name="path" select="$source-folder">
>             <p:empty/>
>         </p:with-option>
>     </p:directory-list>
>
>     <p:xslt version="1.0" name="chunkFiles">
>         <p:input port="stylesheet">
>             <p:document href="chunkFiles.xsl"/>
>         </p:input>
>         <p:with-param name="filePerGroup" select="2"/>
>         <p:input port="parameters">
>             <p:empty/>
>         </p:input>
>     </p:xslt>
>
>
>     <p:store name="store">
>         <p:with-option name="href" select="concat($output-folder,'directory-list.xml')"/>
>     </p:store>
>
> </p:declare-step>
>
>
> XSLT FILE:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>     xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="#all"
>     xmlns:c="http://www.w3.org/ns/xproc-step" version="2.0">
>     <xsl:output method="xml" indent="yes"/>
>     <xsl:param name="filePerGroup"/>
>     <xsl:template match="/c:directory">
>         <xsl:variable name="directory" as="element()*">
>
>             <chunk/>
>             <xsl:for-each select="c:file">
>
>                 <file>
>                     <xsl:copy-of select="@*"/>
>                 </file>
>                 <xsl:if test="exists(following-sibling::c:file) and position() mod xs:integer($filePerGroup) eq 0">
>                     <chunk/>
>                 </xsl:if>
>
>             </xsl:for-each>
>
>         </xsl:variable>
>
>         <!--
>         <xsl:copy-of select="$directory"/>-->
>
>
>         <files>
>             <xsl:for-each-group select="$directory" group-starting-with="chunk">
>                 <group>
>                     <xsl:for-each select="current-group()">
>                         <xsl:if test="self::file">
>                             <file>
>                                 <xsl:apply-templates select="file|@*"/>
>                             </file>
>                         </xsl:if>
>                     </xsl:for-each>
>                 </group>
>             </xsl:for-each-group>
>         </files>
>
>     </xsl:template>
> </xsl:stylesheet>
>
> EXAMPLE OUTPUT
>
> <files>
>     <group>
>         <file>ONE.xml</file>
>         <file>TWO.xml</file>
>     </group>
>     <group>
>         <file>THREE.xml</file>
>         <file>FOUR.xml</file>
>     </group>
>     <group>
>         <file>FIVE.xml</file>
>         <file>SIX.xml</file>
>     </group>
> </files>
>
>
>
>
> On Fri, Jun 4, 2010 at 12:24 PM, Alex Muir <alex.g.muir@gmail.com>  
> wrote:
> Thanks, looks good!
>
> I admit the recursive solution was giving me pause to implement.
>
>
> On Fri, Jun 4, 2010 at 11:47 AM, Romain Deltour <rdeltour@gmail.com>  
> wrote:
>> Would the solution, to have to read all input files in before  
>> processing the first set, be poor in terms of memory use?
>
>
> You can improve the pipeline depending on which step is the most
> resource-intensive. If you want to reduce the number of XML documents
> parsed in memory, an alternative could be to work on the sequence of
> file paths returned by p:directory-list rather than on the
> sequence of documents. In other words, you would move the
> resource-intensive p:load from the first p:for-each to the second:
>
> p:for-each => to create a sequence of 1000 paths from the flat list
> returned by p:directory-list
> (note the result of this first p:for-each is a sequence of 1000
> c:file documents)
> p:wrap-sequence[@group-adjacent] => split the sequence of 1000 into
> sets of 200
> p:for-each => another iteration over the 5 packs of 200 files, to
> process one pack at a time, loading each document and then processing it
>
> Vojtech's idea of using recursion sounds good too.
>
> Romain.
>
> On 4 June 2010, at 11:27, Alex Muir wrote:
>
>> Hi Romain,
>>
>> Your solution looks like a good one and you're not missing any points.
>>
>> Would the solution, to have to read all input files in before  
>> processing the first set, be poor in terms of memory use?
>>
>> Is there no way to read in the first 200 and process them, then read
>> in the second 200 and process those, and so on?
>>
>> Thanks
>> Alex
>>
>> On Thu, Jun 3, 2010 at 6:43 PM, Romain Deltour <rdeltour@gmail.com>  
>> wrote:
>> Hi Alex,
>>
>> If I understand your intent and your pipeline correctly, you
>> should rather use the @group-adjacent attribute of the
>> p:wrap-sequence step to pack 200 files at a time.
>>
>> Explanation:
>> In your pipeline, almost everything happens in one big p:for-each
>> that iterates over the 1000 files. The p:choose subpipeline is
>> executed only for every 200th file, and the wrapper's input is a
>> sequence containing just that one file.
>> So rather than grouping files by sets of 200, you ignore 199
>> files and wrap only the 200th in an element before processing it.
>>
>> What I would do is:
>>
>> p:for-each => to iterate through the 1000 files and load the  
>> documents
>> (note the result of this first p:for-each is a sequence of 1000  
>> documents)
>> p:wrap-sequence[@group-adjacent] => split the sequence of 1000 into
>> sets of 200
>> p:for-each => another iteration over the 5 packs of 200 files, to
>> process one pack at a time
>>
>> I hope this helps and I'm not missing your point...
>>
>> BR
>> Romain.
>>
>> On 3 June 2010, at 18:32, Alex Muir wrote:
>>
>>> Hi,
>>>
>>> I'm trying to read ~10000 files within a for-each loop, wrap a  
>>> selection from each set of 200 files and process them to output 1  
>>> html file, sink the processed files and continue with the  
>>> remaining files processing 200 at a time.
>>>
>>> Is that possible in xproc?
>>>
>>> I've got something like the following which I can't get to work. I  
>>> think that wrapper cannot be used within a for-each, is that the  
>>> case?
>>>
>>> <p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
>>>     xmlns:c="http://www.w3.org/ns/xproc-step"
>>>     xmlns:cx="http://xmlcalabash.com/ns/extensions"
>>>     name="wrapWithinForEach" version="1.0">
>>>
>>>     <p:input port="source">
>>>         <p:inline>
>>>             <xml/>
>>>         </p:inline>
>>>     </p:input>
>>>
>>>     <p:output port="result" sequence="true"/>
>>>
>>>     <p:declare-step type="cx:message" version="1.0">
>>>         <p:input port="source"/>
>>>         <p:output port="result"/>
>>>         <p:option name="message" required="true"/>
>>>     </p:declare-step>
>>>
>>>
>>>     <!-- ***** Starting and Ending File Numbers ***** -->
>>>     <p:variable name="startingFileNumber" select="'1'"/>
>>>     <p:variable name="endingFileNumber" select="'10000'"/>
>>>     <p:variable name="numberPerFile" select="'200'"/>
>>>
>>>     <!-- source and output folder variables -->
>>>     <p:variable name="source-folder" select="'completed/XML/'"/>
>>>     <p:variable name="output-folder" select="'MDNA/'"/>
>>>     <p:variable name="error-folder" select="'MDNA/error/'"/>
>>>     <p:variable name="exception-folder" select="'MDNA/exception/'"/>
>>>
>>>
>>>     <p:directory-list>
>>>         <p:with-option name="path" select="$source-folder">
>>>             <p:empty/>
>>>         </p:with-option>
>>>     </p:directory-list>
>>>
>>>
>>>     <p:for-each name="MDNA">
>>>
>>>
>>>         <p:iteration-source
>>>             select="//c:file[position() ge number($startingFileNumber)
>>>                     and position() le number($endingFileNumber)]"/>
>>>
>>>         <p:variable name="fileName" select="c:file/@name"/>
>>>         <p:variable name="startingIterationPosition"
>>>             select="number(p:iteration-position()) + number($startingFileNumber)-1"/>
>>>
>>>        <cx:message>
>>>             <p:with-option name="message"
>>>                 select="concat('-----------------------------',
>>>                     'Iteration-position:', '  ', $startingIterationPosition,
>>>                     '  File: ', $fileName, '-----------------------------')"
>>>             />
>>>         </cx:message>
>>>
>>>         <p:load>
>>>             <p:with-option name="href" select="concat($source-folder,$fileName)"/>
>>>         </p:load>
>>>
>>>         <cx:message>
>>>             <p:with-option name="message" select="'######   ExtractContent'"/>
>>>         </cx:message>
>>>         <p:xslt name="ExtractContent">
>>>             <p:input port="source"/>
>>>             <p:input port="stylesheet">
>>>                 <p:document href="ExtractContent.xsl"/>
>>>             </p:input>
>>>             <p:input port="parameters">
>>>                 <p:empty/>
>>>             </p:input>
>>>         </p:xslt>
>>>
>>>         <p:identity name="wrap"/>
>>>
>>>
>>>         <p:choose>
>>>             <p:when test="position() mod $numberPerFile eq 0">
>>>                 <p:wrap-sequence wrapper="WRAP" name="wrapper">
>>>                     <p:input port="source">
>>>                         <p:pipe port="result" step="wrap"/>
>>>                     </p:input>
>>>                 </p:wrap-sequence>
>>>
>>>
>>>                 <p:xslt name="CreateHTML">
>>>                     <p:input port="source"/>
>>>                     <p:input port="stylesheet">
>>>                         <p:document href="CreateHTML.xsl"/>
>>>                     </p:input>
>>>                     <p:input port="parameters">
>>>                         <p:empty/>
>>>                     </p:input>
>>>                 </p:xslt>
>>>
>>>
>>>                 <p:identity name="out_file"/>
>>>
>>>                 <p:store name="OUT">
>>>                     <p:with-option name="href"
>>>                         select="concat($output-folder, 'MDNASections', '-',
>>>                             $startingFileNumber, '-', $endingFileNumber, '.html')">
>>>                         <p:pipe step="out_file" port="result"/>
>>>                     </p:with-option>
>>>                 </p:store>
>>>
>>>                 <p:sink name="sinkIt"/>
>>>
>>>             </p:when>
>>>         </p:choose>
>>>
>>>     </p:for-each>
>>>
>>>
>>> </p:declare-step>
>>>
>>>
>>>
>>>
>>> Regards
>>>
>>>
>>> -- 
>>> Alex
>>>
>>> An informal recording with one mic under a tree leads to some  
>>> pretty sweet acoustic sounds.
>>> https://sites.google.com/site/greigconteh/albums/diabarte-and-sons
>>
>>
>>
>>
>
>
>
>
>
>
>
Received on Friday, 4 June 2010 20:25:23 GMT