Re: Can one within a for-each loop wrap, output, sink a set of files and continue processing with remaining files?

From: Alex Muir <alex.g.muir@gmail.com>
Date: Sat, 5 Jun 2010 12:07:28 +0000
Message-ID: <AANLkTil33uJd9xHXKA9WjcL1xL006xlgHiPDYcHq7bzl@mail.gmail.com>
To: Romain Deltour <rdeltour@gmail.com>
Cc: xproc-dev@w3.org
Hi,

Looks good.

I had to add starting and ending file numbers to choose which range of files
to process, as shown in the modified XSLT below.

I have also pasted an XProc sample that works with the output: it loads a
group of files, wraps and saves them, then loads the next group, in case
someone needs that.

Thanks much for the help
Alex


<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="#all"
    xmlns:c="http://www.w3.org/ns/xproc-step" version="2.0">
    <xsl:output method="xml" indent="yes"/>
    <xsl:param name="filePerGroup"/>
    <xsl:param name="startingFileNumber"/>
    <xsl:param name="endingFileNumber"/>
    <xsl:template match="/c:directory">
        <xsl:variable name="directory" as="element()*">

            <chunk/>
            <xsl:for-each select="c:file[position() ge
xs:integer($startingFileNumber) and position() le
xs:integer($endingFileNumber)]">

                <file>
                    <xsl:copy-of select="@*"/>
                </file>
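                <!-- Start a new chunk after every $filePerGroup files, but not
                     after the last selected file; otherwise a range that ends
                     before the end of the directory yields an empty group. -->
                <xsl:if test="position() ne last() and
position() mod xs:integer($filePerGroup) eq 0">
                    <chunk/>
                </xsl:if>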

            </xsl:for-each>

        </xsl:variable>


        <files>
            <xsl:for-each-group select="$directory"
group-starting-with="chunk">
                <group index="{position()}">
                    <xsl:for-each select="current-group()">
                        <xsl:if test="self::file">
                            <file>
                                <xsl:apply-templates select="file|@*"/>
                            </file>
                        </xsl:if>
                    </xsl:for-each>
                </group>
            </xsl:for-each-group>
        </files>

    </xsl:template>
</xsl:stylesheet>
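
For comparison, here is a minimal sketch of the same chunking written with
group-adjacent instead of the intermediate <chunk/> markers. It is an untested
alternative, not the stylesheet I actually ran. It assumes the same three
parameters and that each c:file carries its file name in @name.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:c="http://www.w3.org/ns/xproc-step"
    exclude-result-prefixes="#all" version="2.0">
    <xsl:output method="xml" indent="yes"/>
    <!-- Same parameter names as the stylesheet above. -->
    <xsl:param name="filePerGroup"/>
    <xsl:param name="startingFileNumber"/>
    <xsl:param name="endingFileNumber"/>
    <xsl:template match="/c:directory">
        <files>
            <!-- Within the selected range, files 1..n get key 0,
                 files n+1..2n get key 1, and so on. -->
            <xsl:for-each-group
                select="c:file[position() ge xs:integer($startingFileNumber)
                        and position() le xs:integer($endingFileNumber)]"
                group-adjacent="(position() - 1) idiv xs:integer($filePerGroup)">
                <group index="{position()}">
                    <xsl:for-each select="current-group()">
                        <!-- Assumption: the file name is in @name. -->
                        <file><xsl:value-of select="@name"/></file>
                    </xsl:for-each>
                </group>
            </xsl:for-each-group>
        </files>
    </xsl:template>
</xsl:stylesheet>
```

Because the grouping key is computed from the position within the selected
range, the last group is simply shorter when the range does not divide evenly.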

XML


XPROC

  <p:for-each name="groupedFiles">

        <p:iteration-source select="//group"/>
        <p:variable name="index" select="group/@index"/>


        <p:for-each name="group">
            <p:iteration-source select="//file"/>
            <p:variable name="filename" select="."/>

            <p:load>
                <p:with-option name="href"
select="concat($source-folder,$filename)"/>
            </p:load>

        </p:for-each>

        <p:wrap-sequence wrapper="group" name="grouped"/>

        <p:store name="store">
            <p:with-option name="href"
select="concat($output-folder,$index,'.xml')"/>
        </p:store>

    </p:for-each>
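
The fragment above reads the grouped list produced by the stylesheet. Here is
a sketch of how the p:xslt step from the earlier pipeline might pass the two
new parameters. The range values 1 and 6 are placeholders; the stylesheet file
name chunkFiles.xsl and the group size 2 are taken from the earlier pipeline.
It is a fragment meant to sit inside the p:declare-step, after
p:directory-list.

```xml
<p:xslt name="chunkFiles">
    <p:input port="stylesheet">
        <p:document href="chunkFiles.xsl"/>
    </p:input>
    <p:input port="parameters">
        <p:empty/>
    </p:input>
    <!-- Group size, as in the earlier pipeline. -->
    <p:with-param name="filePerGroup" select="2"/>
    <!-- Placeholder range: files 1 to 6 of the directory listing. -->
    <p:with-param name="startingFileNumber" select="1"/>
    <p:with-param name="endingFileNumber" select="6"/>
</p:xslt>
```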



On Fri, Jun 4, 2010 at 8:24 PM, Romain Deltour <rdeltour@gmail.com> wrote:

> Hi again,
>
> I tried a pure XProc equivalent of your XSLT (a little practice never hurts
> ;-), and here's the result (it's shorter too):
>
>  <p:for-each>
>     <p:iteration-source select="//c:file"/>
>     <p:identity/>
>  </p:for-each>
> <p:wrap-sequence wrapper="c:group"
> group-adjacent="xs:integer((position()-1) div 2)"/>
> <p:wrap-sequence wrapper="c:files"/>
>
> 1. The first p:for-each splits the flat list into a sequence of c:file
> documents.
> 2. The first p:wrap-sequence creates a sequence of 2-packs using the
> group-adjacent feature.
> 3. The last p:wrap-sequence wraps the sequence of 2-packs in a single
> document.
>
> Romain.
>
> PS: there seems to be a bug in Calabash, which doesn't allow using
> variables in the @group-adjacent expression
>
>
> On 4 June 2010 at 18:25, Alex Muir wrote:
>
> Hi,
>
> Well, I ended up modifying the p:directory-list output with a p:xslt, since
> I didn't know how to do it in XProc and it's easy that way.
>
> So this XProc isn't processing the files yet (I will get to that now), but it
> does the first step: it reads the directory list and groups n file names per
> group using the param filePerGroup. See the XProc, XSLT and output below.
>
> <?xml version="1.0" encoding="UTF-8"?>
> <p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
>     xmlns:cx="http://xmlcalabash.com/ns/extensions"
>     xmlns:c="http://www.w3.org/ns/xproc-step" name="chunk" version="1.0">
>
>     <p:input port="source">
>         <p:empty/>
>     </p:input>
>
>     <p:variable name="source-folder" select="'in/'"/>
>     <p:variable name="output-folder" select="'out/'"/>
>
>     <p:directory-list>
>         <p:with-option name="path" select="$source-folder">
>             <p:empty/>
>         </p:with-option>
>     </p:directory-list>
>
>     <p:xslt version="1.0" name="chunkFiles">
>         <p:input port="stylesheet">
>             <p:document href="chunkFiles.xsl"/>
>         </p:input>
>         <p:with-param name="filePerGroup" select="2"/>
>         <p:input port="parameters">
>             <p:empty/>
>         </p:input>
>     </p:xslt>
>
>
>     <p:store name="store">
>         <p:with-option name="href"
> select="concat($output-folder,'directory-list.xml')"/>
>     </p:store>
>
> </p:declare-step>
>
>
> XSLT FILE:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>     xmlns:xs="http://www.w3.org/2001/XMLSchema"
> exclude-result-prefixes="#all"
>     xmlns:c="http://www.w3.org/ns/xproc-step" version="2.0">
>     <xsl:output method="xml" indent="yes"/>
>     <xsl:param name="filePerGroup"/>
>     <xsl:template match="/c:directory">
>         <xsl:variable name="directory" as="element()*">
>
>             <chunk/>
>             <xsl:for-each select="c:file">
>
>                 <file>
>                     <xsl:copy-of select="@*"/>
>                 </file>
>                 <xsl:if test="exists(following-sibling::c:file) and
> position() mod xs:integer($filePerGroup) eq 0">
>                     <chunk/>
>                 </xsl:if>
>
>             </xsl:for-each>
>
>         </xsl:variable>
>
>         <!--
>         <xsl:copy-of select="$directory"/>-->
>
>
>         <files>
>             <xsl:for-each-group select="$directory"
> group-starting-with="chunk">
>                 <group>
>                     <xsl:for-each select="current-group()">
>                         <xsl:if test="self::file">
>                             <file>
>                                 <xsl:apply-templates select="file|@*"/>
>                             </file>
>                         </xsl:if>
>                     </xsl:for-each>
>                 </group>
>             </xsl:for-each-group>
>         </files>
>
>     </xsl:template>
> </xsl:stylesheet>
>
> EXAMPLE OUTPUT
>
> <files>
>     <group>
>         <file>ONE.xml</file>
>         <file>TWO.xml</file>
>     </group>
>     <group>
>         <file>THREE.xml</file>
>         <file>FOUR.xml</file>
>     </group>
>     <group>
>         <file>FIVE.xml</file>
>         <file>SIX.xml</file>
>     </group>
> </files>
>
>
>
>
> On Fri, Jun 4, 2010 at 12:24 PM, Alex Muir <alex.g.muir@gmail.com> wrote:
>
>> Thanks, looks good!
>>
>> I admit the recursive solution was giving me pause to implement.
>>
>>
>> On Fri, Jun 4, 2010 at 11:47 AM, Romain Deltour <rdeltour@gmail.com> wrote:
>>
>>> Would the solution, to have to read all input files in before processing
>>> the first set, be poor in terms of memory use?
>>>
>>>
>>> You can improve the pipeline depending on the most resource intensive
>>> step. If you want to reduce the number of XML documents parsed in memory, an
>>> alternative could be to work on the sequence of file paths returned by the
>>> p:directory-list rather than on the sequence of documents. In other words,
>>> you would move the resource-intensive p:load from the first p:for-each to
>>> the second:
>>>
>>> p:for-each => to create a sequence of 1000 paths from the flat list
>>> returned by p:directory-list
>>> (note the result of this first p:for-each is a sequence of 1000
>>> documents)
>>> p:wrap-sequence[@group-adjacent] => split the sequence of 1000 into
>>> 200-sets
>>> p:for-each => another iteration over the 5 packs of 200 files, to process
>>> one pack at a time, loading each document and then processing it
>>>
>>> Vojtech's idea of using recursion sounds good too.
>>>
>>> Romain.
>>>
>>> On 4 June 2010 at 11:27, Alex Muir wrote:
>>>
>>> Hi Romain,
>>>
>>> Your solution looks like a good one and you're not missing any points.
>>>
>>> Would the solution, to have to read all input files in before processing
>>> the first set, be poor in terms of memory use?
>>>
>>> Is there no way to read in the first 200 and process them, then read in the
>>> second 200 and process those, and so on?
>>>
>>> Thanks
>>> Alex
>>>
>>> On Thu, Jun 3, 2010 at 6:43 PM, Romain Deltour <rdeltour@gmail.com> wrote:
>>>
>>>> Hi Alex,
>>>>
>>>> If I understand your intent and your pipeline correctly, you should
>>>> instead use the @group-adjacent attribute of the p:wrap-sequence step to
>>>> pack 200 files at a time.
>>>>
>>>> Explanation:
>>>> In your pipeline, almost everything happens in one big p:for-each that
>>>> iterates over the 1000 files. The p:choose subpipeline is executed only
>>>> for every 200th file, and the wrapper's input is a sequence containing
>>>> just that one file.
>>>> So rather than grouping files into sets of 200, you ignore 199
>>>> files and wrap only the 200th in an element before processing it.
>>>>
>>>> What I would do is:
>>>>
>>>> p:for-each => to iterate through the 1000 files and load the documents
>>>> (note the result of this first p:for-each is a sequence of 1000
>>>> documents)
>>>> p:wrap-sequence[@group-adjacent] => split the sequence of 1000 into
>>>> 200-sets
>>>> p:for-each => another iteration over the 5 packs of 200 files, to
>>>> process each pack at a time
>>>>
>>>> I hope this helps and I'm not missing your point...
>>>>
>>>> BR
>>>> Romain.
>>>>
>>>> On 3 June 2010 at 18:32, Alex Muir wrote:
>>>>
>>>>  Hi,
>>>>
>>>> I'm trying to read ~10000 files within a for-each loop, wrap a selection
>>>> from each set of 200 files and process them to output 1 html file, sink the
>>>> processed files and continue with the remaining files processing 200 at a
>>>> time.
>>>>
>>>> Is that possible in xproc?
>>>>
>>>> I've got something like the following which I can't get to work. I think
>>>> that wrapper cannot be used within a for-each, is that the case?
>>>>
>>>> <p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
>>>>     xmlns:c="http://www.w3.org/ns/xproc-step"
>>>>     xmlns:cx="http://xmlcalabash.com/ns/extensions"
>>>>     name="wrapWithinForEach" version="1.0">
>>>>
>>>>     <p:input port="source">
>>>>         <p:inline>
>>>>             <xml/>
>>>>         </p:inline>
>>>>     </p:input>
>>>>
>>>>     <p:output port="result" sequence="true"/>
>>>>
>>>>     <p:declare-step type="cx:message" version="1.0">
>>>>         <p:input port="source"/>
>>>>         <p:output port="result"/>
>>>>         <p:option name="message" required="true"/>
>>>>     </p:declare-step>
>>>>
>>>>
>>>>     <!-- ***** Starting and Ending File Numbers ***** -->
>>>>     <p:variable name="startingFileNumber" select="'1'"/>
>>>>     <p:variable name="endingFileNumber" select="'10000'"/>
>>>>     <p:variable name="numberPerFile" select="'200'"/>
>>>>
>>>>     <!-- source and output folder variables -->
>>>>     <p:variable name="source-folder" select="'completed/XML/'"/>
>>>>     <p:variable name="output-folder" select="'MDNA/'"/>
>>>>     <p:variable name="error-folder" select="'MDNA/error/'"/>
>>>>     <p:variable name="exception-folder" select="'MDNA/exception/'"/>
>>>>
>>>>
>>>>     <p:directory-list>
>>>>         <p:with-option name="path" select="$source-folder">
>>>>             <p:empty/>
>>>>         </p:with-option>
>>>>     </p:directory-list>
>>>>
>>>>
>>>>     <p:for-each name="MDNA">
>>>>
>>>>
>>>>         <p:iteration-source
>>>>             select="//c:file[position() ge number($startingFileNumber)
>>>> and position() le number($endingFileNumber)]"/>
>>>>
>>>>         <p:variable name="fileName" select="c:file/@name"/>
>>>>         <p:variable name="startingIterationPosition"
>>>>             select="number(p:iteration-position()) +
>>>> number($startingFileNumber)-1"/>
>>>>
>>>>        <cx:message>
>>>>             <p:with-option name="message"
>>>>                 select="concat('-----------------------------',
>>>> 'Iteration-position:','  ', $startingIterationPosition, '  File: ',
>>>> $fileName,'-----------------------------')"
>>>>             />
>>>>         </cx:message>
>>>>
>>>>         <p:load>
>>>>             <p:with-option name="href"
>>>> select="concat($source-folder,$fileName)"/>
>>>>         </p:load>
>>>>
>>>>         <cx:message>
>>>>             <p:with-option name="message" select="'######
>>>> ExtractContent'"/>
>>>>         </cx:message>
>>>>         <p:xslt name="ExtractContent">
>>>>             <p:input port="source"/>
>>>>             <p:input port="stylesheet">
>>>>                 <p:document href="ExtractContent.xsl"/>
>>>>             </p:input>
>>>>             <p:input port="parameters">
>>>>                 <p:empty/>
>>>>             </p:input>
>>>>         </p:xslt>
>>>>
>>>>         <p:identity name="wrap"/>
>>>>
>>>>
>>>>         <p:choose>
>>>>             <p:when test="position() mod $numberPerFile eq 0">
>>>>                 <p:wrap-sequence wrapper="WRAP" name="wrapper">
>>>>                     <p:input port="source">
>>>>                         <p:pipe port="result" step="wrap"/>
>>>>                     </p:input>
>>>>                 </p:wrap-sequence>
>>>>
>>>>
>>>>                 <p:xslt name="CreateHTML">
>>>>                     <p:input port="source"/>
>>>>                     <p:input port="stylesheet">
>>>>                         <p:document href="CreateHTML.xsl"/>
>>>>                     </p:input>
>>>>                     <p:input port="parameters">
>>>>                         <p:empty/>
>>>>                     </p:input>
>>>>                 </p:xslt>
>>>>
>>>>
>>>>                 <p:identity name="out_file"/>
>>>>
>>>>                 <p:store name="OUT">
>>>>                     <p:with-option name="href"
>>>>                         select="concat($output-folder,
>>>> 'MDNASections','-',$startingFileNumber,'-' ,$endingFileNumber,'.html')">
>>>>                         <p:pipe step="out_file" port="result"/>
>>>>                     </p:with-option>
>>>>                 </p:store>
>>>>
>>>>                 <p:sink name="sinkIt"/>
>>>>
>>>>             </p:when>
>>>>         </p:choose>
>>>>
>>>>     </p:for-each>
>>>>
>>>>
>>>> </p:declare-step>
>>>>
>>>>
>>>>
>>>>
>>>> Regards
>>>>
>>>>
>>>> --
>>>> Alex
>>>>
>>>> An informal recording with one mic under a tree leads to some pretty
>>>> sweet acoustic sounds.
>>>> https://sites.google.com/site/greigconteh/albums/diabarte-and-sons


-- 
Alex

An informal recording with one mic under a tree leads to some pretty sweet
acoustic sounds.
https://sites.google.com/site/greigconteh/albums/diabarte-and-sons
Received on Saturday, 5 June 2010 12:08:10 GMT
