- From: Romain Deltour <rdeltour@gmail.com>
- Date: Fri, 4 Jun 2010 22:24:44 +0200
- To: xproc-dev@w3.org
- Message-Id: <148ADAAE-0F4B-4408-853E-11C0F16437C5@gmail.com>
Hi again,
I tried a pure XProc equivalent of your XSLT (a little practice never
hurts ;-), and here's the result (it's shorter too):
<p:for-each>
<p:iteration-source select="//c:file"/>
<p:identity/>
</p:for-each>
<p:wrap-sequence wrapper="c:group"
    group-adjacent="xs:integer((position()-1) div 2)"/>
<p:wrap-sequence wrapper="c:files"/>
1. The first p:for-each splits the flat list into a sequence of c:file
documents.
2. The first p:wrap-sequence groups that sequence into 2-packs using the
group-adjacent feature.
3. The last p:wrap-sequence wraps the sequence of 2-packs in a single
document.
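
The grouping key used in step 2 can be tried outside XProc. The following is only an illustrative Python sketch, not part of the pipeline: the function name group_adjacent and the sample file names are assumptions, and integer division stands in for xs:integer((position()-1) div 2).

```python
from itertools import groupby


def group_adjacent(items, size):
    """Group consecutive items that share the key (position - 1) // size.

    Positions are 1-based, mirroring XPath's position().
    """
    keyed = (((pos - 1) // size, item) for pos, item in enumerate(items, start=1))
    # groupby only merges adjacent entries with equal keys, like @group-adjacent.
    return [[item for _, item in grp]
            for _, grp in groupby(keyed, key=lambda pair: pair[0])]


# Hypothetical file names, taken from the example output quoted in this thread.
names = ["ONE.xml", "TWO.xml", "THREE.xml", "FOUR.xml", "FIVE.xml", "SIX.xml"]
packs = group_adjacent(names, 2)
```

With a pack size of 2 the keys run 0, 0, 1, 1, 2, 2, so each pair of neighbours ends up in the same pack; an odd number of items simply leaves a shorter last pack.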
Romain.
PS: there seems to be a bug in Calabash, which doesn't allow using
variables in the @group-adjacent expression.
On 4 June 2010, at 18:25, Alex Muir wrote:
> Hi,
>
> Well, I ended up modifying the p:directory-list output with a p:xslt,
> given I didn't know how to do it using XProc, and it's easy.
>
> So this XProc isn't yet processing the files (will get to that now)
> but does the first step: it reads the directory list and groups n file
> names per group using the param filePerGroup. See XProc, XSLT and
> output below.
>
> <?xml version="1.0" encoding="UTF-8"?>
> <p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
> xmlns:cx="http://xmlcalabash.com/ns/extensions"
> xmlns:c="http://www.w3.org/ns/xproc-step" name="chunk"
> version="1.0">
>
> <p:input port="source">
> <p:empty/>
> </p:input>
>
> <p:variable name="source-folder" select="'in/'"/>
> <p:variable name="output-folder" select="'out/'"/>
>
> <p:directory-list>
> <p:with-option name="path" select="$source-folder">
> <p:empty/>
> </p:with-option>
> </p:directory-list>
>
> <p:xslt version="1.0" name="chunkFiles">
> <p:input port="stylesheet">
> <p:document href="chunkFiles.xsl"/>
> </p:input>
> <p:with-param name="filePerGroup" select="2"/>
> <p:input port="parameters">
> <p:empty/>
> </p:input>
> </p:xslt>
>
>
> <p:store name="store">
> <p:with-option name="href"
> select="concat($output-folder,'directory-list.xml')"/>
> </p:store>
>
> </p:declare-step>
>
>
> XSLT FILE:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
> xmlns:xs="http://www.w3.org/2001/XMLSchema"
> exclude-result-prefixes="#all"
> xmlns:c="http://www.w3.org/ns/xproc-step" version="2.0">
> <xsl:output method="xml" indent="yes"/>
> <xsl:param name="filePerGroup"/>
> <xsl:template match="/c:directory">
> <xsl:variable name="directory" as="element()*">
>
> <chunk/>
> <xsl:for-each select="c:file">
>
> <file>
> <xsl:copy-of select="@*"/>
> </file>
> <xsl:if test="exists(following-sibling::c:file) and
> position() mod xs:integer($filePerGroup) eq 0">
> <chunk/>
> </xsl:if>
>
> </xsl:for-each>
>
> </xsl:variable>
>
> <!--
> <xsl:copy-of select="$directory"/>-->
>
>
> <files>
> <xsl:for-each-group select="$directory"
> group-starting-with="chunk">
> <group>
> <xsl:for-each select="current-group()">
> <xsl:if test="self::file">
> <file>
> <xsl:apply-templates select="file|@*"/>
> </file>
> </xsl:if>
> </xsl:for-each>
> </group>
> </xsl:for-each-group>
> </files>
>
> </xsl:template>
> </xsl:stylesheet>
>
> EXAMPLE OUTPUT
>
> <files>
> <group>
> <file>ONE.xml</file>
> <file>TWO.xml</file>
> </group>
> <group>
> <file>THREE.xml</file>
> <file>FOUR.xml</file>
> </group>
> <group>
> <file>FIVE.xml</file>
> <file>SIX.xml</file>
> </group>
> </files>
>
>
>
>
> On Fri, Jun 4, 2010 at 12:24 PM, Alex Muir <alex.g.muir@gmail.com>
> wrote:
> Thanks, looks good!
>
> I admit the recursive solution was giving me pause to implement.
>
>
> On Fri, Jun 4, 2010 at 11:47 AM, Romain Deltour <rdeltour@gmail.com>
> wrote:
>> Would the solution, to have to read all input files in before
>> processing the first set, be poor in terms of memory use?
>
>
> You can improve the pipeline depending on which step is the most
> resource-intensive. If you want to reduce the number of XML documents
> parsed in memory, an alternative could be to work on the sequence of
> file paths returned by p:directory-list rather than on the
> sequence of documents. In other words, you would move the
> resource-intensive p:load from the first p:for-each to the second:
>
> p:for-each => to create a sequence of 1000 paths from the flat list
> returned by p:directory-list
> (note the result of this first p:for-each is a sequence of 1000
> c:file documents)
> p:wrap-sequence[@group-adjacent] => split the sequence of 1000 into
> 200-sets
> p:for-each => another iteration over the 5 packs of 200 files, to
> process one pack at a time, loading its documents and then processing them
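
The point of the variant above is that only one pack of documents needs to be parsed at any time. A minimal Python sketch of that idea follows; it is not XProc, and the names batches, process_in_batches, load and process are hypothetical stand-ins for the p:wrap-sequence grouping, the second p:for-each, p:load and the per-pack processing.

```python
from itertools import islice


def batches(paths, size):
    """Yield successive lists of at most `size` paths, in order."""
    it = iter(paths)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def process_in_batches(paths, size, load, process):
    """Group cheap paths first, then load and process one batch at a time."""
    results = []
    for batch in batches(paths, size):
        # Only the documents of the current batch are held here; the list
        # is rebound on the next iteration, so earlier batches can be freed.
        docs = [load(path) for path in batch]
        results.append(process(docs))
    return results
```

Grouping the paths is cheap, so the expensive load is deferred until a pack is actually processed.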
>
> Vojtech's idea of using recursion sounds good too.
>
> Romain.
>
> On 4 June 2010, at 11:27, Alex Muir wrote:
>
>> Hi Romain,
>>
>> Your solution looks like a good one and you're not missing any points.
>>
>> Would the solution, to have to read all input files in before
>> processing the first set, be poor in terms of memory use?
>>
>> There is no way to read in the first 200 and process them and read
>> in the second 200 and process those and so on?
>>
>> Thanks
>> Alex
>>
>> On Thu, Jun 3, 2010 at 6:43 PM, Romain Deltour <rdeltour@gmail.com>
>> wrote:
>> Hi Alex,
>>
>> If I'm understanding your intent and your pipeline correctly, you
>> should rather use the @group-adjacent attribute of the
>> p:wrap-sequence step to pack 200 files at a time.
>>
>> Explanation:
>> In your pipeline, almost everything happens in one big p:for-each
>> that iterates over the 1000 files. The p:choose subpipeline is
>> executed only for every 200th file, and the wrapper's input is a
>> sequence containing just that one file (modulo 200).
>> In effect, rather than grouping files by sets of 200, you ignore 199
>> files and wrap only the 200th in an element before processing it.
>>
>> What I would do is:
>>
>> p:for-each => to iterate through the 1000 files and load the
>> documents
>> (note the result of this first p:for-each is a sequence of 1000
>> documents)
>> p:wrap-sequence[@group-adjacent] => split the sequence of 1000 into
>> 200-sets
>> p:for-each => another iteration over the 5 packs of 200 files, to
>> process each pack at a time
>>
>> I hope this helps and I'm not missing your point...
>>
>> BR
>> Romain.
>>
>> On 3 June 2010, at 18:32, Alex Muir wrote:
>>
>>> Hi,
>>>
>>> I'm trying to read ~10000 files within a for-each loop, wrap a
>>> selection from each set of 200 files and process them to output 1
>>> html file, sink the processed files and continue with the
>>> remaining files processing 200 at a time.
>>>
>>> Is that possible in xproc?
>>>
>>> I've got something like the following which I can't get to work. I
>>> think that wrapper cannot be used within a for-each, is that the
>>> case?
>>>
>>> <p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
>>> xmlns:c="http://www.w3.org/ns/xproc-step"
>>> xmlns:cx="http://xmlcalabash.com/ns/extensions"
>>> name="wrapWithinForEach" version="1.0">
>>>
>>> <p:input port="source">
>>> <p:inline>
>>> <xml/>
>>> </p:inline>
>>> </p:input>
>>>
>>> <p:output port="result" sequence="true"/>
>>>
>>> <p:declare-step type="cx:message" version="1.0">
>>> <p:input port="source"/>
>>> <p:output port="result"/>
>>> <p:option name="message" required="true"/>
>>> </p:declare-step>
>>>
>>>
>>> <!-- ***** Starting and Ending File Numbers ***** -->
>>> <p:variable name="startingFileNumber" select="'1'"/>
>>> <p:variable name="endingFileNumber" select="'10000'"/>
>>> <p:variable name="numberPerFile" select="'200'"/>
>>>
>>> <!-- source and output folder variables -->
>>> <p:variable name="source-folder" select="'completed/XML/'"/>
>>> <p:variable name="output-folder" select="'MDNA/'"/>
>>> <p:variable name="error-folder" select="'MDNA/error/'"/>
>>> <p:variable name="exception-folder" select="'MDNA/exception/'"/>
>>>
>>>
>>> <p:directory-list>
>>> <p:with-option name="path" select="$source-folder">
>>> <p:empty/>
>>> </p:with-option>
>>> </p:directory-list>
>>>
>>>
>>> <p:for-each name="MDNA">
>>>
>>>
>>> <p:iteration-source
>>> select="//c:file[position() ge
>>> number($startingFileNumber) and position() le
>>> number($endingFileNumber)]"/>
>>>
>>> <p:variable name="fileName" select="c:file/@name"/>
>>> <p:variable name="startingIterationPosition"
>>> select="number(p:iteration-position()) +
>>> number($startingFileNumber)-1"/>
>>>
>>> <cx:message>
>>> <p:with-option name="message"
>>> select="concat('-----------------------------',
>>> 'Iteration-position:',' ', $startingIterationPosition,
>>> ' File: ', $fileName,'-----------------------------')"
>>> />
>>> </cx:message>
>>>
>>> <p:load>
>>> <p:with-option name="href"
>>> select="concat($source-folder,$fileName)"/>
>>> </p:load>
>>>
>>> <cx:message>
>>> <p:with-option name="message"
>>> select="'###### ExtractContent'"/>
>>> </cx:message>
>>> <p:xslt name="ExtractContent">
>>> <p:input port="source"/>
>>> <p:input port="stylesheet">
>>> <p:document href="ExtractContent.xsl"/>
>>> </p:input>
>>> <p:input port="parameters">
>>> <p:empty/>
>>> </p:input>
>>> </p:xslt>
>>>
>>> <p:identity name="wrap"/>
>>>
>>>
>>> <p:choose>
>>> <p:when test="position() mod $numberPerFile eq 0">
>>> <p:wrap-sequence wrapper="WRAP" name="wrapper">
>>> <p:input port="source">
>>> <p:pipe port="result" step="wrap"/>
>>> </p:input>
>>> </p:wrap-sequence>
>>>
>>>
>>> <p:xslt name="CreateHTML">
>>> <p:input port="source"/>
>>> <p:input port="stylesheet">
>>> <p:document href="CreateHTML.xsl"/>
>>> </p:input>
>>> <p:input port="parameters">
>>> <p:empty/>
>>> </p:input>
>>> </p:xslt>
>>>
>>>
>>> <p:identity name="out_file"/>
>>>
>>> <p:store name="OUT">
>>> <p:with-option name="href"
>>> select="concat($output-folder,
>>> 'MDNASections','-',$startingFileNumber,'-' ,
>>> $endingFileNumber,'.html')">
>>> <p:pipe step="out_file" port="result"/>
>>> </p:with-option>
>>> </p:store>
>>>
>>> <p:sink name="sinkIt"/>
>>>
>>> </p:when>
>>> </p:choose>
>>>
>>> </p:for-each>
>>>
>>>
>>> </p:declare-step>
>>>
>>>
>>>
>>>
>>> Regards
>>>
>>>
>>> --
>>> Alex
>>>
>>> An informal recording with one mic under a tree leads to some
>>> pretty sweet acoustic sounds.
>>> https://sites.google.com/site/greigconteh/albums/diabarte-and-sons
Received on Friday, 4 June 2010 20:25:23 UTC