- From: Alex Muir <alex.g.muir@gmail.com>
- Date: Fri, 4 Jun 2010 12:24:16 +0000
- To: Romain Deltour <rdeltour@gmail.com>
- Cc: xproc-dev@w3.org
- Message-ID: <AANLkTimiKxhnT6bFNp786PIyoYwjWU0K_QbAYFW6JvE2@mail.gmail.com>
Thanks, looks good! I admit the recursive solution was giving me pause.

On Fri, Jun 4, 2010 at 11:47 AM, Romain Deltour <rdeltour@gmail.com> wrote:

> Would the solution, to have to read all input files in before processing
> the first set, be poor in terms of memory use?
>
>
> You can improve the pipeline depending on the most resource-intensive step.
> If you want to reduce the number of XML documents parsed in memory, an
> alternative could be to work on the sequence of file paths returned by
> p:directory-list rather than on the sequence of documents. In other words,
> you would move the resource-intensive p:load from the first p:for-each to
> the second:
>
> p:for-each => to create a sequence of 1000 paths from the flat list returned
> by p:directory-list
> (note the result of this first p:for-each is a sequence of 1000 documents)
> p:wrap-sequence[@group-adjacent] => split the sequence of 1000 into 200-sets
> p:for-each => another iteration over the 5 packs of 200 files, to process
> each pack at a time, loading each document then processing it
>
> Vojtech's idea of using recursion sounds good too.
>
> Romain.
>
> On 4 June 2010 at 11:27, Alex Muir wrote:
>
> Hi Romain,
>
> Your solution looks like a good one and you're not missing any points.
>
> Would the solution, to have to read all input files in before processing
> the first set, be poor in terms of memory use?
>
> Is there no way to read in the first 200 and process them, then read in the
> second 200 and process those, and so on?
>
> Thanks
> Alex
>
> On Thu, Jun 3, 2010 at 6:43 PM, Romain Deltour <rdeltour@gmail.com> wrote:
>
>> Hi Alex,
>>
>> If I understand your intent and your pipeline correctly, you should
>> rather use the @group-adjacent attribute of the p:wrap-sequence step to pack
>> 200 files at a time.
>>
>> Explanation:
>> In your pipeline, almost everything happens in one big p:for-each that
>> iterates over the 1000 files.
>> The p:choose subpipeline is executed only for every 200th file, and the
>> wrapper's input is a sequence containing just that one file (modulo 200).
>> So rather than grouping files into sets of 200, you ignore 199 files
>> and wrap only the 200th in an element before processing it.
>>
>> What I would do is:
>>
>> p:for-each => to iterate through the 1000 files and load the documents
>> (note the result of this first p:for-each is a sequence of 1000 documents)
>> p:wrap-sequence[@group-adjacent] => split the sequence of 1000 into
>> 200-sets
>> p:for-each => another iteration over the 5 packs of 200 files, to process
>> each pack at a time
>>
>> I hope this helps and I'm not missing your point...
>>
>> BR
>> Romain.
>>
>> On 3 June 2010 at 18:32, Alex Muir wrote:
>>
>> Hi,
>>
>> I'm trying to read ~10000 files within a for-each loop, wrap a selection
>> from each set of 200 files and process them to output 1 HTML file, sink the
>> processed files and continue with the remaining files, processing 200 at a
>> time.
>>
>> Is that possible in XProc?
>>
>> I've got something like the following, which I can't get to work. I think
>> that the wrapper cannot be used within a for-each; is that the case?
>>
>> <p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
>>     xmlns:c="http://www.w3.org/ns/xproc-step"
>>     xmlns:cx="http://xmlcalabash.com/ns/extensions"
>>     name="wrapWithinForEach" version="1.0">
>>
>>   <p:input port="source">
>>     <p:inline>
>>       <xml/>
>>     </p:inline>
>>   </p:input>
>>
>>   <p:output port="result" sequence="true"/>
>>
>>   <p:declare-step type="cx:message" version="1.0">
>>     <p:input port="source"/>
>>     <p:output port="result"/>
>>     <p:option name="message" required="true"/>
>>   </p:declare-step>
>>
>>
>>   <!-- ***** Starting and Ending File Numbers ***** -->
>>   <p:variable name="startingFileNumber" select="'1'"/>
>>   <p:variable name="endingFileNumber" select="'10000'"/>
>>   <p:variable name="numberPerFile" select="'200'"/>
>>
>>   <!-- source and output folder variables -->
>>   <p:variable name="source-folder" select="'completed/XML/'"/>
>>   <p:variable name="output-folder" select="'MDNA/'"/>
>>   <p:variable name="error-folder" select="'MDNA/error/'"/>
>>   <p:variable name="exception-folder" select="'MDNA/exception/'"/>
>>
>>
>>   <p:directory-list>
>>     <p:with-option name="path" select="$source-folder">
>>       <p:empty/>
>>     </p:with-option>
>>   </p:directory-list>
>>
>>
>>   <p:for-each name="MDNA">
>>
>>     <p:iteration-source
>>       select="//c:file[position() ge number($startingFileNumber) and
>>               position() le number($endingFileNumber)]"/>
>>
>>     <p:variable name="fileName" select="c:file/@name"/>
>>     <p:variable name="startingIterationPosition"
>>       select="number(p:iteration-position()) +
>>               number($startingFileNumber)-1"/>
>>
>>     <cx:message>
>>       <p:with-option name="message"
>>         select="concat('-----------------------------',
>>                 'Iteration-position:',' ', $startingIterationPosition, ' File: ',
>>                 $fileName,'-----------------------------')"
>>       />
>>     </cx:message>
>>
>>     <p:load>
>>       <p:with-option name="href"
>>         select="concat($source-folder,$fileName)"/>
>>     </p:load>
>>
>>     <cx:message>
>>       <p:with-option name="message" select="'###### ExtractContent'"/>
>>     </cx:message>
>>
>>     <p:xslt name="ExtractContent">
>>       <p:input port="source"/>
>>       <p:input port="stylesheet">
>>         <p:document href="ExtractContent.xsl"/>
>>       </p:input>
>>       <p:input port="parameters">
>>         <p:empty/>
>>       </p:input>
>>     </p:xslt>
>>
>>     <p:identity name="wrap"/>
>>
>>
>>     <p:choose>
>>       <p:when test="position() mod $numberPerFile eq 0">
>>         <p:wrap-sequence wrapper="WRAP" name="wrapper">
>>           <p:input port="source">
>>             <p:pipe port="result" step="wrap"/>
>>           </p:input>
>>         </p:wrap-sequence>
>>
>>
>>         <p:xslt name="CreateHTML">
>>           <p:input port="source"/>
>>           <p:input port="stylesheet">
>>             <p:document href="CreateHTML.xsl"/>
>>           </p:input>
>>           <p:input port="parameters">
>>             <p:empty/>
>>           </p:input>
>>         </p:xslt>
>>
>>
>>         <p:identity name="out_file"/>
>>
>>         <p:store name="OUT">
>>           <p:with-option name="href"
>>             select="concat($output-folder,
>>                     'MDNASections','-',$startingFileNumber,'-' ,$endingFileNumber,'.html')">
>>             <p:pipe step="out_file" port="result"/>
>>           </p:with-option>
>>         </p:store>
>>
>>         <p:sink name="sinkIt"/>
>>
>>       </p:when>
>>     </p:choose>
>>
>>   </p:for-each>
>>
>>
>> </p:declare-step>
>>
>>
>> Regards
>>
>>
>> --
>> Alex
>>
>> An informal recording with one mic under a tree leads to some pretty sweet
>> acoustic sounds.
>> https://sites.google.com/site/greigconteh/albums/diabarte-and-sons
>
>
> --
> Alex

--
Alex
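
For reference, the path-based variant Romain outlines (list the paths, group them with p:wrap-sequence, then load and process one pack at a time) might look roughly like the XProc 1.0 sketch below. It is only an illustration under stated assumptions, not a tested pipeline: the folder and stylesheet names are taken from the original pipeline, while the "pack" wrapper name, the step names, and the per-pack output file name are hypothetical. It also assumes that position() inside group-adjacent reflects each document's position in the input sequence.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:c="http://www.w3.org/ns/xproc-step"
                name="batchByPaths" version="1.0">

  <!-- 1. List the source folder: the result is one c:directory document. -->
  <p:directory-list path="completed/XML/"/>

  <!-- 2. Emit one small c:file document per entry; no file is loaded yet. -->
  <p:for-each name="paths">
    <p:iteration-source select="/c:directory/c:file"/>
    <p:identity/>
  </p:for-each>

  <!-- 3. Group adjacent c:file documents into packs of 200.
          "pack" is a hypothetical wrapper name; position() is assumed to be
          the document's position in the sequence. -->
  <p:wrap-sequence wrapper="pack"
                   group-adjacent="floor((position() - 1) div 200)"/>

  <!-- 4. Process one pack at a time; documents are only loaded here. -->
  <p:for-each name="packs">
    <!-- Hypothetical naming scheme: number output files by pack. -->
    <p:variable name="packNumber" select="p:iteration-position()"/>

    <p:for-each name="loadAndExtract">
      <p:iteration-source select="/pack/c:file"/>
      <p:variable name="fileName" select="/c:file/@name"/>

      <p:load>
        <p:with-option name="href"
                       select="concat('completed/XML/', $fileName)"/>
      </p:load>

      <p:xslt name="ExtractContent">
        <p:input port="stylesheet">
          <p:document href="ExtractContent.xsl"/>
        </p:input>
        <p:input port="parameters">
          <p:empty/>
        </p:input>
      </p:xslt>
    </p:for-each>

    <!-- Wrap the (up to) 200 extracted documents of this pack. -->
    <p:wrap-sequence wrapper="WRAP"/>

    <p:xslt name="CreateHTML">
      <p:input port="stylesheet">
        <p:document href="CreateHTML.xsl"/>
      </p:input>
      <p:input port="parameters">
        <p:empty/>
      </p:input>
    </p:xslt>

    <!-- p:store has no primary output, so no p:sink is needed after it. -->
    <p:store>
      <p:with-option name="href"
                     select="concat('MDNA/MDNASections-', $packNumber, '.html')"/>
    </p:store>
  </p:for-each>

</p:declare-step>
```

Because only the lightweight c:file entries are held as a full sequence, at most one pack of loaded documents needs to be in play at a time, although how much memory is actually released between packs depends on the XProc processor.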
Received on Friday, 4 June 2010 12:24:49 UTC