Re: manifest based processing from Romain Deltour on 2014-02-19 (xproc-dev@w3.org from February 2014)

From: Romain Deltour <rdeltour@gmail.com>
Date: Wed, 19 Feb 2014 14:07:57 +0100
To: Jostein Austvik Jacobsen <josteinaj@gmail.com>
Cc: James Fuller <jim@webcomposite.com>, XProc Dev <xproc-dev@w3.org>
Message-Id: <B2B06318-A306-4AA0-81E8-6969B11E3591@gmail.com>
Thanks Jostein for describing our take on the “manifest” document.

Let me add that:

a. The port name “in-memory” is really just a naming convention; in practice, the XProc processor may or may not keep those documents in memory.

b. We use the “manifest” document to store serialization options.

c. We also store the “original” URI of the files we do not touch but simply want to move (e.g. images, stylesheets), so that we can easily move them from the input file set to the output file set at the end of the workflow.

d. We have a whole set of primitive utilities to manipulate these file set documents: create, add files, move files, load an XML doc listed in a file set, store a whole file set, etc.

e. What I suggested (on Twitter) was not an extension of the “kind” attribute, but rather being able to declare the media-type expected on input/output, combined, if possible, with the possibility to implicitly connect ports depending on their type. I admit this is easier said than done, and I have not given it a lot of thought.
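
To make the serialization options and “original” URIs mentioned above more concrete, here is a minimal sketch of what such a file set document could look like. The element and attribute names (fileset, file, original-href) and all paths are assumptions made up for illustration, not necessarily the vocabulary our utilities actually use; only indent and omit-xml-declaration are borrowed from the standard XProc serialization options.

```xml
<!-- Hypothetical file set ("manifest") document; names and paths are illustrative assumptions. -->
<fileset xml:base="file:/tmp/output/">
    <!-- An XML document handled by the pipeline: its serialization options travel with its entry. -->
    <file href="content.xhtml" media-type="application/xhtml+xml"
          indent="true" omit-xml-declaration="false"/>
    <!-- Resources that are only moved: original-href records where each one comes from. -->
    <file href="images/cover.png" media-type="image/png"
          original-href="file:/tmp/input/images/cover.png"/>
    <file href="css/style.css" media-type="text/css"
          original-href="file:/tmp/input/css/style.css"/>
</fileset>
```

With entries like these, a final “store” step would only need to serialize the documents that were loaded and copy every other file from its original-href to its href.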

This approach has its pros and cons, but we use it consistently and it works reasonably well for us with XProc v1 :)

Romain.

On 19 févr. 2014, at 12:35, Jostein Austvik Jacobsen <josteinaj@gmail.com> wrote:

> So a pattern we're using is to provide the manifest (or "fileset") on the primary input/output port, and "in-memory" documents on a secondary input/output port. (Sometimes I also want to generate a report, in which case another secondary input/output port is used.) So most steps in our library are implemented with a signature similar to this:
> 
> <p:declare-step name="main" ...>
>     <p:input port="fileset.in" primary="true"/>
>     <p:input port="in-memory.in" sequence="true"/>
>     <p:output port="fileset.out" primary="true"/>
>     <p:output port="in-memory.out" sequence="true"/>
> </p:declare-step>
> 
> This makes it relatively easy to connect multiple steps, although only one of the ports can have a default connection (the fileset in this case), and the rest will have to be explicitly connected:
> 
> <p:declare-step ...>
>     <p:documentation>Convert from HTML to EPUB and validate input/output.</p:documentation>
>     
>     <p:option name="input-html-href" required="true"/>
>     <p:option name="output-epub-href" required="true"/>
> 
>     <px:html-load name="html-load">
>         <p:with-option name="href" select="$input-html-href"/>
>     </px:html-load>
>     
>     <px:html-validate name="html-validate">
>         <p:input port="in-memory.in">
>             <p:pipe port="in-memory.out" step="html-load"/>
>         </p:input>
>     </px:html-validate>
>     
>     <px:html-to-epub name="html-to-epub">
>         <p:input port="in-memory.in">
>             <p:pipe port="in-memory.out" step="html-validate"/>
>         </p:input>
>     </px:html-to-epub>
>     
>     <px:epub-validate name="epub-validate">
>         <p:input port="in-memory.in">
>             <p:pipe port="in-memory.out" step="html-to-epub"/>
>         </p:input>
>     </px:epub-validate>
>     
>     <px:epub-store name="epub-store">
>         <p:input port="in-memory.in">
>             <p:pipe port="in-memory.out" step="epub-validate"/>
>         </p:input>
>         <p:with-option name="href" select="$output-epub-href"/>
>     </px:epub-store>
> 
> </p:declare-step>
> 
> 
> It would be useful if the "kind" attribute were more flexible. I think this has been suggested before (by Romain?). If custom kinds were allowed, then multiple ports could be primary:
> 
> <p:declare-step name="main" ...>
>     <!-- "primary" attributes added for verbosity, ports would be primary by default since they are the only ones of their kind -->
>     <p:input port="fileset.in" primary="true"/>
>     <p:input port="in-memory.in" sequence="true" primary="true" kind="in-memory"/>
>     <p:output port="fileset.out" primary="true"/>
>     <p:output port="in-memory.out" sequence="true" primary="true" kind="in-memory"/>
> </p:declare-step>
> 
> This would greatly reduce the size of the pipeline:
> 
> <p:declare-step ...>
>     <p:documentation>Convert from HTML to EPUB and validate input/output.</p:documentation>
>     
>     <p:option name="input-html-href" required="true"/>
>     <p:option name="output-epub-href" required="true"/>
> 
>     <px:html-load name="html-load">
>         <p:with-option name="href" select="$input-html-href"/>
>     </px:html-load>
>     
>     <px:html-validate name="html-validate"/>
>     
>     <px:html-to-epub name="html-to-epub"/>
>     
>     <px:epub-validate name="epub-validate"/>
>     
>     <px:epub-store name="epub-store">
>         <p:with-option name="href" select="$output-epub-href"/>
>     </px:epub-store>
> 
> </p:declare-step>
> 
> 
> 
> 
> 
> Jostein
> 
> 
> On 19 February 2014 10:25, James Fuller <jim@webcomposite.com> wrote:
> A common idiom used in XProc is to define a manifest of
> documents/assets to work on and have that flow through the pipeline vs
> data documents flowing through.
> 
> Typically, it's a collection of URIs that each require a pipeline of
> processing for each different content type / data type which then gets
> aggregated up into some final result structure.
> 
> This approach sometimes leads to convoluted 'procedural' pipelines ...
> which are less reusable and harder to comprehend.
> 
> Even with non-xml data flowing through (as proposed for v2), for
> example a zip file (EPUB), we have the same class of problem where the
> zip manifest is our routing table determining processing of secondary
> data assets.
> 
> I would like to dig deeper into how we might be able to make life
> easier with these kinds of pipelines.
> 
> Imagine passing a sequence of uris to a pipeline as primary input; the
> pipeline's main responsibility is to deal with end result of
> processing (serialisation, etc) where each individual content type is
> processed by a separate pipeline.
> 
> I can imagine a lot of ways of building this kind of thing with XProc
> v1 (and have), but I am wondering what we could enhance/add to vnext to
> simplify, making things easier to (re)use? The problems I see are:
> 
> * how to deal with mapping a step/pipeline to a content type ?
> * default posture - mutation in place vs copy of data ?
> * dependencies - some uris need to be processed before others
> 
> there are other issues that need thinking through but thought I would
> 'toss over the wall' to solicit opinion.
> 
> Jim Fuller
> 
>
Received on Wednesday, 19 February 2014 13:08:29 UTC