Re: manifest based processing from Jostein Austvik Jacobsen on 2014-02-19 (xproc-dev@w3.org from February 2014)

From: Jostein Austvik Jacobsen <josteinaj@gmail.com>
Date: Wed, 19 Feb 2014 12:35:58 +0100
To: James Fuller <jim@webcomposite.com>
Cc: XProc Dev <xproc-dev@w3.org>
Message-ID: <CAOCxfQedc9TWps7CMHkB+DM_sD_4D1BRiwzDWiz6O8_8pNspWw@mail.gmail.com>
So a pattern we're using is to provide the manifest (or "fileset") on the
primary input/output port, and "in-memory" documents on a secondary
input/output port. (Sometimes I also want to generate a report, in which
case another secondary input/output port is used.) So most steps in our
library are implemented with a signature similar to this:

<p:declare-step name="main" ...>
    <p:input port="fileset.in" primary="true"/>
    <p:input port="in-memory.in" sequence="true"/>

    <p:output port="fileset.out" primary="true"/>
    <p:output port="in-memory.out" sequence="true"/>
</p:declare-step>

This makes it relatively easy to connect multiple steps, although only one
of the ports can have a default connection (the fileset in this case), and
the rest will have to be explicitly connected:

<p:declare-step ...>
    <p:documentation>Convert from HTML to EPUB and validate input/output.</p:documentation>

    <p:option name="input-html-href" required="true"/>
    <p:option name="output-epub-href" required="true"/>

    <px:html-load name="html-load">
        <p:with-option name="href" select="$input-html-href"/>
    </px:html-load>

    <!-- fileset.in is primary, so it is connected to the previous step by default;
         in-memory.in has to be connected explicitly with p:pipe -->
    <px:html-validate name="html-validate">
        <p:input port="in-memory.in">
            <p:pipe port="in-memory.out" step="html-load"/>
        </p:input>
    </px:html-validate>

    <px:html-to-epub name="html-to-epub">
        <p:input port="in-memory.in">
            <p:pipe port="in-memory.out" step="html-validate"/>
        </p:input>
    </px:html-to-epub>

    <px:epub-validate name="epub-validate">
        <p:input port="in-memory.in">
            <p:pipe port="in-memory.out" step="html-to-epub"/>
        </p:input>
    </px:epub-validate>

    <px:epub-store name="epub-store">
        <p:input port="in-memory.in">
            <p:pipe port="in-memory.out" step="epub-validate"/>
        </p:input>
        <p:with-option name="href" select="$output-epub-href"/>
    </px:epub-store>

</p:declare-step>
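
For readers who have not seen a fileset before, the manifest travelling on the fileset.in/fileset.out ports might look roughly like the sketch below. This is only an illustration: the element names, the namespace URI, the base URI and the three file entries are assumptions (modelled on the DAISY Pipeline style of fileset), not something spelled out in this message.

```xml
<!-- Hypothetical manifest; vocabulary, namespace and entries are assumptions -->
<d:fileset xmlns:d="http://www.daisy.org/ns/pipeline/data"
           xml:base="file:/tmp/book/">
    <!-- each entry names one resource, with href resolved against xml:base -->
    <d:file href="index.html" media-type="application/xhtml+xml"/>
    <d:file href="style.css" media-type="text/css"/>
    <d:file href="cover.png" media-type="image/png"/>
</d:fileset>
```

The in-memory ports would then carry parsed versions of some of these resources; how the two are matched up is not described here.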


It would be useful if the "kind" attribute were more flexible (in XProc 1.0
it only appears on p:input and only accepts "document" or "parameter"). I think
this has been suggested before (by Romain?). If custom kinds were allowed, then
multiple ports could be primary:

<p:declare-step name="main" ...>
    <!-- "primary" attributes added for verbosity, ports would be primary
         by default since they are the only ones of their kind -->
    <p:input port="fileset.in" primary="true"/>
    <p:input port="in-memory.in" sequence="true" primary="true" kind="in-memory"/>

    <p:output port="fileset.out" primary="true"/>
    <p:output port="in-memory.out" sequence="true" primary="true" kind="in-memory"/>
</p:declare-step>

This would greatly reduce the size of the pipeline:

<p:declare-step ...>
    <p:documentation>Convert from HTML to EPUB and validate input/output.</p:documentation>

    <p:option name="input-html-href" required="true"/>
    <p:option name="output-epub-href" required="true"/>

    <px:html-load name="html-load">
        <p:with-option name="href" select="$input-html-href"/>
    </px:html-load>

    <px:html-validate name="html-validate"/>

    <px:html-to-epub name="html-to-epub"/>

    <px:epub-validate name="epub-validate"/>

    <px:epub-store name="epub-store">
        <p:with-option name="href" select="$output-epub-href"/>
    </px:epub-store>

</p:declare-step>





Jostein


On 19 February 2014 10:25, James Fuller <jim@webcomposite.com> wrote:

> A common idiom used in XProc is to define a manifest of
> documents/assets to work on and have that flow through the pipeline vs
> data documents flowing through.
>
> Typically, it's a collection of URIs that each require a pipeline of
> processing for each different content type / data type which then gets
> aggregated up into some final result structure.
>
> This approach sometimes leads to convoluted 'procedural' pipelines ...
> which are less reusable and harder to comprehend.
>
> Even with non-xml data flowing through (as proposed for v2), for
> example a zip file (EPUB), we have the same class of problem where the
> zip manifest is our routing table determining processing of secondary
> data assets.
>
> I would like to dig deeper into how we might be able to make life
> easier with these kinds of pipelines.
>
> Imagine passing a sequence of uris to a pipeline as primary input; the
> pipeline's main responsibility is to deal with the end result of
> processing (serialisation, etc) where each individual content type is
> processed by a separate pipeline.
>
> I can imagine a lot of ways of building this kind of thing with XProc
> v1 (and have) but wondering what could we enhance/add to vnext to
> simplify, making things easier to (re)use? The problems I see are:
>
> * how to deal with mapping a step/pipeline to a content type ?
> * default posture - mutation in place vs copy of data ?
> * dependencies - some uris need to be processed before others
>
> there are other issues that need thinking through but thought I would
> 'toss over the wall' to solicit opinion.
>
> Jim Fuller
>
>
Received on Wednesday, 19 February 2014 11:36:48 UTC