Re: A "processing model" proposal

Richard & all,

Here is how you do this with XPL [1]:

<p:pipeline version = "1.0"
             xmlns:p="http://www.orbeon.com/oxf/pipeline"
             xmlns:xpl="http://www.orbeon.com/oxf/xpl/standard"
             xmlns:my="http://www.example.org/xpl/my-components">

     <p:input name="document-1"/>
     <p:input name="document-2"/>
     <p:output name="result" infosetref="#diff"/>

     <p:processor name="xpl:xslt">
         <p:input name="data" infosetref="#document-1"/>
         <p:input name="stylesheet" infosetref="strip-ids.xsl"/>
         <p:output name="data" infoset="stripped-1"/>
     </p:processor>

     <p:processor name="xpl:xslt">
         <p:input name="data" infosetref="#document-2"/>
         <p:input name="stylesheet" infosetref="strip-ids.xsl"/>
         <p:output name="data" infoset="stripped-2"/>
     </p:processor>

     <p:processor name="my:diff">
         <p:input name="doc1" infosetref="#stripped-1"/>
         <p:input name="doc2" infosetref="#stripped-2"/>
         <p:output name="diff" infoset="diff"/>
     </p:processor>

</p:pipeline>

The syntax is straightforward, but feel free to ask questions. Note
that the stylesheets are *not* "parameters", as there is no reason to
distinguish between the stylesheet and the main input of the
transformation: both are XML documents.

The processing model is lazy:

1. The caller of the pipeline reads the "result" output.

2. This means the pipeline must execute the "my:diff" processor.

3. In turn that processor requests its inputs: "doc1" and "doc2".

4. When "doc1" is requested, the first "xpl:xslt" processor must run.

5. That processor in turn decides what it needs: its "stylesheet" and
    "data" inputs.

6. When the "my:diff" processor is done reading "doc1", it reads its
    "doc2" input, which obtains the associated infoset through a
    similar process. Then it can produce its "diff" output.
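
As a rough illustration of the steps above, here is a minimal Python
sketch of a pull-based pipeline. It is not XPL engine code: the Infoset
class, the xslt() and diff() helpers, and the token-stripping
"stylesheet" are all made up for the example, and plain strings stand
in for XML infosets. Nothing runs until the caller reads the result,
and the trace records the order in which infosets get produced.

```python
from typing import Callable, List, Optional, Tuple


class Infoset:
    """A named infoset that is produced only when somebody reads it."""

    def __init__(self, name: str, produce: Callable[[], str], trace: List[str]):
        self.name = name
        self._produce = produce
        self._trace = trace
        self._value: Optional[str] = None

    def read(self) -> str:
        # Run the producer on first read only; later reads reuse the value.
        if self._value is None:
            self._trace.append(self.name)
            self._value = self._produce()
        return self._value


def xslt(name: str, data: Infoset, stylesheet: Infoset, trace: List[str]) -> Infoset:
    """Hypothetical stand-in for xpl:xslt: drops tokens with a given prefix."""
    def run() -> str:
        prefix = stylesheet.read()      # the stylesheet is just another input
        tokens = data.read().split()
        return " ".join(t for t in tokens if not t.startswith(prefix))
    return Infoset(name, run, trace)


def diff(name: str, doc1: Infoset, doc2: Infoset, trace: List[str]) -> Infoset:
    """Hypothetical stand-in for my:diff: reads doc1 first, then doc2."""
    def run() -> str:
        first = doc1.read()
        second = doc2.read()
        return "identical" if first == second else "different"
    return Infoset(name, run, trace)


def build_pipeline(text1: str, text2: str) -> Tuple[Infoset, List[str]]:
    """Wire up the pipeline; building it does not execute anything."""
    trace: List[str] = []
    document_1 = Infoset("document-1", lambda: text1, trace)
    document_2 = Infoset("document-2", lambda: text2, trace)
    sheet = Infoset("strip-ids.xsl", lambda: "id=", trace)
    stripped_1 = xslt("stripped-1", document_1, sheet, trace)
    stripped_2 = xslt("stripped-2", document_2, sheet, trace)
    return diff("diff", stripped_1, stripped_2, trace), trace


result, trace = build_pipeline("a id=1 b", "a id=2 b")
print(trace)          # [] : nothing has run yet
print(result.read())  # identical
print(trace)
# ['diff', 'stripped-1', 'strip-ids.xsl', 'document-1', 'stripped-2', 'document-2']
```

The trace follows the numbered steps: reading "result" triggers the
diff, which pulls "doc1" (and with it the first transformation and its
two inputs) before pulling "doc2".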

Quite straightforward as well. You can imagine an implementation where
"doc1" and "doc2" are fetched in parallel.

The "lazy" approach is really very natural when you implement the
pipeline engine, as it comes down to just looking at what you need and
getting it.
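
The parallel variant mentioned above could look something like the
following Python sketch, which uses a thread pool from the standard
library. Again this is only an illustration under assumptions: the
strip_ids() and diff_in_parallel() functions are hypothetical, and
strings stand in for infosets.

```python
from concurrent.futures import ThreadPoolExecutor


def strip_ids(text: str) -> str:
    """Hypothetical stand-in for the XSLT step: drop tokens starting with id=."""
    return " ".join(t for t in text.split() if not t.startswith("id="))


def diff_in_parallel(text1: str, text2: str) -> str:
    """Request both diff inputs at once, then compare once both are ready."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        doc1 = pool.submit(strip_ids, text1)   # starts producing "doc1"
        doc2 = pool.submit(strip_ids, text2)   # starts producing "doc2"
        # result() blocks until the corresponding input is available.
        return "identical" if doc1.result() == doc2.result() else "different"


print(diff_in_parallel("a id=1 b", "a id=2 b"))  # identical
```

The diff step still only asks for what it needs; the difference is that
it asks for both inputs before waiting on either.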

-Erik

[1] http://www.w3.org/Submission/xpl/

Richard Tobin wrote:
 > Here is my "diff" use case that I talked about.
 >
 > Suppose we have two XML documents that we want to compare.  But these
 > documents have some irrelevant features - automatically assigned id
 > attributes say - that we don't want to count as differences.  So we
 > run each file through an XSLT stylesheet to strip out those features
 > before running diff.
 >
 > Now with a graphical interface it would be easy to draw the pipeline
 > we want.  There would be two lines coming in at the top.  Each would
 > go to a box (a step) consisting of an XSLT transform with a parameter
 > specifying the stylesheet.  Each XSLT box would have a line coming
 > out, and the two lines would go down to a diff box, which would have
 > one line coming out at the bottom.
 >
 > Writing this pipeline as a unix shell script is straightforward but
 > ugly, because we have to use temporary files as the shell doesn't let
 > us write a command with two inputs from other programs:
 >
 >    #!/bin/sh
 >    lxt -s strip-ids.xsl <$1 >/tmp/t1
 >    lxt -s strip-ids.xsl <$2 >/tmp/t2
 >    lxdiff /tmp/t1 /tmp/t2
 >
 > (lxt is my XSLT processor, lxdiff is my diff program).
 >
 > I assumed that the inputs were specified by filenames, again because
 > the shell doesn't have a way to let me hook up two general inputs, but I
 > could have used numbered file descriptors instead.  This version uses
 > whatever is connected to file descriptors 5 and 6 of the script:
 >
 >    #!/bin/sh
 >    lxt -s strip-ids.xsl <&5 >/tmp/t1
 >    lxt -s strip-ids.xsl <&6 >/tmp/t2
 >    lxdiff /tmp/t1 /tmp/t2
 >
 > In fact bash does provide a syntax for hooking up multiple arguments
 > to pipes, so we could avoid the temporary files:
 >
 >    #!/bin/bash
 >    lxdiff <(lxt -s strip-ids.xsl <&5) <(lxt -s strip-ids.xsl <&6)
 >
 > How could we do this in a pipeline language?  The obvious solution is
 > to name the inputs and outputs, with some simplifying convention for
 > the usual case where there is only one input and output.  But these
 > names are entirely local to the pipeline: they don't have to be
 > globally unique like the temporary files in the shell script example,
 > which will go wrong if two instances are run at once.  If we compare
 > it with the graphical representation, we effectively have to label the
 > lines.  A possible syntax would be:
 >
 >  <pipeline inputs="i1 i2">
 >    <step type="xslt">
 >      <input name="i1"/>
 >      <param name="stylesheet" value="strip-ids.xsl"/>
 >      <output name="o1"/>
 >    </step>
 >    <step type="xslt">
 >      <input name="i2"/>
 >      <param name="stylesheet" value="strip-ids.xsl"/>
 >      <output name="o2"/>
 >    </step>
 >    <step type="diff">
 >      <input name="o1"/>
 >      <input name="o2"/>
 >    </step>
 >  </pipeline>
 >
 > And the simplifying convention would be that if a <step> has no
 > <input> child then its input is the first output of the lexically
 > preceding step, and that inputs and outputs need not be named unless
 > the names are needed (the pipeline and the diff step use this
 > simplification for their output).
 >
 > -- Richard
 >

Received on Thursday, 16 February 2006 23:32:46 UTC