Re: A "processing model" proposal

Here is my "diff" use case that I talked about.

Suppose we have two XML documents that we want to compare.  But these
documents have some irrelevant features - automatically assigned id
attributes, say - that we don't want to count as differences.  So we
run each file through an XSLT stylesheet to strip out those features
before running diff.

Now with a graphical interface it would be easy to draw the pipeline
we want.  There would be two lines coming in at the top.  Each would
go to a box (a step) consisting of an XSLT transform with a parameter
specifying the stylesheet.  Each XSLT box would have a line coming
out, and the two lines would go down to a diff box, which would have
one line coming out at the bottom.

Writing this pipeline as a Unix shell script is straightforward but
ugly, because the plain Bourne shell gives us no way to feed one command
two inputs from other programs, so we have to use temporary files:

   #!/bin/sh
   lxt -s strip-ids.xsl <$1 >/tmp/t1
   lxt -s strip-ids.xsl <$2 >/tmp/t2
   lxdiff /tmp/t1 /tmp/t2

(lxt is my XSLT processor, lxdiff is my diff program).
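
The fixed names /tmp/t1 and /tmp/t2 are shared by every run of the
script.  A minimal sketch of the same pipeline with per-run temporary
files follows.  It assumes a mktemp command is available; the function
name strip_and_diff is made up for illustration, and lxt and lxdiff are
used exactly as above:

```shell
#!/bin/sh
# Sketch only: the same three commands, but each run gets its own
# temporary files.  strip_and_diff is a hypothetical name; lxt and
# lxdiff are the tools from the script above.
strip_and_diff() {
    t1=$(mktemp) || return 2
    t2=$(mktemp) || { rm -f "$t1"; return 2; }
    status=0
    {
        lxt -s strip-ids.xsl <"$1" >"$t1" &&
        lxt -s strip-ids.xsl <"$2" >"$t2" &&
        lxdiff "$t1" "$t2"
    } || status=$?
    # Clean up whether or not the comparison succeeded.
    rm -f "$t1" "$t2"
    return $status
}
```

It would be called as "strip_and_diff old.xml new.xml", and returns
whatever status the first failing command (or lxdiff) returned.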

I assumed that the inputs were specified by filenames, again because
the shell doesn't have a way to let me hook up two general inputs, but I
could have used numbered file descriptors instead.  This version uses
whatever is connected to file descriptors 5 and 6 of the script:

   #!/bin/sh
   lxt -s strip-ids.xsl <&5 >/tmp/t1
   lxt -s strip-ids.xsl <&6 >/tmp/t2
   lxdiff /tmp/t1 /tmp/t2
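
For readers who have not used numbered descriptors: the caller
connects them with ordinary redirections such as 5<file.  The sketch
below shows only that calling convention.  read_both is a hypothetical
stand-in for the script, which reads descriptors 5 and 6 the way the
lxt commands do but simply copies them to standard output:

```shell
#!/bin/sh
# Hypothetical stand-in for the script above: it reads descriptors 5
# and 6, but only echoes what it finds there.
read_both() {
    printf 'first:  %s\n' "$(cat <&5)"
    printf 'second: %s\n' "$(cat <&6)"
}

# The caller decides what 5 and 6 are connected to.  Here-documents
# keep the example self-contained; files (5<a.xml 6<b.xml) work the
# same way.
read_both 5<<EOF1 6<<EOF2
<doc id="a1"/>
EOF1
<doc id="b7"/>
EOF2
```

The script itself never learns whether a descriptor was connected to
a file or to something else; it just reads from the number.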

In fact bash does provide a syntax (process substitution) for hooking
up several arguments to pipes, so we could avoid the temporary files:

   #!/bin/bash
   lxdiff <(lxt -s strip-ids.xsl <&5) <(lxt -s strip-ids.xsl <&6)

How could we do this in a pipeline language?  The obvious solution is
to name the inputs and outputs, with some simplifying convention for
the usual case where there is only one input and one output.  These
names are entirely local to the pipeline: they don't have to be
globally unique like the temporary files in the shell script example,
which will go wrong if two instances are run at once.  In terms of the
graphical representation, we are effectively labelling the lines.  A
possible syntax would be:

 <pipeline inputs="i1 i2">
   <step type="xslt">
     <input name="i1"/>
     <param name="stylesheet" value="strip-ids.xsl"/>
     <output name="o1"/>
   </step>
   <step type="xslt">
     <input name="i2"/>
     <param name="stylesheet" value="strip-ids.xsl"/>
     <output name="o2"/>
   </step>
   <step type="diff">
     <input name="o1"/>
     <input name="o2"/>
   </step>
 </pipeline>

And the simplifying convention would be that if a <step> has no
<input> child then its input is the first output of the lexically
preceding step, and that inputs and outputs need not be named unless
the names are needed (in the example above, the pipeline and the diff
step rely on this and leave their outputs unnamed).

-- Richard

Received on Thursday, 16 February 2006 17:47:24 UTC