Re: The first five minutes ... a thought experiment (long)

----- Original Message -----
> From: "James Fuller" <jim@webcomposite.com>
> To: "Paul Mensonides" <pmenso57@comcast.net>

> Documents flowing through a pipeline is a fundamental concept in XProc,
> e.g. data flowing through a pipe whose connections to steps are defined
> by bindings. This is a classic data-flow language, though the decision
> in v1 was to only allow XML documents to flow through.
> 
> In XProc vnext we are considering allowing item()* with non-XML
> documents flowing through pipes, which would address your requirement
> (I think).

At this point it hasn't been a requirement, per se, though I have had to fight with it when an XSLT step was generating a Graphviz text file.  I just thought that the XML-only restriction was odd given the semi-shared object model of XML Schema, XSLT, XQuery, and XPath.

> yes explicitly setting up a pipeline removes doubt ... which is
> counter to usability
> 
> we will have a much better defaulting story with optimized syntax
> changes that should address issues in this area in vnext.

To be clear, I have not run into issues with the way that it is now.  The default piping itself is what actually gives me pause from time to time.  For example:

<?xml version="1.0" encoding="UTF-8"?>
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc" version="1.0" xpath-version="2.0">
    <p:serialization port="result"
        version="1.0" omit-xml-declaration="false"
        encoding="UTF-8" indent="true"/>
    <p:xslt name="preprocess" version="2.0">
        <p:input port="stylesheet">
            <p:document href="preprocess.xsl"/>
        </p:input>
    </p:xslt>
    <p:store href="preprocess.xml"
        version="1.0" omit-xml-declaration="false"
        encoding="UTF-8" indent="true"/>
    <p:xslt name="transform" version="2.0">
        <p:input port="source">
            <p:pipe step="preprocess" port="result"/>
        </p:input>
        <p:input port="stylesheet">
            <p:document href="data.xsl"/>
        </p:input>
    </p:xslt>
</p:pipeline>

Here the "preprocess" step implicitly pipes the pipeline's implicit "source" input to its own "source" input.  It also implicitly pipes the pipeline's implicit "parameters" input to its own "parameters" input.

The p:store step implicitly pipes the "preprocess" step's "result" output to its own "source" input.

The "transform" step (apparently) implicitly pipes the pipeline's implicit "parameters" input to its own "parameters" input.  However, it does _not_ implicitly pipe the last "result" output (i.e. from the "preprocess" step) to its own "source" input.  I.e. if I comment out the explicit pipe...

    <p:xslt name="transform" version="2.0">
        <!--<p:input port="source">
            <p:pipe step="preprocess" port="result"/>
        </p:input>-->
        <p:input port="stylesheet">
            <p:document href="data.xsl"/>
        </p:input>
    </p:xslt>

...the pipeline fails.  What it looks like to me is that the "parameters" input of the pipeline can be implicitly piped more than once, but not normal inputs.  (Presumably the actual rule is that a step's primary input defaults to the primary output of the immediately preceding step, and p:store has no primary output, so "transform" is left with nothing to bind its "source" to.)  So I put in the explicit pipe while thinking to myself: this type of branching happens so often that I might as well just specify the bindings myself, because apparently primary input ports do not automatically map to the first available primary output port, and even if they did, that would be brittle.  E.g. the above is basically:

      p:xslt
      /    \
     /      \
  p:store  p:xslt

but if instead I need to add a step before the p:store branch

                        p:xslt
                        /    \
                       /      \
  p:validate-with-xml-schema  p:xslt
                      |
                      |
                   p:store

that would instantly break the previous implicit binding.  The moral of the story is that, for me, a branching pipeline + implicit anything is brittle and error-prone (and in a non-local way).
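
For what it's worth, writing the second diagram with the source bindings spelled out would look roughly like this (a sketch only; the p:validate-with-xml-schema details, such as preprocess.xsd, are made up):

    <p:xslt name="preprocess" version="2.0">
        <p:input port="stylesheet">
            <p:document href="preprocess.xsl"/>
        </p:input>
    </p:xslt>
    <p:validate-with-xml-schema name="validate">
        <p:input port="source">
            <p:pipe step="preprocess" port="result"/>
        </p:input>
        <p:input port="schema">
            <!-- hypothetical schema document -->
            <p:document href="preprocess.xsd"/>
        </p:input>
    </p:validate-with-xml-schema>
    <p:store href="preprocess.xml"/>
    <p:xslt name="transform" version="2.0">
        <p:input port="source">
            <p:pipe step="preprocess" port="result"/>
        </p:input>
        <p:input port="stylesheet">
            <p:document href="data.xsl"/>
        </p:input>
    </p:xslt>

With the bindings explicit, inserting or removing the validation branch has no effect on what "transform" reads.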

> > I haven't run across it yet, but I am worried about the lack of the
> > ability to cache intermediate results in a direct way. Viewing a
> > pipeline as a sort of makefile, running the pipeline is equivalent
> > to a complete rebuild. For the project that I am using to learn all
> > of this stuff, this doesn't matter that much. For the real world
> > projects that I need something like this for, I fear it will
> > potentially be a very large problem, and it may be that I have to
> > have small partial pipelines being invoked via a makefile. The
> > potential benefit of streaming over serialization and infosets (or
> > whatever they are called) versus re-parsing is unrealized in this
> > sort of scenario.
> 
> vnext calls for better logging and debugging of pipelines ... any
> examples past wanting to log output of a step appreciated.

One of the things that I need something like XProc for is to generate a bunch of documentation.  The HTML output ends up being roughly 1000 separate HTML files.  The source data starts with an XML file that is a manifest referencing other source XML files.  One of the required steps is to go through all of these files and generate a lookup table for cross-referencing.  That process may also generate a bootstrapped XML Schema which is imported into the various schemas for the documents themselves.  After that initial step, documents are validated and transformed in various ways (using the lookup table, etc.).  My current setup for this, which dates from the XSLT 1.0 days and uses several Bash scripts (including some generated by XSLT), takes about 5-10 minutes to regenerate the documentation (depending on the computer, of course), and it would be worse if it were also generating LaTeX or (unfamiliar to me) XSL-FO.

So, I would like to be able to avoid rebuilding those things which are not affected by the changes to the input.  Normally, I'd just immediately go to a makefile, but that just uses timestamps.  The potential benefit with using XML is that I should be able to determine whether particular changes *inside* a document actually affect other documents and thus cause them to be recreated.  I.e. with make, the lookup table is created from all of the files, so if any file is changed, everything has to be rebuilt, because everything depends on that lookup table.
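
As a stopgap, the lookup-table generation could probably be split out into its own small pipeline and driven from a makefile, so at least that one expensive artifact is cached (if only by timestamp).  Something along these lines (a sketch only; build-lookup.xsl and lookup.xml are hypothetical names, and the manifest would be supplied on the pipeline's "source" port):

<?xml version="1.0" encoding="UTF-8"?>
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc" version="1.0" xpath-version="2.0">
    <p:serialization port="result" indent="true"/>
    <!-- Builds only the cross-referencing lookup table from the manifest
         (build-lookup.xsl is a hypothetical stylesheet); the makefile
         redirects the "result" output to lookup.xml so that per-document
         pipelines can load it with p:document instead of regenerating it. -->
    <p:xslt version="2.0">
        <p:input port="stylesheet">
            <p:document href="build-lookup.xsl"/>
        </p:input>
    </p:xslt>
</p:pipeline>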

IOW, a pipeline is essentially a build process.  In particular, when it comes to XML data of various kinds, there are several potential benefits over a generic build tool like make.  One is streaming rather than serializing.  This is particularly true if the data flowing between steps does not need to be complete documents (among other things, that gives the pipeline author a way to "help" the system recognize the streamability of various subprocesses).  The other is potentially finer granularity in determining whether a change in a dependency actually affects subsequent downstream steps, and diverting around or skipping them if it does not.  Technically you _can_ do this with a makefile; it just isn't very natural (and it isn't very natural with current XProc either).
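
To make the "skip it if it is unaffected" idea concrete, the closest thing I can see in current XProc is a p:choose guard around the expensive step, e.g. comparing the freshly built lookup table against the one from the previous run (a sketch only: lookup.xml, previous-lookup.xml, and previous-output.xml are hypothetical, it assumes the previous run's files are still on disk, and it assumes the processor allows doc() in XPath expressions, which is implementation-defined):

    <p:choose>
        <p:when test="deep-equal(/, doc('previous-lookup.xml'))">
            <p:xpath-context>
                <p:document href="lookup.xml"/>
            </p:xpath-context>
            <!-- The lookup table is unchanged since the previous run, so
                 reuse the stored result instead of re-running the transform
                 (hypothetical file names throughout). -->
            <p:load href="previous-output.xml"/>
        </p:when>
        <p:otherwise>
            <p:xslt version="2.0">
                <p:input port="stylesheet">
                    <p:document href="data.xsl"/>
                </p:input>
            </p:xslt>
        </p:otherwise>
    </p:choose>

And that is exactly the "isn't very natural" part: the caching, the comparison, and the reuse all have to be hand-rolled around every step.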

Regards,
Paul Mensonides

Received on Wednesday, 19 February 2014 02:48:54 UTC