Split and eval, the case for arbitrary numbers of outputs from Norman Walsh on 2012-04-26 (public-xml-processing-model-wg@w3.org from April 2012)

From: Norman Walsh <ndw@nwalsh.com>
Date: Thu, 26 Apr 2012 09:31:41 -0400
To: public-xml-processing-model-wg@w3.org
Message-ID: <m2ehrar54y.fsf@nwalsh.com>

Per my action from last week...

Part of my plan for (re)implementing my XProc processor involves performing
more aggressive graph analysis. This has two benefits: first, I'll be able
to establish thread boundaries and do multi-threaded processing; second,
I'll be able to identify (sub)pipelines that can be streamed.

In order to make the graph more amenable to this sort of streaming and
rewriting, I'm transforming the user's pipeline into something with
explicit steps for actions like splitting.

Consider this pipeline fragment:

  <p:identity name="root"/>

  <p:identity name="branch1">
    <p:input port="source">
      <p:pipe step="root" port="result"/>
    </p:input>
  </p:identity>

  <p:identity name="branch2">
    <p:input port="source">
      <p:pipe step="root" port="result"/>
    </p:input>
  </p:identity>

The two identity steps branch1 and branch2 both read from the same
"result" port on the "root" step. At an implementation level that requires
some sort of buffering or copying. I want to make that explicit, so
I'm introducing an explicit split step:

  <p:identity name="root"/>

  <internal:split name="ID00001"/>

  <p:identity name="branch1">
    <p:input port="source">
      <p:pipe step="ID00001" port="result1"/>
    </p:input>
  </p:identity>

  <p:identity name="branch2">
    <p:input port="source">
      <p:pipe step="ID00001" port="result2"/>
    </p:input>
  </p:identity>

So what's the declaration for the internal:split step? It's something
like this:

  <p:declare-step type="internal:split">
    <p:input port="source" sequence="true" primary="true"/>
    <p:output port="result1" sequence="true" primary="false"/>
    <p:output port="result2" sequence="true" primary="false"/>
  </p:declare-step>

And I could declare internal:split2, internal:split3, etc. steps. But
really this is just a magic step with an arbitrary number of output
ports.
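
As a rough illustration of what such a "magic" split does, here is a
minimal Python sketch. It is not how any XProc processor implements it:
the function name split, the n_outputs parameter, and the use of plain
strings as documents are all assumptions made for the example. Only the
result1, result2, ... port naming follows the declaration above.

```python
from typing import Dict, Iterable, List


def split(source: Iterable[str], n_outputs: int) -> Dict[str, List[str]]:
    """Copy every document in `source` to ports result1..resultN.

    Hypothetical helper: documents are modelled as plain strings.
    """
    if n_outputs < 1:
        raise ValueError("a split needs at least one output port")
    # Buffer once, because the input may be a one-shot stream.
    buffered = list(source)
    # Give each port its own list so one reader cannot disturb another.
    return {f"result{i}": list(buffered) for i in range(1, n_outputs + 1)}


ports = split(["<doc>a</doc>", "<doc>b</doc>"], 3)
print(sorted(ports))
```

The number of output ports is an argument here, which is exactly what a
fixed p:declare-step cannot express.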

The same problem exists if you want to write an eval step:

  <p:declare-step type="cx:eval">
    <p:input port="pipeline"/>
    <p:input port="source" sequence="true"/>
    <p:input port="options"/>
    <p:output port="result"/>
    <p:option name="step" cx:type="xsd:QName"/>
    <p:option name="detailed" cx:type="xsd:boolean"/>
  </p:declare-step>

This is a step that takes *an XML pipeline document* as its input,
compiles it, and runs it. The problem, of course, is that the number
of inputs and outputs that this step needs is determined entirely by
the input pipeline, which isn't known at compile time and may actually
be different on every invocation.

I work around this in XML Calabash by encoding the multiple inputs
and outputs into a single document. That works (sort of) for XProc 1.0
because the documents all have to be XML. It won't work at all if
we allow non-XML documents.

(No, a sequence of inputs and outputs isn't sufficient because you
have to be able to map sequences of inputs and outputs to different
port names.)
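
A small Python sketch of why a flat sequence loses information. The
port names "source" and "stylesheet" and the flatten helper are
hypothetical, chosen only for the example: two different port bindings
collapse to the same sequence, so the receiving pipeline cannot tell
which documents belong to which port.

```python
from typing import Dict, List


def flatten(bindings: Dict[str, List[str]]) -> List[str]:
    """Concatenate the documents of all ports, in port order."""
    return [doc for port in bindings for doc in bindings[port]]


# Two different ways of binding the same three documents to two ports.
binding_a = {"source": ["<a/>", "<b/>"], "stylesheet": ["<c/>"]}
binding_b = {"source": ["<a/>"], "stylesheet": ["<b/>", "<c/>"]}

# Once flattened, the two bindings are indistinguishable.
same = flatten(binding_a) == flatten(binding_b)
print(same)
```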

                                        Be seeing you,
                                          norm

-- 
Norman Walsh
Lead Engineer
MarkLogic Corporation
Phone: +1 413 624 6676
www.marklogic.com
Received on Thursday, 26 April 2012 13:32:19 UTC