Annotations for side effects and stability

Hi,

Trying to put my thoughts into order... Given that we annotate 
components/steps to indicate what optimisations a pipeline engine can do, 
what are those annotations going to be and what do they permit the 
pipeline engine to do?

To summarise: I think we need to provide annotations at both the 
component level and the step level, indicating whether the 
component/step has side-effects or not, and how stable it is. We also 
need to be clear about what extra environment information is passed to 
each component.

First, let's consider some possible optimisations that a pipeline engine 
might want to do:

1. Omit the step if its outputs aren't used. For example, p:unused in
this pipeline:

    <p:pipeline>
      <p:output ref="foo" />
      <p:step name="p:unused">
        <p:input href="unused.xml" />
        <p:output label="unused" />
      </p:step>
      <p:step name="p:foo">
        <p:input href="foo.xml" />
        <p:output label="foo" />
      </p:step>
    </p:pipeline>

2. Run the step multiple times. For example, p:reused in this pipeline:

    <p:pipeline>
      <p:output ref="bar" />
      <p:step name="p:reused">
        <p:input href="reused.xml" />
        <p:output label="reused" />
      </p:step>
      <p:step name="p:foo">
        <p:input ref="reused" />
        <p:output label="foo" />
      </p:step>
      <p:step name="p:bar">
        <p:input name="doc1" ref="reused" />
        <p:input name="doc2" ref="foo" />
        <p:output label="bar" />
      </p:step>
    </p:pipeline>

3. Reorder the steps in the pipeline, e.g. parallel execution. For
example, running p:second before, or at the same time as, p:first in
this pipeline:

    <p:pipeline>
      <p:output ref="foo" />
      <p:step name="p:first">
        <p:input href="first.xml" />
        <p:output label="first" />
      </p:step>
      <p:step name="p:second">
        <p:input href="second.xml" />
        <p:output label="second" />
      </p:step>
      <p:step name="p:foo">
        <p:input name="doc1" ref="first" />
        <p:input name="doc2" ref="second" />
        <p:output label="foo" />
      </p:step>
    </p:pipeline>

4. Use cached results of the component invoked in the same way in the
same pipeline invocation. For example, using 'copy1' rather than 'copy2'
in the p:foo step in this pipeline:

    <p:pipeline>
      <p:output ref="foo" />
      <p:step name="p:copy">
        <p:input href="copy.xml" />
        <p:output label="copy1" />
      </p:step>
      <p:step name="p:copy">
        <p:input href="copy.xml" />
        <p:output label="copy2" />
      </p:step>
      <p:step name="p:foo">
        <p:input name="doc1" ref="copy1" />
        <p:input name="doc2" ref="copy2" />
        <p:output label="foo" />
      </p:step>
    </p:pipeline>

5. Use cached results of the component invoked in the same way in a
different pipeline invocation. For example, cache the 'foo' document in
this pipeline and reuse it the next time the pipeline is invoked,
assuming that foo.xml hasn't changed in the meantime:

    <p:pipeline>
      <p:output ref="foo" />
      <p:step name="p:foo">
        <p:input href="foo.xml" />
        <p:output label="foo" />
      </p:step>
    </p:pipeline>

I think there are two things that affect which of these optimisations
can be carried out:

A. Whether the step has side effects: it does something other than
generating the outputs defined for the step. Updating a database is
an example.

B. Whether the step uses information other than the inputs and
parameters (and invocation environment, whatever we decide that is) to
determine the output. There are three levels to this:

   - unstable steps
   - steps that are stable within a particular pipeline invocation
   - steps that are stable between pipeline invocations

There might not be any distinction between stability within and between
pipeline invocations: it really depends on what the invocation
environment is, and indeed whether there is one at all -- what extra 
information gets passed to the components aside from the inputs and 
parameters? For example, if the pipeline engine acts as a resource 
manager, providing a URI/document mapping, then a step that accesses a 
web page with stock price information would be stable within a 
particular pipeline invocation (because the same document would always 
be returned for the URI) but not between invocations. Similarly, if the 
invocation environment includes the current date and time, then XSLT and 
Timestamp components would be expected to use that date/time, assigned 
at the point the pipeline was invoked rather than at the point the step 
was run.
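
To make the resource-manager idea concrete, here is a minimal sketch in
Python. It is only an illustration under my own assumptions: the
InvocationEnvironment class, its fetch callback and its resolve method
are hypothetical names, not part of any proposal. The environment fixes
a timestamp when the pipeline is invoked and remembers the first
document retrieved for each URI, so a step reading that URI is stable
within the invocation but not between invocations.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Dict


@dataclass
class InvocationEnvironment:
    """Hypothetical per-invocation context passed to every component."""

    # How a URI is really retrieved (network, file system, ...).
    fetch: Callable[[str], str]
    # Date/time assigned once, when the pipeline is invoked.
    started: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    # URI/document mapping maintained by the engine for this invocation.
    _documents: Dict[str, str] = field(
        default_factory=dict, init=False, repr=False
    )

    def resolve(self, uri: str) -> str:
        # The first access retrieves the document; later accesses in the
        # same invocation always see that same copy.
        if uri not in self._documents:
            self._documents[uri] = self.fetch(uri)
        return self._documents[uri]
```

A second InvocationEnvironment created for the next pipeline run starts
with an empty mapping, so it may see a different document for the same
URI.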

Steps with side-effects and unstable steps have fixed relationships with
each other: if a step has side-effects or is unstable then all steps
that appear before it in the pipeline definition and that either have
side-effects or are unstable must run before it, and all such steps that
appear after it in the pipeline definition must run after it. When these
steps occur within an iteration, then the iteration must be done in
order. Of course, we could provide mechanisms to indicate exactly which 
steps rely on which others, such as a depends attribute or anonymous 
inputs and outputs. Stable steps without side effects can be reordered 
as desired.
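
The implicit ordering rule could be sketched like this in Python. The
Step class and its side_effects and stable flags are hypothetical
stand-ins for whatever annotations we end up with, and the sketch only
covers the implicit document-order constraints, not the ordinary data
dependencies between inputs and outputs.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    """Hypothetical annotated step, listed in pipeline-definition order."""

    name: str
    side_effects: bool = False
    stable: bool = True


def is_pinned(step: Step) -> bool:
    # Steps with side effects and unstable steps keep their relative order.
    return step.side_effects or not step.stable


def ordering_constraints(steps: List[Step]) -> List[Tuple[str, str]]:
    """Return (earlier, later) pairs that the engine must respect.

    Chaining consecutive pinned steps is enough: it implies the ordering
    between every pair of pinned steps.
    """
    pinned = [s.name for s in steps if is_pinned(s)]
    return list(zip(pinned, pinned[1:]))
```

Steps that never appear in the returned pairs are the stable,
side-effect-free ones, which the engine can move around or run in
parallel subject only to their inputs being available.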

If a step has side-effects then it must be run exactly once. Steps
without side-effects can be omitted or run several times, though if an
unstable step is run multiple times then all but the first invocation
must be ignored, as the result might be different each time.
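
As a rough summary of that paragraph, here is a small Python sketch.
The function name and the keys of the returned dictionary are my own
invention, chosen just to spell out the rules above.

```python
from typing import Dict


def execution_rules(side_effects: bool, stable: bool) -> Dict[str, bool]:
    """Summarise what an engine may do with a step, given two annotations."""
    return {
        # A step with side effects must run exactly once, so it can be
        # neither omitted nor repeated.
        "may_omit": not side_effects,
        "may_run_again": not side_effects,
        # If an unstable step is run more than once, only the outputs of
        # the first run may be used, since later runs might differ.
        "use_first_result_only": not side_effects and not stable,
    }
```

For instance, execution_rules(side_effects=False, stable=True) allows
both omission and repetition with no restriction on which result is
used.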

The outputs of stable steps can be cached and reused. As long as the step
is stable, its outputs are still cacheable even if it has side effects: the 
pipeline engine has to run the step anyway, and can't get on with other 
steps until it's finished, but could possibly glean some performance 
benefit from reusing outputs if it meant it didn't have to re-parse a 
large XML document, for example. An example of a stable step with side 
effects is one that takes an XML document, updates a database with the 
data it contains, and returns the same XML document as the result.
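
One way an engine might organise such a cache is sketched below in
Python. Everything here is an assumption for illustration: the
Stability enum mirrors the three levels listed earlier, cache_key
treats inputs and parameters as simple name/value strings, and
StepCache keeps one store that lasts for a single invocation and one
that persists between invocations.

```python
import hashlib
from enum import Enum
from typing import Dict, Optional


class Stability(Enum):
    UNSTABLE = "unstable"
    WITHIN_INVOCATION = "within"
    BETWEEN_INVOCATIONS = "between"


def cache_key(component: str, inputs: Dict[str, str],
              parameters: Dict[str, str]) -> str:
    # "Invoked in the same way" = same component, inputs and parameters.
    material = repr((component, sorted(inputs.items()),
                     sorted(parameters.items())))
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


class StepCache:
    def __init__(self) -> None:
        self._between: Dict[str, str] = {}   # survives across invocations
        self._within: Dict[str, str] = {}    # cleared for each invocation

    def start_invocation(self) -> None:
        self._within = {}

    def store(self, stability: Stability, key: str, output: str) -> None:
        if stability is Stability.BETWEEN_INVOCATIONS:
            self._between[key] = output
        elif stability is Stability.WITHIN_INVOCATION:
            self._within[key] = output
        # Outputs of unstable steps are never cached.

    def lookup(self, stability: Stability, key: str) -> Optional[str]:
        if stability is Stability.BETWEEN_INVOCATIONS:
            return self._between.get(key)
        if stability is Stability.WITHIN_INVOCATION:
            return self._within.get(key)
        return None
```

Note that a cache hit only lets the engine skip a step that has no side
effects; a stable step with side effects still has to be run, and the
cache merely saves rebuilding its outputs.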

Whether or not a step has side-effects and how stable it is depends on 
both the component that runs the step and the way the step itself is set 
up. Taking XSLT 1.0 as an example: if an XSLT 1.0 component is defined 
as only having one output (the result document it generates) then it 
would have to be classified as (potentially) having side effects, since 
there's no output to capture messages generated with <xsl:message>. If 
the component were defined with an output for messages, then it would be 
side-effect-free. But the stylesheet used in a particular step also 
determines the classification of the step: a stylesheet that didn't 
contain any <xsl:message> instructions would be side-effect free however 
the component were defined; a stylesheet that included extensions that 
carried out database updates would not be side-effect free.
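
The XSLT 1.0 example boils down to a small decision, sketched here in
Python. The function and its three boolean arguments are hypothetical;
in practice the engine would have to get the first from the component
definition and the other two from the step (or from inspecting the
stylesheet).

```python
def xslt_step_has_side_effects(component_has_message_output: bool,
                               stylesheet_uses_message: bool,
                               stylesheet_uses_updating_extensions: bool) -> bool:
    """Classify one XSLT step from its component and its stylesheet."""
    # Extensions that update a database are side effects however the
    # component is defined.
    if stylesheet_uses_updating_extensions:
        return True
    # Messages are a side effect only when the component has no output
    # to capture them.
    if stylesheet_uses_message and not component_has_message_output:
        return True
    return False
```

Without step-level information the engine would have to assume the
worst case for the component, which is why I think annotations are
needed at both levels.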

Cheers,

Jeni
-- 
Jeni Tennison
http://www.jenitennison.com

Received on Monday, 24 April 2006 10:21:50 UTC