Re: Annotations for side effects and stability

/ Jeni Tennison <jeni@jenitennison.com> was heard to say:
| 1. Omit the step if its outputs aren't used. For example, p:unused in
| this pipeline:
|
|    <p:pipeline>
|      <p:output ref="foo" />
|      <p:step name="p:unused">
|        <p:input href="unused.xml" />
|        <p:output label="unused" />
|      </p:step>
|      <p:step name="p:foo">
|        <p:input href="foo.xml" />
|        <p:output label="foo" />
|      </p:step>
|    </p:pipeline>

We could make that an error.
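
Whether it ends up an error or an optimisation, the engine has to spot
the unused step first. A rough sketch in Python, assuming a made-up dict
form of the pipeline where each step lists the labels it reads and the
labels it writes (the layout and the name unused_steps are illustrative
only, not proposed syntax; href inputs are external and not tracked):

```python
def unused_steps(steps, pipeline_outputs):
    """Return the names of steps that no pipeline output depends on."""
    # Hypothetical model: label -> name of the step that produces it.
    producer = {label: name
                for name, step in steps.items()
                for label in step["outputs"]}
    needed, stack = set(), list(pipeline_outputs)
    while stack:
        name = producer[stack.pop()]
        if name not in needed:
            needed.add(name)
            stack.extend(steps[name]["inputs"])  # walk back along the refs
    return sorted(set(steps) - needed)

steps = {
    "p:unused": {"inputs": [], "outputs": ["unused"]},
    "p:foo": {"inputs": [], "outputs": ["foo"]},
}
print(unused_steps(steps, ["foo"]))  # ['p:unused']
```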

| 2. Run the step multiple times. For example, p:reused in this pipeline:
|
|    <p:pipeline>
|      <p:output ref="bar" />
|      <p:step name="p:reused">
|        <p:input href="reused.xml" />
|        <p:output label="reused" />
|      </p:step>
|      <p:step name="p:foo">
|        <p:input ref="reused" />
|        <p:output label="foo" />
|      </p:step>
|      <p:step name="p:bar">
|        <p:input name="doc1" ref="reused" />
|        <p:input name="doc2" ref="foo" />
|        <p:output label="bar" />
|      </p:step>
|    </p:pipeline>

I think that's either an error or should be defined as a syntactic
shortcut for

    <p:pipeline>
      <p:output ref="bar" />
      <p:step name="p:reused">
        <p:input href="reused.xml" />
        <p:output label="reused-tee" />
      </p:step>

      <p:step name="p:tee">
        <p:input ref="reused-tee"/>
        <p:output label="t1"/>
        <p:output label="t2"/>
      </p:step>

      <p:step name="p:foo">
        <p:input ref="t1" />
        <p:output label="foo" />
      </p:step>
      <p:step name="p:bar">
        <p:input name="doc1" ref="t2" />
        <p:input name="doc2" ref="foo" />
        <p:output label="bar" />
      </p:step>
    </p:pipeline>

| 3. Reorder the steps in the pipeline, e.g. parallel execution. For
| example, running p:second before, or at the same time as, p:first in
| this pipeline:
|
|    <p:pipeline>
|      <p:output ref="foo" />
|      <p:step name="p:first">
|        <p:input href="first.xml" />
|        <p:output label="first" />
|      </p:step>
|      <p:step name="p:second">
|        <p:input href="second.xml" />
|        <p:output label="second" />
|      </p:step>
|      <p:step name="p:foo">
|        <p:input name="doc1" ref="first" />
|        <p:input name="doc2" ref="second" />
|        <p:output label="foo" />
|      </p:step>
|    </p:pipeline>

Right. The "flow graph" for this pipeline has no input/output
connection between p:first and p:second so I think a pipeline engine
should have complete freedom to choose the execution order.

I think there are going to be cases where steps have dependencies that
can't conveniently be described in terms of input/output connections,
so I think we'll need a way of expressing other connections. So far,
the ability to describe other resources (URIs) that are produced or
consumed covers all the cases I can think of. Note that authors can
use this feature to establish an arbitrary execution order.

Note also that a "resource dependency" blocks streaming, AFAICS. If
step "A" produces output that step "B" reads as input, you can stream
from A to B. But if "A" produces an auxiliary resource "uri-a" that "B"
consumes, I think you have to run A to completion before you can start
B.
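
To make the ordering point concrete, here is a minimal sketch in Python.
It assumes a made-up dict form of the pipeline in which "produces" and
"consumes" stand in for the resource (URI) declarations; those keys and
the helper names are hypothetical. It models ordering only, not
streaming, and uses graphlib from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def dependency_graph(steps):
    """Map each step to the set of steps that must finish before it."""
    made_by = {}
    for name, step in steps.items():
        for thing in step.get("outputs", []) + step.get("produces", []):
            made_by[thing] = name
    return {
        name: {made_by[t]
               for t in step.get("inputs", []) + step.get("consumes", [])}
        for name, step in steps.items()
    }

def ready_first(steps):
    """Steps the engine is free to start with, in any order or in parallel."""
    sorter = TopologicalSorter(dependency_graph(steps))
    sorter.prepare()
    return set(sorter.get_ready())

# Input/output connections only: nothing links p:first and p:second.
flow_only = {
    "p:first": {"outputs": ["first"]},
    "p:second": {"outputs": ["second"]},
    "p:foo": {"inputs": ["first", "second"], "outputs": ["foo"]},
}
print(ready_first(flow_only))  # both p:first and p:second are ready

# Same pipeline, but p:second consumes a resource that p:first produces.
with_resource = dict(flow_only)
with_resource["p:first"] = {"outputs": ["first"], "produces": ["uri-a"]}
with_resource["p:second"] = {"outputs": ["second"], "consumes": ["uri-a"]}
order = list(TopologicalSorter(dependency_graph(with_resource)).static_order())
print(order)  # ['p:first', 'p:second', 'p:foo']
```

With only the flow graph, both leading steps are ready at once; adding
the resource edge leaves exactly one legal order.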

| 4. Use cached results of the component invoked in the same way in the
| same pipeline invocation. For example, using 'copy1' rather than 'copy2'
| in the p:foo step in this pipeline:
|
|    <p:pipeline>
|      <p:output ref="foo" />
|      <p:step name="p:copy">
|        <p:input href="copy.xml" />
|        <p:output label="copy1" />
|      </p:step>
|      <p:step name="p:copy">
|        <p:input href="copy.xml" />
|        <p:output label="copy2" />
|      </p:step>
|      <p:step name="p:foo">
|        <p:input name="doc1" ref="copy1" />
|        <p:input name="doc2" ref="copy2" />
|        <p:output label="foo" />
|      </p:step>
|    </p:pipeline>

At the component level, if we assume that authors can use resource
dependencies to force an execution order, the remaining issue is
re-execution. I think it's reasonable for components to simply
indicate whether or not they're idempotent. XInclude is; a web service
component isn't.

If p:copy is idempotent, the pipeline engine can skip the second
execution and make copy2 an alias for copy1. Otherwise, it must
execute p:copy twice and the results are whatever they are.
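
A minimal sketch of that rule in Python, assuming a hypothetical
per-component "idempotent" flag and a cache keyed on the component name
plus its input href (the data layout, run_pipeline and the toy copy
function are all invented for illustration):

```python
def run_pipeline(steps, components):
    """Run steps in order, aliasing repeat calls to idempotent components."""
    results, cache = {}, {}
    for step in steps:
        comp = components[step["component"]]
        key = (step["component"], step["href"])
        if comp["idempotent"] and key in cache:
            results[step["label"]] = cache[key]  # alias, no second execution
            continue
        out = comp["run"](step["href"])
        results[step["label"]] = out
        if comp["idempotent"]:
            cache[key] = out
    return results

calls = []  # records every real execution of the toy component

def copy(href):
    calls.append(href)
    return f"<copy of {href} #{len(calls)}>"

steps = [
    {"component": "p:copy", "href": "copy.xml", "label": "copy1"},
    {"component": "p:copy", "href": "copy.xml", "label": "copy2"},
]
out = run_pipeline(steps, {"p:copy": {"idempotent": True, "run": copy}})
print(len(calls), out["copy1"] is out["copy2"])  # 1 True
```

Flip the flag to False and the same pipeline calls copy twice, with
copy1 and copy2 holding whatever each call returned.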

| 5. Use cached results of the component invoked in the same way in a
| different pipeline invocation. For example, cache the 'foo' document in
| this pipeline and reuse it the next time the pipeline is invoked,
| assuming that foo.xml hasn't changed in the meantime:
|
|    <p:pipeline>
|      <p:output ref="foo" />
|      <p:step name="p:foo">
|        <p:input href="foo.xml" />
|        <p:output label="foo" />
|      </p:step>
|    </p:pipeline>

I really want to call that "out of scope".

| I think there are two things that affect which of these optimisations
| can be carried out:
|
| A. Whether the step has side effects: it does something other than
| generating the outputs defined for the step. Updating a database is
| an example.
|
| B. Whether the step uses information other than the inputs and
| parameters (and invocation environment, whatever we decide that is) to
| determine the output. There are three levels to this:
|
|   - unstable steps
|   - steps that are stable within a particular pipeline invocation
|   - steps that are stable between pipeline invocations
|
| There might not be any distinction between stability within and between
| pipeline invocations: it really depends on what the invocation
| environment is, and indeed whether there is one at all -- what extra
| information gets passed to the components aside from the inputs and
| parameters? For example, if the pipeline engine acts as a resource
| manager, providing a URI/document mapping, then a step that accesses a
| web page with stock price information would be stable within a
| particular pipeline invocation (because the same document would always
| be returned for the URI) but not between invocations. Similarly, if
| the invocation environment includes the current date and time, then
| XSLT and Timestamp components would be expected to use that date/time,
| assigned at the point the pipeline was invoked rather than at the
| point the step was run.
|
| Steps with side-effects and unstable steps have fixed relationships with
| each other: if a step has side-effects or is unstable then all steps
| that appear before it in the pipeline definition and that either have
| side-effects or are unstable must run before it, and all such steps that
| appear after it in the pipeline definition must run after it. When these
| steps occur within an iteration, then the iteration must be done in
| order. Of course, we could provide mechanisms to indicate exactly
| which steps rely on which others, such as a depends attribute or
| anonymous inputs and outputs. Stable steps without side effects can be
| reordered as desired.

I want to keep this as simple as possible. I'd be happier in V1 saying
that a pipeline engine must always do the safe thing (as you describe
above) rather than giving authors the ability to express finer-grained
dependencies.
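
For what it's worth, the safe rule looks cheap to implement. A sketch
in Python, assuming each step carries two hypothetical flags,
"side_effects" and "stable" (the flags, the step names and the function
name are all made up for the example): chain the steps that have side
effects or are unstable in document order, and leave every other step
free to move.

```python
def safe_order_edges(steps):
    """Return (earlier, later) pairs that pin the 'unsafe' steps in document order."""
    unsafe = [s["name"] for s in steps
              if s["side_effects"] or not s["stable"]]
    # Chaining neighbours is enough: the ordering is transitive.
    return list(zip(unsafe, unsafe[1:]))

steps = [
    {"name": "p:log", "side_effects": True, "stable": True},
    {"name": "p:xinclude", "side_effects": False, "stable": True},
    {"name": "p:fetch-quote", "side_effects": False, "stable": False},
    {"name": "p:update-db", "side_effects": True, "stable": True},
]
print(safe_order_edges(steps))
# [('p:log', 'p:fetch-quote'), ('p:fetch-quote', 'p:update-db')]
```

The stable, side-effect-free step (p:xinclude here) gets no extra edge,
so only its input/output connections constrain where it runs.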

| If a step has side-effects then it must be run exactly once. Steps
| without side-effects can be omitted or run several times, though if an
| unstable step is run multiple times then all but the first invocation
| must be ignored, as the result might be different each time.
|
| The outputs of stable steps can be cached and reused. As long as it's
| stable, the outputs of a step with side effects are still cachable:
| the pipeline engine has to run the step anyway, and can't get on with
| other steps until it's finished, but could possibly glean some
| performance benefit from reusing outputs if it meant it didn't have to
| re-parse a large XML document, for example. An example of a stable
| step with side effects is one that takes an XML document, updates a
| database with the data it contains, and returns the same XML document
| as the result.
|
| Whether or not a step has side-effects and how stable it is depends on
| both the component that runs the step and the way the step itself is
| set up. Taking XSLT 1.0 as an example: if an XSLT 1.0 component is
| defined as only having one output (the result document it generates)
| then it would have to be classified as (potentially) having side
| effects, since there's no output to capture messages generated with
| <xsl:message>. If the component were defined with an output for
| messages, then it would be side-effect-free. But the stylesheet used
| in a particular step also determines the classification of the step: a
| stylesheet that didn't contain any <xsl:message> instructions would be
| side-effect free however the component were defined; a stylesheet that
| included extensions that carried out database updates would not be
| side-effect free.

Indeed.

                                        Be seeing you,
                                          norm

-- 
Norman Walsh
XML Standards Architect
Sun Microsystems, Inc.

Received on Monday, 24 April 2006 15:02:49 UTC