- From: Norman Walsh <Norman.Walsh@Sun.COM>
- Date: Mon, 24 Apr 2006 11:02:17 -0400
- To: public-xml-processing-model-wg@w3.org
- Message-ID: <877j5f6mpy.fsf@nwalsh.com>
/ Jeni Tennison <jeni@jenitennison.com> was heard to say:
| 1. Omit the step if its outputs aren't used. For example, p:unused in
| this pipeline:
|
| <p:pipeline>
|   <p:output ref="foo" />
|   <p:step name="p:unused">
|     <p:input href="unused.xml" />
|     <p:output label="unused" />
|   </p:step>
|   <p:step name="p:foo">
|     <p:input href="foo.xml" />
|     <p:output label="foo" />
|   </p:step>
| </p:pipeline>

We could make that an error.

| 2. Run the step multiple times. For example, p:reused in this pipeline:
|
| <p:pipeline>
|   <p:output ref="bar" />
|   <p:step name="p:reused">
|     <p:input href="reused.xml" />
|     <p:output label="reused" />
|   </p:step>
|   <p:step name="p:foo">
|     <p:input ref="reused" />
|     <p:output label="foo" />
|   </p:step>
|   <p:step name="p:bar">
|     <p:input name="doc1" ref="reused" />
|     <p:input name="doc2" ref="foo" />
|     <p:output label="bar" />
|   </p:step>
| </p:pipeline>

I think that's either an error or should be defined as a syntactic
shortcut for

<p:pipeline>
  <p:output ref="bar" />
  <p:step name="p:reused">
    <p:input href="reused.xml" />
    <p:output label="reused-tee" />
  </p:step>
  <p:step name="p:tee">
    <p:input ref="reused-tee"/>
    <p:output label="t1"/>
    <p:output label="t2"/>
  </p:step>
  <p:step name="p:foo">
    <p:input ref="t1" />
    <p:output label="foo" />
  </p:step>
  <p:step name="p:bar">
    <p:input name="doc1" ref="t2" />
    <p:input name="doc2" ref="foo" />
    <p:output label="bar" />
  </p:step>
</p:pipeline>

| 3. Reorder the steps in the pipeline, e.g. parallel execution. For
| example, running p:second before, or at the same time as, p:first in
| this pipeline:
|
| <p:pipeline>
|   <p:output ref="foo" />
|   <p:step name="p:first">
|     <p:input href="first.xml" />
|     <p:output label="first" />
|   </p:step>
|   <p:step name="p:second">
|     <p:input href="second.xml" />
|     <p:output label="second" />
|   </p:step>
|   <p:step name="p:foo">
|     <p:input name="doc1" ref="first" />
|     <p:input name="doc2" ref="second" />
|     <p:output label="foo" />
|   </p:step>
| </p:pipeline>

Right.
The "flow graph" for this pipeline has no input/output connection
between p:first and p:second, so I think a pipeline engine should have
complete freedom to choose the execution order.

I think there are going to be cases where steps have dependencies that
can't conveniently be described in terms of input/output connections,
so I think we'll need a way of expressing other connections. So far,
the ability to describe other resources (URIs) that are produced or
consumed covers all the cases I can think of.

Note that authors can use this feature to establish an arbitrary
execution order.

Note also that a "resource dependency" blocks streaming, AFAICS. If
step "A" produces output that step "B" inputs, you can stream across A
to B. But if "A" produces an auxiliary resource "uri-a" that "B"
consumes, I think you have to run A to completion before you can start
B.

| 4. Use cached results of the component invoked in the same way in the
| same pipeline invocation. For example, using 'copy1' rather than 'copy2'
| in the p:foo step in this pipeline:
|
| <p:pipeline>
|   <p:output ref="foo" />
|   <p:step name="p:copy">
|     <p:input href="copy.xml" />
|     <p:output label="copy1" />
|   </p:step>
|   <p:step name="p:copy">
|     <p:input href="copy.xml" />
|     <p:output label="copy2" />
|   </p:step>
|   <p:step name="p:foo">
|     <p:input name="doc1" ref="copy1" />
|     <p:input name="doc2" ref="copy2" />
|     <p:output label="foo" />
|   </p:step>
| </p:pipeline>

From a component level, if we assume that authors can use the resource
dependencies to force an execution order, the remaining issue is
re-execution.

I think it's reasonable for components to simply indicate whether or
not they're idempotent. XInclude is; a web service component isn't.

If p:copy is idempotent, the pipeline engine can skip the second
execution and make copy2 an alias for copy1. Otherwise, it must execute
p:copy twice and the results are whatever they are.

| 5. Use cached results of the component invoked in the same way in a
| different pipeline invocation. For example, cache the 'foo' document in
| this pipeline and reuse it the next time the pipeline is invoked,
| assuming that foo.xml hasn't changed in the meantime:
|
| <p:pipeline>
|   <p:output ref="foo" />
|   <p:step name="p:foo">
|     <p:input href="foo.xml" />
|     <p:output label="foo" />
|   </p:step>
| </p:pipeline>

I really want to call that "out of scope".

| I think there are two things that affect which of these optimisations
| can be carried out:
|
| A. Whether the step has side effects: it does something other than
|    generating the outputs defined for the step. Updating a database is
|    an example.
|
| B. Whether the step uses information other than the inputs and
|    parameters (and invocation environment, whatever we decide that is) to
|    determine the output. There are three levels to this:
|
|    - unstable steps
|    - steps that are stable within a particular pipeline invocation
|    - steps that are stable between pipeline invocations
|
|    There might not be any distinction between stability within and between
|    pipeline invocations: it really depends on what the invocation
|    environment is, and indeed whether there is one at all -- what extra
|    information gets passed to the components aside from the inputs and
|    parameters? For example, if the pipeline engine acts as a resource
|    manager, providing a URI/document mapping, then a step that accesses a
|    web page with stock price information would be stable within a
|    particular pipeline invocation (because the same document would always
|    be returned for the URI) but not between invocations. Similarly, if
|    the invocation environment includes the current date and time, then
|    XSLT and Timestamp components would be expected to use that date/time,
|    assigned at the point the pipeline was invoked rather than at the
|    point the step was run.
|
| Steps with side-effects and unstable steps have fixed relationships with
| each other: if a step has side-effects or is unstable then all steps
| that appear before it in the pipeline definition and that either have
| side-effects or are unstable must run before it, and all such steps that
| appear after it in the pipeline definition must run after it. When these
| steps occur within an iteration, then the iteration must be done in
| order. Of course, we could provide mechanisms to indicate exactly
| which steps rely on which others, such as a depends attribute or
| anonymous inputs and outputs. Stable steps without side effects can be
| reordered as desired.

I want to keep this as simple as possible. I'd be happier in V1 saying
that a pipeline engine must always do the safe thing (as you describe
above) rather than giving authors the ability to provide finer
granularity dependencies.

| If a step has side-effects then it must be run exactly once. Steps
| without side-effects can be omitted or run several times, though if an
| unstable step is run multiple times then all but the first invocation
| must be ignored, as the result might be different each time.
|
| The outputs of stable steps can be cached and reused. As long as it's
| stable, the outputs of a step with side effects are still cachable:
| the pipeline engine has to run the step anyway, and can't get on with
| other steps until it's finished, but could possibly glean some
| performance benefit from reusing outputs if it meant it didn't have to
| re-parse a large XML document, for example. An example of a stable
| step with side effects is one that takes an XML document, updates a
| database with the data it contains, and returns the same XML document
| as the result.
|
| Whether or not a step has side-effects and how stable it is depends on
| both the component that runs the step and the way the step itself is
| set up. Taking XSLT 1.0 as an example: if an XSLT 1.0 component is
| defined as only having one output (the result document it generates)
| then it would have to be classified as (potentially) having side
| effects, since there's no output to capture messages generated with
| <xsl:message>. If the component were defined with an output for
| messages, then it would be side-effect-free. But the stylesheet used
| in a particular step also determines the classification of the step: a
| stylesheet that didn't contain any <xsl:message> instructions would be
| side-effect free however the component were defined; a stylesheet that
| included extensions that carried out database updates would not be
| side-effect free.

Indeed.

                                        Be seeing you,
                                          norm

--
Norman Walsh
XML Standards Architect
Sun Microsystems, Inc.
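
For illustration only, here is a minimal Python sketch of two ideas discussed in this message: deriving an execution order purely from the input/output flow graph, and letting an engine alias the output of a repeated idempotent step (copy2 becoming an alias for copy1). The names Step, idempotent, execution_order, run and execute are hypothetical and not part of any proposed syntax. Resource (URI) dependencies, side effects and streaming are not modelled.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Step:
    name: str                 # component name, e.g. "p:copy"
    inputs: tuple             # labels (ref) or URIs (href) the step reads
    output: str               # label the step produces
    idempotent: bool = False  # hypothetical flag declared by the component


def execution_order(steps):
    """Return output labels in one order that respects the flow graph."""
    produced = {s.output for s in steps}
    # Map each label to the labels it depends on; hrefs are not graph nodes.
    graph = {s.output: {i for i in s.inputs if i in produced} for s in steps}
    return list(TopologicalSorter(graph).static_order())


def run(steps, execute):
    """Run steps in dependency order, aliasing repeated idempotent calls."""
    by_output = {s.output: s for s in steps}
    results = {}  # label -> document produced
    seen = {}     # (component, resolved inputs) -> label that first ran it
    alias = {}    # label -> earlier label whose result it reuses
    for label in execution_order(steps):
        step = by_output[label]
        resolved = tuple(alias.get(i, i) for i in step.inputs)
        key = (step.name, resolved)
        if step.idempotent and key in seen:
            # Same idempotent component, same inputs: skip the execution.
            alias[label] = seen[key]
            results[label] = results[seen[key]]
            continue
        args = tuple(results.get(i, i) for i in resolved)
        results[label] = execute(step, args)
        if step.idempotent:
            seen[key] = label
    return results
```

With the pipeline from item 4, run() would call execute once for p:copy and once for p:foo when p:copy is marked idempotent, and three times in total otherwise. Steps with no connection in the graph (p:first and p:second in item 3) may come out in either order.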
Received on Monday, 24 April 2006 15:02:49 UTC