- From: Jeni Tennison <jeni@jenitennison.com>
- Date: Mon, 24 Apr 2006 11:21:40 +0100
- To: public-xml-processing-model-wg@w3.org
Hi,

Trying to put my thoughts into order...

Given that we annotate components/steps to indicate what optimisations a pipeline engine can do, what are those annotations going to be and what do they permit the pipeline engine to do?

To summarise: I think we need to provide annotations at both the component level and the step level, indicating whether the component/step has side-effects or not, and how stable it is. We also need to be clear about what extra environment information is passed to each component.

First, let's consider some possible optimisations that a pipeline engine might want to do:

1. Omit the step if its outputs aren't used. For example, p:unused in this pipeline:

  <p:pipeline>
    <p:output ref="foo" />
    <p:step name="p:unused">
      <p:input href="unused.xml" />
      <p:output label="unused" />
    </p:step>
    <p:step name="p:foo">
      <p:input href="foo.xml" />
      <p:output label="foo" />
    </p:step>
  </p:pipeline>

2. Run the step multiple times. For example, p:reused in this pipeline:

  <p:pipeline>
    <p:output ref="bar" />
    <p:step name="p:reused">
      <p:input href="reused.xml" />
      <p:output label="reused" />
    </p:step>
    <p:step name="p:foo">
      <p:input ref="reused" />
      <p:output label="foo" />
    </p:step>
    <p:step name="p:bar">
      <p:input name="doc1" ref="reused" />
      <p:input name="doc2" ref="foo" />
      <p:output label="bar" />
    </p:step>
  </p:pipeline>

3. Reorder the steps in the pipeline, e.g. parallel execution. For example, running p:second before, or at the same time as, p:first in this pipeline:

  <p:pipeline>
    <p:output ref="foo" />
    <p:step name="p:first">
      <p:input href="first.xml" />
      <p:output label="first" />
    </p:step>
    <p:step name="p:second">
      <p:input href="second.xml" />
      <p:output label="second" />
    </p:step>
    <p:step name="p:foo">
      <p:input name="doc1" ref="first" />
      <p:input name="doc2" ref="second" />
      <p:output label="foo" />
    </p:step>
  </p:pipeline>

4. Use cached results of the component invoked in the same way in the same pipeline invocation.
For example, using 'copy1' rather than 'copy2' in the p:foo step in this pipeline:

  <p:pipeline>
    <p:output ref="foo" />
    <p:step name="p:copy">
      <p:input href="copy.xml" />
      <p:output label="copy1" />
    </p:step>
    <p:step name="p:copy">
      <p:input href="copy.xml" />
      <p:output label="copy2" />
    </p:step>
    <p:step name="p:foo">
      <p:input name="doc1" ref="copy1" />
      <p:input name="doc2" ref="copy2" />
      <p:output label="foo" />
    </p:step>
  </p:pipeline>

5. Use cached results of the component invoked in the same way in a different pipeline invocation. For example, cache the 'foo' document in this pipeline and reuse it the next time the pipeline is invoked, assuming that foo.xml hasn't changed in the meantime:

  <p:pipeline>
    <p:output ref="foo" />
    <p:step name="p:foo">
      <p:input href="foo.xml" />
      <p:output label="foo" />
    </p:step>
  </p:pipeline>

I think there are two things that affect which of these optimisations can be carried out:

A. Whether the step has side effects: that is, whether it does something other than generating the outputs defined for the step. Updating a database is an example.

B. Whether the step uses information other than the inputs and parameters (and invocation environment, whatever we decide that is) to determine the output. There are three levels to this:

  - unstable steps
  - steps that are stable within a particular pipeline invocation
  - steps that are stable between pipeline invocations

There might not be any distinction between stability within and between pipeline invocations: it really depends on what the invocation environment is, and indeed whether there is one at all -- what extra information gets passed to the components aside from the inputs and parameters?

For example, if the pipeline engine acts as a resource manager, providing a URI/document mapping, then a step that accesses a web page with stock price information would be stable within a particular pipeline invocation (because the same document would always be returned for the URI) but not between invocations.
Similarly, if the invocation environment includes the current date and time, then XSLT and Timestamp components would be expected to use that date/time, assigned at the point the pipeline was invoked rather than at the point the step was run.

Steps with side-effects and unstable steps have fixed relationships with each other: if a step has side-effects or is unstable then all steps that appear before it in the pipeline definition and that either have side-effects or are unstable must run before it, and all such steps that appear after it in the pipeline definition must run after it. When these steps occur within an iteration, then the iteration must be done in order. Of course, we could provide mechanisms to indicate exactly which steps rely on which others, such as a depends attribute or anonymous inputs and outputs.

Stable steps without side effects can be reordered as desired.

If a step has side-effects then it must be run exactly once. Steps without side-effects can be omitted or run several times, though if an unstable step is run multiple times then the results of all but the first invocation must be ignored, as the result might be different each time.

The outputs of stable steps can be cached and reused. As long as it's stable, the outputs of a step with side effects are still cacheable: the pipeline engine has to run the step anyway, and can't get on with other steps until it's finished, but could possibly glean some performance benefit from reusing outputs if it meant it didn't have to re-parse a large XML document, for example. An example of a stable step with side effects is one that takes an XML document, updates a database with the data it contains, and returns the same XML document as the result.

Whether or not a step has side-effects and how stable it is depends on both the component that runs the step and the way the step itself is set up.
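As a rough illustration only, the run-count, reordering and caching rules above could be written down as a lookup from a step's annotations to the optimisations an engine may apply. This is a minimal sketch in Python, not proposed pipeline syntax: the names Stability, StepAnnotation and permitted_optimisations, and the optimisation labels, are all invented for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Set


class Stability(Enum):
    """The three stability levels (names are hypothetical)."""
    UNSTABLE = 0             # output may differ on every run
    WITHIN_INVOCATION = 1    # stable within one pipeline invocation
    BETWEEN_INVOCATIONS = 2  # stable across pipeline invocations


@dataclass(frozen=True)
class StepAnnotation:
    """Hypothetical annotation attached to a step."""
    side_effects: bool
    stability: Stability


def permitted_optimisations(a: StepAnnotation) -> Set[str]:
    """Return the optimisations an engine may apply to a step."""
    allowed: Set[str] = set()
    stable = a.stability is not Stability.UNSTABLE

    if not a.side_effects:
        # Steps without side effects can be omitted or run several times
        # (for an unstable step, only the first run's result may be used).
        allowed.update({"omit", "rerun"})
    if stable and not a.side_effects:
        # Only stable, side-effect-free steps can be freely reordered.
        allowed.add("reorder")
    if stable:
        # Outputs of stable steps can be cached, even when the step has
        # side effects and must therefore still be run exactly once.
        allowed.add("cache-within")
    if a.stability is Stability.BETWEEN_INVOCATIONS:
        allowed.add("cache-between")
    return allowed


# A step that updates a database and returns its input unchanged:
# stable, but with side effects.
db_update = StepAnnotation(side_effects=True,
                           stability=Stability.BETWEEN_INVOCATIONS)
print(sorted(permitted_optimisations(db_update)))
# prints ['cache-between', 'cache-within']
```

The sketch deliberately leaves out the relative-ordering constraint between steps that have side-effects or are unstable, since that depends on the whole pipeline rather than on a single step's annotation.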
Taking XSLT 1.0 as an example: if an XSLT 1.0 component is defined as only having one output (the result document it generates) then it would have to be classified as (potentially) having side effects, since there's no output to capture messages generated with <xsl:message>. If the component were defined with an output for messages, then it would be side-effect-free.

But the stylesheet used in a particular step also determines the classification of the step: a stylesheet that didn't contain any <xsl:message> instructions would be side-effect free however the component were defined; a stylesheet that included extensions that carried out database updates would not be side-effect free.

Cheers,

Jeni
--
Jeni Tennison
http://www.jenitennison.com
Received on Monday, 24 April 2006 10:21:50 UTC