RE: about bug 29472 and turning streaming on and off

See also my comments between the lines.

In summary, a lot of your argument revolves around the fact that you seem to think that a stylesheet is either streamable or not. But we do not have a concept of streamable stylesheets (or streamable modules or packages). We have a concept of streamable constructs.

Another (big) part of your argument is control from the API or commandline. The irony is that it is precisely that point that initiated the xsl:stream/@streamable="yes|no" discussion. Without that attribute it is not possible to control it via the commandline. Consider:

<xsl:param name="stream" select=" 'no' " static="yes" />

<xsl:mode _streamable="{$stream}"/>

<xsl:template match="/">
   <xsl:stream href="strm.xml">
     <xsl:apply-templates />
   </xsl:stream>
</xsl:template>

As written, this is not streamable and MUST raise an error (but a processor MUST also provide a commandline option to try to stream it anyway, but see my previous mail).

Currently, the only way to counter this is as follows:

<xsl:param name="stream" select=" 'no' " static="yes" />

<xsl:mode _streamable="{$stream}"/>

<xsl:template match="/">
   <xsl:stream href="strm.xml" use-when="$stream = 'yes' or $stream = '1' or $stream = 'true' ">
     <xsl:apply-templates />
   </xsl:stream>
   <xsl:apply-templates select="doc('strm.xml')/node()" use-when="$stream = 'no' or $stream = '0' or $stream = 'false' " />
</xsl:template>

This is not only counter-intuitive, it is also very hard to do everywhere in your stylesheet, and it clutters your code tremendously. It would be far simpler if you could instead write this:

<xsl:param name="stream" select=" 'no' " static="yes" />

<xsl:mode _streamable="{$stream}"/>

<xsl:template match="/">
   <xsl:stream href="strm.xml" _streamable="{$stream}">
     <xsl:apply-templates />
   </xsl:stream>
</xsl:template>

With that, you could use the API or commandline interface of your processor to control streaming.



> -----Original Message-----
> From: C. M. Sperberg-McQueen [mailto:cmsmcq@blackmesatech.com]
> Sent: Thursday, June 16, 2016 6:13 AM
> To: XSL Working Group
> Cc: C. M. Sperberg-McQueen
> Subject: about bug 29472 and turning streaming on and off
> 
> > ACTION 2016-06-02-001 (bug 29472) on MSMcQ to put his views on this
> > in writing, in email or in the bugzilla entry.
> 
> During the XSLT WG call of 2 June I was asked to try to put my
> thoughts on bug 29472 in writing, partly because I seem to be coming
> at things from a very different angle.  The proposal to add a
> 'streamable' attribute to xsl:stream, with values 'yes' and 'no',
> feels to me like the wrong direction, and this is my attempt to
> explain why.  I apologize for its length.
> 
> A quick summary:
> 
> In general, control over optimization, execution speed, and other
> properties orthogonal to the meaning of a program or object seem to me
> to belong outside the object (program, stylesheet), not inside it.
> Since streaming is an optimization orthogonal to the meaning (I/O
> mapping) of a stylesheet, the right place for controls that say "yes"
> or "no" to streaming processing for particular inputs seems to me to
> be command-line options or invocation-time options in an API, not the
> stylesheet itself.  This appears to mean I am out of sympathy with the
> desire at the heart of 29472, to be able to turn streaming processing
> on or off in implementation-independent ways.  If we do accede to the
> wish for such controls, I hope we can make them look and feel like the
> pragmas or processing instructions they are, and not like declarative
> information about the stylesheet.  Conflating the proposition "this
> construct is in the guaranteed-streamable subset of XSLT" and the
> request "please process this construct in a streaming way" would
> conflate two different bits of information of quite different kinds.
> 
> 
> The details:
> 
> I.  Design issues
> 
> I believe the issue raised touches on several basic design principles
> which I believe have governed our work on streamability; some of these
> principles have I believe been explicitly discussed as guiding
> principles in the working group, while others may not have been (and
> might not have commanded agreement if they had been).
> 
> 
> 1 Streamability is an optimization.
> 
> The goal of writing a stylesheet using the guaranteed-streamable
> subset of XSLT and the goal of declaring things streamable in such a
> stylesheet is to enable the stylesheet to be evaluated in a way that
> limits one particular cost: storage consumption. The language of
> guaranteed-streamable constructs is a subset, not an extension, of
> XSLT.

There is a fundamental mistake in this reasoning: a stylesheet is not guaranteed streamable. Nor is a module, or a package. A construct is.

One stylesheet can contain both guaranteed streaming and non-streaming code. If the API is the sole way of distinguishing these parts (think of xsl:mode, xsl:accumulator, xsl:template, xsl:stream, xsl:attribute-set) then it becomes very, very complex to define what part is streamable and what is not. Hence we decided at some point to *declaratively* state that a given construct SHOULD BE guaranteed streamable. 
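
To make that concrete, here is a minimal sketch of one stylesheet holding both kinds of code. The mode names and the element name "record" are invented for illustration; only the streamable attribute on xsl:mode is taken from the spec.

<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <!-- declared streamable: templates in this mode are assessed
        against the streamability rules -->
   <xsl:mode name="big" streamable="yes" on-no-match="shallow-skip" />

   <!-- not declared streamable: no assessment, free navigation -->
   <xsl:mode name="small" streamable="no" />

   <!-- only looks at an attribute of the matched node -->
   <xsl:template match="record" mode="big">
      <id><xsl:value-of select="@id" /></id>
   </xsl:template>

   <!-- looks backwards in the tree, so it lives in the
        non-streamable mode -->
   <xsl:template match="record" mode="small">
      <id after="{preceding-sibling::record[1]/@id}">
         <xsl:value-of select="@id" />
      </id>
   </xsl:template>

</xsl:stylesheet>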

This also serves another purpose: compatibility between processors. If we do not have these declarations, how can you ever determine, as a user, that two constructs will be processed in a streamable way by two compatible processors?

And then there is the implementation problem: if you precompile a stylesheet, a processor MUST know beforehand that a construct is streamable, because the way such a construct is compiled is fundamentally different. In fact (but I don't know about Saxon), it changes the way the XDM is built. As such, this information must be declaratively present to allow a processor to make the right decisions.

>
>
> 2 Optimizations do not affect the meaning of a stylesheet.
> 
> Like other optimizations, streamability affects characteristics of the
> execution process which lie outside the explicitly defined meaning of
> the stylesheet (which I take to be, formally, a mapping from inputs to
> outputs).

Streamability is more than just an optimization. Some streams, like continuous streams, can never be processed by a non-streaming processor. And while they have a mapping from input to output at a given point in time, they do not have a stable mapping: if you call xsl:stream twice inside one execution, you get different results with the same input. A non-streaming processor cannot do that.

In part this is also true for non-continuous streams. Streams are non-deterministic and therefore more than just an optimization.

> 3 Non-streaming processors are expected to implement all the
> constructs introduced in XSLT 3.0 for the sake of streaming.
> 
> This was explicitly discussed when one expected implementor said he
> did not plan to implement some construct or other (accumulators?
> forks?) because his would be a non-streaming processor. My
> recollection is that we explicitly agreed that all conforming
> processors were required to handle all constructs.

Yes, non-streaming processors MUST implement everything, but they are allowed to break based on other factors (size of input, continuous streams).


> 4 Streaming and non-streaming processors should never produce
> different results for a given stylesheet on a given input, except in
> areas documented as implementation-defined or -dependent.
> 
> N.B. I am using "result" here to mean "output produced by a stylesheet
> evaluation that runs to completion without errors or exceptions"; we
> expect non-streaming processors to fail on some inputs which streaming
> processor can handle, but if neither processor fails, we expect the
> same results.

Not quite true. Two streaming processors MUST provide the same output if run with the same variables (time, input stream bytes etc), but a non-streaming processor does not have to, because it cannot in some scenarios.



> II. What am I looking for?  Where am I coming from?
> 
> Now to discharge my action by describing what I think makes sense for
> mechanisms to control streaming.
> 
> a) First, I should say explicitly that I share the expectation that in
> developing a complex stylesheet it will be very helpful to be able to
> turn streaming off.
> 
> Just as people check the behavior of stylesheets (especially but not
> only buggy stylesheets) by running them with different processors, I
> would like to be able to check the behavior of a streaming stylesheet
> I'm writing by running it with a non-streaming processor. This is one
> reason I value points 3 and 4 above.
> 
> If a processor I'm using comes with both a streaming mode of operation
> and a non-streaming mode, I expect I will want to test problematic
> code in both modes.

Exactly. But on a per-construct basis. Not the stylesheet as a whole.


> b) Since it seems it may be a minority view, I should also say
> explicitly that I expect my foreseeable user requirements in this area
> to be met with a global streaming vs non-streaming switch. That is, I
> don't foresee an urgent requirement on my part to turn off streaming
> on one particular stream while keeping streaming turned on for all
> other streams.  I may be wrong.

This in itself can be a processor-switch: to "switch" the whole processor into a non-streaming processor.

But streaming is a property of a construct, not of a stylesheet or a package as a whole. And a precompiled package may be compiled in such a way that it can only be run with processors that support the streamability feature.

Conversely, we can already switch it on/off for *all* constructs that support streaming, except one, xsl:stream. It is a matter of orthogonality and usability to allow users to switch it on/off on xsl:stream as well.
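
As a sketch of what already works today (the accumulator name and the element name "record" are invented for illustration), one static parameter can drive the streamable attribute of several declarations through shadow attributes:

<xsl:param name="stream" select=" 'yes' " static="yes" />

<!-- the static parameter decides whether this mode is assessed
     for streamability -->
<xsl:mode _streamable="{$stream}" />

<!-- the same switch applied to an accumulator declaration -->
<xsl:accumulator name="records" initial-value="0" _streamable="{$stream}">
   <xsl:accumulator-rule match="record" select="$value + 1" />
</xsl:accumulator>

In the current draft, xsl:stream has no streamable attribute to shadow in the same way, which is the gap under discussion.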

> 
> c) In other contexts (e.g. C compilation, Java execution, image
> compression, database queries), I am accustomed to control the
> aggressiveness of optimization and various aspects of resource usage
> with command-line (invocation-time) switches.

I disagree with the notion that streaming is merely an optimization. It affects the code as a whole. The optimizations you speak of are of the kind of removing debugging information, tail-call recursion to loop optimization, function inlining etc. They are of a very different kind.

In C you cannot use an optimization switch to control whether or not the bodies of functions should be inlined. In fact, for that they have the "inline" keyword. In F#, you can force a declaration to be tail-call optimized, in the code. You can then have a debug build (without the TCO) and a non-debug build with the TCO.

But in both cases, the programmer must write it in his code correctly and must inform the processor from his code that he wants the compiler to assess that a certain construct is inlineable or TCO'able. 

> 
> My mental model of the 'right' way to control optimization, resource
> usage, and other things which are orthogonal to the correctness of a
> process is shaped by examples like these:
> 
>   - the -O0, -O1, -O2, and similar flags of gcc;

These optimizations do not control resource optimization, which in these languages must be written by the programmer explicitly with delete/new and malloc for instance. 

> 
>   - the Java options for controlling memory and garbage collection
>     (-Xmx, -Xms, etc., etc.);

Again, these do not control memory consumption of the code, they *limit* the possible memory consumption of the code. The programmer still has to write the code in such a way that it fits in the memory.

In XSLT this is not different: the programmer must program the code in such a way that it does not consume much memory. And then we added a way to allow the programmer to control on a per-construct basis whether it should consume all, or very limited memory. It is still up to the processor how much "very limited" means. But the programmer, and *only* the programmer knows when the processor should be allowed to "consume all", or only a part of the input stream.

I don't see a way to control this from the commandline.

> 
>   - the run-time options on gzip and other compression software which
>     control the compression method used and the tradeoff between
>     compression time and compressed size;

I don't see the analogy.

> 
>   - the index construction statements of SQL.

In this case, the language SQL has this as part of the language. Just like XSLT has it as part of the language to control whether or not you stream. Or am I missing your point here?

> 
> All of these affect things like speed (of the compiler, of the
> resulting executable, of the compression processor, of INSERT and
> UPDATE statements, of SELECT statements) or size (of executable, of
> virtual machine, of compressed output, of database on disk), but none
> of them change the meaning of the C program, Java program, image or
> other file being compressed, or SQL INSERT / UPDATE / SELECT
> statement.

Yes, but in all but a few examples above, the programmer controls this *through the language*, not with compiler switches. Sometimes, a compiler switch adds something extra (remove debugging information, automatic TCO), but we have that in XSLT as well: removing xsl:assert, or adding the execution plan.

But I think we are closer to SQL, where you can use the language itself to create indexes, intermediate tables, in-memory tables, temp tables etc. All by using standard SQL alone and not command-line switches on firing up the database. 

> 
> There are plenty of examples in computing history of cases where the
> meaning of a formal language is not separated in this way from
> properties like memory usage and speed: C itself is one, in many
> areas; database management systems which provide different syntaxes
> for search depending on whether a given field is indexed are another.
> There is at least one XQuery engine in which a given expression will
> have two very different meanings (the apparent meaning, or the empty
> sequence) depending on whether one has built a Lucene index of the
> database.  My mental model says that those are mostly good examples of
> why I want XSLT to have a clean separation of meaning from properties
> like size and speed.

We have that clean separation, I think. With only a handful of exceptions. Since XSLT 1.0 we have xsl:key, which is an optimization. In XSLT 3.0 we added cacheability of functions, because we found that a processor cannot always detect this. Similarly we added rollback="yes|no", which can also be considered an optimization, but again, this cannot be detected by the processor.

Streaming is not a property of size and speed. It is a fundamentally different way of treating the input tree and certain operations do not apply to it (preceding-sibling). To be able to assess that and to allow users control over it, forces us to declare this as a property of a construct. 
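
A small sketch of what I mean (the element name "record" is invented). As I read the streamability rules, the template below cannot be guaranteed streamable in a mode declared streamable, whatever the size of the input:

<xsl:mode streamable="yes" />

<!-- not guaranteed streamable: preceding-sibling looks back at nodes
     that a streaming processor is not required to keep -->
<xsl:template match="record">
   <xsl:value-of select="preceding-sibling::record[1]/@id" />
</xsl:template>

The same template is perfectly fine in a mode with streamable="no", which is why the declaration has to sit on the construct.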

> 
> 
> d) By analogy, my instinct is to say that the right way to handle a
> switch to control whether streaming is undertaken or not is with an
> invocation-time switch, on the command-line or as an option passed to
> an API.  Or, more generally, something wholly outside the stylesheet
> itself.

Again, this is not possible, I think. How would you devise such an API? Streamability is a property of a construct, not of a stylesheet.

> 
> This may mean that I disagree with the premise of bug 29472, for which
> the initial description begins:
> 
>     ... we assessed that it was very desirable to have the possibility
>     to switch OFF streaming for xsl:stream. The current means to do so
>     are cumbersome to do implementation-independent way, or are at API
>     level.
> 
> I expect command-line options to be implementation-dependent or to be
> at the API level.  I no more expect an implementation-independent way
> to control streaming than I expect an implementation-independent way
> to ask a C compiler for aggressive optimization or for none at all.

It is not an optimization. And xsl:stream is simply the only way to have an implementation-independent way of saying "I want to stream this". Using the principal mode is another way, but then you cannot control the reading of the input tree. With xsl:stream you can, it is the counterpart of fn:doc.

We tried in an earlier draft to do it the way you suggest (roughly four years ago, I think), by finding a way to make fn:doc streamable. We didn't succeed. So we went on to use xsl:stream as the only way to read auxiliary documents in a streamable way. Of the eight-or-so constructs that can switch streaming on and off, xsl:stream is the only one that cannot switch it on and off.

> 
> In some styles of command-line options, options like --stream vs
> --nostream or --streaming=yes|no would be natural; in others, they
> would look different.

Streaming is not a property of the stylesheet. This would not work.

> 
> If I were specifying command-line options for a processor that wanted
> to allow individual xsl:stream instructions to be processed with
> different values of streaming-mode (yes vs no), my first sketch would
> be: (a) assign an xml:id to every stream you wish to control in this
> way; (b) use the --streaming=ID or --nostreaming=ID options
> (repeatable) to turn streaming mode on or off on individual xsl:stream
> instructions.  I mention this not because I think any implementors
> have anything to learn from me in this regard but because it may help
> other WG members understand where I am coming from (i.e. just how
> benighted I may be, from their point of view).

I think the wrong line of reasoning here is the assumption that xsl:stream is the only way to start streaming analysis. It is not. The streamable="yes|no" attribute is available on a myriad of instructions and declarations.

> 
> 
> e) In general, if streaming is orthogonal to the meaning of the
> stylesheet (point 2 above), it seems to follow that nothing in the
> stylesheet itself can affect the choice of streaming mode or
> non-streaming mode.

No, it is not orthogonal to the meaning of a stylesheet. A stylesheet is not streamable. Some parts of it may be.


> f) If point 2 is accepted not just as describing a state of affairs
> but as enunciating a design principle, then it seems to follow that
> nothing in the stylesheet should be *allowed* to affect a processor's
> choice of streaming mode or non-streaming mode.

If this were true, how could a processor ever detect which of the following parts should be processed using streaming?

A)
<xsl:template match="foo[bar]">do something</xsl:template>

B)
<xsl:template match="foo[@bar]">do something</xsl:template>

Without the fine control by the programmer to say what should be processed by using streaming or not, there is no way a processor can decide which of the above should be processed using streaming. And it MUST know this in advance, at compile time, NOT at priming time (when the stream is made available).
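
Spelled out, assuming both rules sit in a mode declared streamable (this is my reading of the rules, and the comments are mine):

<xsl:mode streamable="yes" />

<!-- A: the predicate has to read the children of foo before the
     match can be decided, so the pattern is not motionless -->
<xsl:template match="foo[bar]">do something</xsl:template>

<!-- B: attributes are available as soon as the start tag of foo
     has been read, so the pattern is motionless -->
<xsl:template match="foo[@bar]">do something</xsl:template>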

Even if it is capable of streaming the first example, it still isn't enough, because it will be compiled differently based on knowing whether it must be streamed or not. So if the argument is "compile anything that can be made streamable as streaming", this still won't cut it.

And another argument: streaming influences performance. So the user should be aware that often streaming is slower than normal processing. As such, he will need fine control.


> 
> This explains why the proposal in comment 3 troubles me.
> 
> 
> g) There is, however, a counter-argument.
> 
> It is not a universal truth that no program in a well-designed
> language ever contains anything that affects optimization or other
> non-semantic properties of the program.  There is a long history of
> using pragmas, sometimes in the form of magic comments, to control the
> behavior of compilers (including sometimes controlling the level of
> optimization to be attempted).  And our sister language XQuery has a
> well-developed system of pragmas and function annotations which
> appears to work well for its intended purposes.
> 
> In general, inserting pragmas to control streaming of individual
> xsl:stream constructs seems to me a poor choice for things one wants
> to change from one run to the next.  It will remind some people of the
> barbed remark in Kernighan and Pike (or Kernighan and Plauger?) about
> passing run-time parameters to the program by defining them as
> constants in the program and passing them to the program by using the
> compiler as an intermediary.
> 
> But if we do want to make it possible to control streaming processing
> from inside the stylesheet, then making something that looks and feels
> like a pragma, and not like declarative information about the
> construct, would feel to me like a better design.  XSLT doesn't have a
> lot of things that feel like pragmas, but XML does provide processing
> instructions for precisely this purpose.  A PI immediately before (as
> immediately preceding sibling of) an xsl:stream instruction, or as its
> first non-whitespace child, would feel better to me than
> streamable=yes|no with a meaning unlike that of any other attribute
> named 'streamable' in the spec.

How do you mean "unlike any other attribute named streamable"? It has *exactly* the same meaning here as with other constructs. The only difference is that its default is "yes", because the construct was introduced to help streaming.

In hindsight, I would have found it better to have xsl:doc, as a counterpart to fn:doc, which can then have an @streamable="yes|no", which would be more in line with existing constructs.

> 
> 
> III. Some other points
> 
> Some of the specific questions raised in the bug and subsequent
> discussion should probably be addressed.
> 
> A. In comment 1, MK asks
> 
> * What should a streaming/non-streaming processor do?
> 
> I think: a non-streaming processor will handle all constructs in its
> non-streaming way.  A streaming-only processor, if it could exist,
> would handle all constructs in its streaming way (I am imagining a
> streaming processor with no alternative non-streaming implementation
> of things like xsl:stream -- but such a processor cannot exist,
> because a conforming stylesheet can contain code which is not
> guaranteed streamable, applied to a stream.  A streaming processor
> will by default stream everything, and may allow the user control over
> whether to stream everything, nothing, or selected bits of the
> stylesheet.

If that were true, and even possible, we could throw roughly a third of the spec away. But I don't think this is possible. An earlier attempt tried this, it is still in a public WD (https://www.w3.org/TR/2010/WD-xslt-21-20100511), but even there, we already recognized the need for an @streamable attribute on xsl:mode. We also found, in lengthy discussions in New York (I believe), that that approach didn't work and that we needed better control for processors to be able to know what can be streamed and what not.

In an ideal world, such a thing might be possible, but then we should have started in XSLT 1.0 with STX. Or allow two versions of XSLT, one to be STX and one to be XSLT as it stands. I believe we even discussed a formal separate XPath and XSLT language, but we decided not to pursue that.

> 
> * What should they do if the code is/is-not guaranteed-streamable?
> 
> I believe that whether the code is guaranteed streamable or not, all
> conforming processors must process it according to its semantics,
> provided that they have the resources to do so.

It seems that this argument counters your previous argument where you seem to say that a streaming processor must process everything as streaming (even what currently is not considered streamable).

> 
> As a user, of course, I hope that a good streaming processor with an
> aggressive optimizer will be able to stream even things that are not
> guaranteed streamable.

Of course. And we allow that. But as with any processor-dependent behaviors, this is not guaranteed to be compatible.

> 
> * What should they do if the code is streamable but not guaranteed
> streamable?
> 
> If I ask for streaming, I hope they stream it.  If I ask for
> non-streaming processing, I hope they don't.  Since streaming
> processing is not tightly defined, non-streaming processing cannot be
> tightly defined either; I do not expect to be able to do more than
> hope, one way or the other.  I do not expect to be able to argue that
> a processor is non-conforming because of the way its --stream=YES|NO
> option behaves. I do expect to be able to argue that there is a bug if
> that option causes different results in transforms that run to
> completion without error (modulo implementation-defined or -dependent
> differences).

Streaming processing *is* tightly defined. Sections 18 and 19 cover this.

These sections are in place to allow users to write a stylesheet that can run on *every* processor that claims conformance with the streaming feature. If we didn't have these rules in place, we would lose compatibility. A user has, however, the option to choose to create an incompatible stylesheet that is not guaranteed streamable, but can be streamed on a given processor.

That is why we suggested to introduce a hint to the processor: tell the processor declaratively that you know that the stylesheet is not guaranteed streamable, but that you know the input stream and hence, you know it can be streamed with a given processor. It also tells another processor that it may fail doing so.

> 
> B. Comments 2, 3, 5, 6 entertain variations on a proposal to add an
> xsl:stream/@streamable attribute with 'yes' and 'no' (and possibly
> other) values.
> 
> On other elements, attributes named 'streamable' indicate that the
> construct declared is streamable (either guaranteed streamable or
> streamable in fact).  Since the use of xsl:stream already has this
> meaning, making it mean "please stream" doesn't lose information, but
> it does seem to make the design less consistent.  Making it carry the
> meaning "please stream" in all cases would, I think, be a mistake:
> "this is streamable" and "please stream" are very different sentences,
> with very different meanings.  The former has a clear declarative
> meaning; the latter is an imperative which would feel out of place in
> a declarative language like XSLT.

I think here you may have misunderstood the proposal. The meaning of xsl:stream/@streamable is proposed to have *precisely* the same meaning as with the same property in other constructs. Otherwise I don't see how it makes any sense.

<xsl:stream streamable="yes">
Means: the construct is guaranteed streamable, use streaming if you are a streaming processor (same as xsl:mode/@streamable="yes")

<xsl:stream streamable="no">
Means: the construct is not known to be guaranteed streamable, do not even assess streamability, just process as if you were a non-streaming processor (same as xsl:mode/@streamable="no")
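
As a sketch of the intent (this is the proposed attribute, not something in the current draft), the "no" case would behave roughly like the fn:doc form that has to be written by hand today:

<!-- proposed: read strm.xml without streaming -->
<xsl:stream href="strm.xml" streamable="no">
   <xsl:apply-templates />
</xsl:stream>

<!-- rough equivalent today: the children of the document node
     are processed from an in-memory tree -->
<xsl:apply-templates select="doc('strm.xml')/node()" />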

> 
> C. In an attachment [1] to message 5 of the June archive, ABr
> summarizes what the spec says about streaming or non-streaming
> processing for various situations.  I think the core content, for me,
> now (ignoring many details of importance to other people and to me at
> other times), is
> 
>   - a streaming processor evaluating guaranteed-streamable code is
>     expected to stream
>   - any processor facing code that's not guaranteed streamable may
>     stream if it can
>   - a non-streaming processor should attempt to process everything

Yes, I agree. 

> 
> Given these principles, the simplest way to turn streaming off seems
> to me to be "tell the processor to be a non-streaming processor".
> Every streaming processor has the code necessary to be a non-streaming
> processor, I think, because it may encounter constructs it does not
> know how to stream but which it must (or wishes to?) attempt
> nevertheless.  So I continue to think that a blanket --stream=YES|NO
> is the simplest solution to turning streaming off or on.

Again, streamability is a property of a construct, not of the stylesheet. While it may make sense to have such a property to have your processor behave as if it is a non-streaming processor, it doesn't help with the case at hand.

Also, such a commandline argument may not help with precompiled packages or modules. Deciding whether a construct must be tested for streamability has to happen at compile time (the static phase), but documents are only supplied at priming time or runtime. As such, a processor must know whether a construct will use streaming (and therefore whether to load the relevant document(s) using streaming) before priming the stylesheet.

> 
> [1]
> http://lists.w3.org/Archives/Public/public-xsl-wg/2016Jun/att-
> 0005/Streamability_guarantees_and_invocation_rules_-_html.htm#
> 
> ABr says it's not completely clear whether a (streaming) processor can
> complain if it sees code declared streamable that's not guaranteed
> streamable.
> 
> I think checking for guaranteed streamability is a service any
> processor can offer, but since conforming stylesheets are not required
> to limit themselves to guaranteed-streamable constructs in code
> declared streamable, I think any processor must (or should) attempt to
> evaluate the code, unless the user has specified a "die when you
> encounter non-guaranteed-streamable code" option.  (Again -- an
> invocation option, not a declaration in the stylesheet.)

I don't think I asked this question: 

The spec defines what "guaranteed streamable" means for a given construct. And we already allow, at user option, processing with streaming while ignoring the error. The error is:

[ERR XTSE3430] It is a static error if a package contains a construct that is declared to be streamable but which is not guaranteed-streamable, unless the user has indicated that the processor is to handle this situation by processing the stylesheet without streaming or by making use of processor extensions to the streamability rules where available.

What I did ask is what happens if you attempt to provide a stream on input to a construct that is not streamable. Since a stylesheet is compiled in advance with the given information, it will typically not be able to stream in such a case. We do not have an error for the inverse scenario of XTSE3430.

> 
> Again my apologies for the length of this mail.

My apologies for my lengthy answers ;)

Received on Thursday, 16 June 2016 11:56:56 UTC