about bug 29472 and turning streaming on and off

[This was sent to the member-only discussion list in error;
now sending it again to the public list.  Sorry for the snafu.]


> ACTION 2016-06-02-001 (bug 29472) on MSMcQ to put his views on this
> in writing, in email or in the bugzilla entry. 

During the XSLT WG call of 2 June I was asked to try to put my
thoughts on bug 29472 in writing, partly because I seem to be coming
at things from a very different angle.  The proposal to add a
'streamable' attribute to xsl:stream, with values 'yes' and 'no',
feels to me like the wrong direction, and this is my attempt to
explain why.  I apologize for its length.

A quick summary:

In general, control over optimization, execution speed, and other
properties orthogonal to the meaning of a program or object seems to me
to belong outside the object (program, stylesheet), not inside it.
Since streaming is an optimization orthogonal to the meaning (I/O
mapping) of a stylesheet, the right place for controls that say "yes"
or "no" to streaming processing for particular inputs seems to me to
be command-line options or invocation-time options in an API, not the
stylesheet itself.  This appears to mean I am out of sympathy with the
desire at the heart of 29472, to be able to turn streaming processing
on or off in implementation-independent ways.  If we do accede to the
wish for such controls, I hope we can make them look and feel like the
pragmas or processing instructions they are, and not like declarative
information about the stylesheet.  Using one mechanism for both the
proposition "this construct is in the guaranteed-streamable subset of
XSLT" and the request "please process this construct in a streaming
way" would conflate two bits of information of quite different kinds.


The details:

I.  Design issues

I believe the issue raised touches on several basic design principles
which have governed our work on streamability; some of these
principles have, I believe, been explicitly discussed as guiding
principles in the working group, while others may not have been (and
might not have commanded agreement if they had been).


1 Streamability is an optimization.

The goal of writing a stylesheet using the guaranteed-streamable
subset of XSLT and the goal of declaring things streamable in such a
stylesheet is to enable the stylesheet to be evaluated in a way that
limits one particular cost: storage consumption. The language of
guaranteed-streamable constructs is a subset, not an extension, of
XSLT.


2 Optimizations do not affect the meaning of a stylesheet.

Like other optimizations, streamability affects characteristics of the
execution process which lie outside the explicitly defined meaning of
the stylesheet (which I take to be, formally, a mapping from inputs to
outputs).


3 Non-streaming processors are expected to implement all the
constructs introduced in XSLT 3.0 for the sake of streaming.

This was explicitly discussed when one expected implementor said he
did not plan to implement some construct or other (accumulators?
forks?) because his would be a non-streaming processor. My
recollection is that we explicitly agreed that all conforming
processors were required to handle all constructs.


4 Streaming and non-streaming processors should never produce
different results for a given stylesheet on a given input, except in
areas documented as implementation-defined or -dependent.

N.B. I am using "result" here to mean "output produced by a stylesheet
evaluation that runs to completion without errors or exceptions"; we
expect non-streaming processors to fail on some inputs which streaming
processors can handle, but if neither processor fails, we expect the
same results.


My apologies if the foregoing is just a restatement of the obvious.


II. What am I looking for?  Where am I coming from?

Now to discharge my action by describing what I think makes sense for
mechanisms to control streaming.

a) First, I should say explicitly that I share the expectation that in
developing a complex stylesheet it will be very helpful to be able to
turn streaming off.

Just as people check the behavior of stylesheets (especially but not
only buggy stylesheets) by running them with different processors, I
would like to be able to check the behavior of a streaming stylesheet
I'm writing by running it with a non-streaming processor. This is one
reason I value points 3 and 4 above.

If a processor I'm using comes with both a streaming mode of operation
and a non-streaming mode, I expect I will want to test problematic
code in both modes.

b) Since it seems it may be a minority view, I should also say
explicitly that I expect my foreseeable user requirements in this area
to be met with a global streaming vs non-streaming switch. That is, I
don't foresee an urgent requirement on my part to turn off streaming
on one particular stream while keeping streaming turned on for all
other streams.  I may be wrong.

c) In other contexts (e.g. C compilation, Java execution, image
compression, database queries), I am accustomed to controlling the
aggressiveness of optimization and various aspects of resource usage
with command-line (invocation-time) switches.

My mental model of the 'right' way to control optimization, resource
usage, and other things which are orthogonal to the correctness of a
process is shaped by examples like these:

 - the -O0, -O1, -O2, and similar flags of gcc;

 - the Java options for controlling memory and garbage collection
   (-Xmx, -Xms, etc., etc.);

 - the run-time options on gzip and other compression software which
   control the compression method used and the tradeoff between
   compression time and compressed size;

 - the index construction statements of SQL.

All of these affect things like speed (of the compiler, of the
resulting executable, of the compression processor, of INSERT and
UPDATE statements, of SELECT statements) or size (of executable, of
virtual machine, of compressed output, of database on disk), but none
of them change the meaning of the C program, Java program, image or
other file being compressed, or SQL INSERT / UPDATE / SELECT
statement.

There are plenty of examples in computing history of cases where the
meaning of a formal language is not separated in this way from
properties like memory usage and speed: C itself is one, in many
areas; database management systems which provide different syntaxes
for search depending on whether a given field is indexed are another.
There is at least one XQuery engine in which a given expression will
have two very different meanings (the apparent meaning, or the empty
sequence) depending on whether one has built a Lucene index of the
database.  My mental model says that those are mostly good examples of
why I want XSLT to have a clean separation of meaning from properties
like size and speed.


d) By analogy, my instinct is to say that the right way to handle a
switch to control whether streaming is undertaken or not is with an
invocation-time switch, on the command-line or as an option passed to
an API.  Or, more generally, something wholly outside the stylesheet
itself.

This may mean that I disagree with the premise of bug 29472, for which
the initial description begins:

   ... we assessed that it was very desirable to have the possibility
   to switch OFF streaming for xsl:stream. The current means to do so
   are cumbersome to do implementation-independent way, or are at API
   level.

I expect command-line options to be implementation-dependent or to be
at the API level.  I no more expect an implementation-independent way
to control streaming than I expect an implementation-independent way
to ask a C compiler for aggressive optimization or for none at all.

In some styles of command-line options, options like --stream vs
--nostream or --streaming=yes|no would be natural; in others, they
would look different.

If I were specifying command-line options for a processor that wanted
to allow individual xsl:stream instructions to be processed with
different values of streaming-mode (yes vs no), my first sketch would
be: (a) assign an xml:id to every stream you wish to control in this
way; (b) use the --streaming=ID or --nostreaming=ID options
(repeatable) to turn streaming mode on or off on individual xsl:stream
instructions.  I mention this not because I think any implementors
have anything to learn from me in this regard but because it may help
other WG members understand where I am coming from (i.e. just how
benighted I may be, from their point of view).
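
For illustration only, the option style sketched above might be parsed
roughly as follows.  This is a minimal sketch in Python, not a
description of any existing processor: the program name, the function
names, and the rule that naming one ID in both options is an error are
assumptions made for the example.  The default of streaming everything
not mentioned follows the expectation, stated later in this message,
that a streaming processor streams by default.

```python
import argparse


def build_parser():
    # Hypothetical options, named after the sketch in the text; no real
    # XSLT processor is claimed to accept them.
    parser = argparse.ArgumentParser(prog="xslt-sketch")
    parser.add_argument("--streaming", action="append", default=[],
                        metavar="ID", help="stream the xsl:stream with this xml:id")
    parser.add_argument("--nostreaming", action="append", default=[],
                        metavar="ID", help="do not stream the xsl:stream with this xml:id")
    return parser


def streaming_plan(argv):
    """Map each xml:id named on the command line to True (stream) or False."""
    args = build_parser().parse_args(argv)
    clash = set(args.streaming) & set(args.nostreaming)
    if clash:
        # Assumption: contradictory requests are rejected, not resolved.
        raise ValueError("both --streaming and --nostreaming given for: "
                         + ", ".join(sorted(clash)))
    plan = {stream_id: True for stream_id in args.streaming}
    plan.update({stream_id: False for stream_id in args.nostreaming})
    return plan


def should_stream(plan, stream_id, default=True):
    """Decide for one xsl:stream; IDs not named fall back to the default."""
    return plan.get(stream_id, default)
```

Called with --nostreaming=orders --streaming=log, the plan turns
streaming off for the instruction whose xml:id is "orders" and leaves
every instruction not named on the command line streaming as usual.
Nothing in the stylesheet changes between the two runs.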


e) In general, if streaming is orthogonal to the meaning of the
stylesheet (point 2 above), it seems to follow that nothing in the
stylesheet itself can affect the choice of streaming mode or
non-streaming mode.  


f) If point 2 is accepted not just as describing a state of affairs
but as enunciating a design principle, then it seems to follow that
nothing in the stylesheet should be *allowed* to affect a processor's
choice of streaming mode or non-streaming mode.  

This explains why the proposal in comment 3 troubles me.


g) There is, however, a counter-argument.

It is not a universal truth that no program in a well-designed
language ever contains anything that affects optimization or other
non-semantic properties of the program.  There is a long history of
using pragmas, sometimes in the form of magic comments, to control the
behavior of compilers (including sometimes controlling the level of
optimization to be attempted).  And our sister language XQuery has a
well-developed system of pragmas and function annotations which
appears to work well for its intended purposes.

In general, inserting pragmas to control streaming of individual
xsl:stream constructs seems to me a poor choice for things one wants
to change from one run to the next.  It will remind some people of the
barbed remark in Kernighan and Pike (or Kernighan and Plauger?) about
passing run-time parameters to the program by defining them as
constants in the program and passing them to the program by using the
compiler as an intermediary.

But if we do want to make it possible to control streaming processing
from inside the stylesheet, then making something that looks and feels
like a pragma, and not like declarative information about the
construct, would feel to me like a better design.  XSLT doesn't have a
lot of things that feel like pragmas, but XML does provide processing
instructions for precisely this purpose.  A PI immediately before (as
immediately preceding sibling of) an xsl:stream instruction, or as its
first non-whitespace child, would feel better to me than
streamable=yes|no with a meaning unlike that of any other attribute
named 'streamable' in the spec.
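
To make the PI idea concrete, here is a minimal sketch, in Python, of
how a tool might find such a pragma.  Everything specific in it is an
assumption made for illustration: the PI target "streaming", its
yes/no content, the function name, and the sample stylesheet.  It
handles only the "immediately preceding sibling" placement (ignoring
whitespace between the PI and the instruction), not the "first
non-whitespace child" placement.

```python
from xml.dom import minidom

XSL_NS = "http://www.w3.org/1999/XSL/Transform"
PI_TARGET = "streaming"  # hypothetical PI target, not defined by any spec

# Hypothetical stylesheet: the first xsl:stream carries a pragma, the second does not.
EXAMPLE = """<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template name="main">
    <?streaming no?>
    <xsl:stream href="big.xml"><xsl:copy-of select="/*/item"/></xsl:stream>
    <xsl:stream href="other.xml"><xsl:copy-of select="/*/item"/></xsl:stream>
  </xsl:template>
</xsl:stylesheet>"""


def stream_pragmas(stylesheet_text):
    """Return (href, pragma) for each xsl:stream; pragma is None if absent."""
    doc = minidom.parseString(stylesheet_text)
    found = []
    for element in doc.getElementsByTagNameNS(XSL_NS, "stream"):
        node = element.previousSibling
        # Skip whitespace-only text between the PI and the instruction.
        while (node is not None and node.nodeType == node.TEXT_NODE
               and not node.data.strip()):
            node = node.previousSibling
        pragma = None
        if (node is not None
                and node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == PI_TARGET):
            pragma = node.data.strip()
        found.append((element.getAttribute("href"), pragma))
    return found
```

Because the request lives in a processing instruction, a processor
that does not recognize it can ignore it, and the xsl:stream
instruction itself carries no attribute whose meaning differs from
that of 'streamable' elsewhere.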


III. Some other points

Some of the specific questions raised in the bug and subsequent
discussion should probably be addressed.

A. In comment 1, MK asks

* What should a streaming/non-streaming processor do?

I think: a non-streaming processor will handle all constructs in its
non-streaming way.  A streaming-only processor, if it could exist,
would handle all constructs in its streaming way.  (I am imagining a
streaming processor with no alternative non-streaming implementation
of things like xsl:stream -- but such a processor cannot exist,
because a conforming stylesheet can contain code which is not
guaranteed streamable, applied to a stream.)  A streaming processor
will by default stream everything, and may allow the user control over
whether to stream everything, nothing, or selected bits of the
stylesheet.

* What should they do if the code is/is-not guaranteed-streamable?

I believe that whether the code is guaranteed streamable or not, all
conforming processors must process it according to its semantics,
provided that they have the resources to do so.  

As a user, of course, I hope that a good streaming processor with an
aggressive optimizer will be able to stream even things that are not
guaranteed streamable.

* What should they do if the code is streamable but not guaranteed
streamable?

If I ask for streaming, I hope they stream it.  If I ask for
non-streaming processing, I hope they don't.  Since streaming
processing is not tightly defined, non-streaming processing cannot be
tightly defined either; I do not expect to be able to do more than
hope, one way or the other.  I do not expect to be able to argue that
a processor is non-conforming because of the way its --stream=YES|NO
option behaves. I do expect to be able to argue that there is a bug if
that option causes different results in transforms that run to
completion without error (modulo implementation-defined or -dependent
differences).

B. Comments 2, 3, 5, 6 entertain variations on a proposal to add an
xsl:stream/@streamable attribute with 'yes' and 'no' (and possibly
other) values.

On other elements, attributes named 'streamable' indicate that the
construct declared is streamable (either guaranteed streamable or
streamable in fact).  Since the use of xsl:stream already has this
meaning, making it mean "please stream" doesn't lose information, but
it does seem to make the design less consistent.  Making it carry the
meaning "please stream" in all cases would, I think, be a mistake:
"this is streamable" and "please stream" are very different sentences,
with very different meanings.  The former has a clear declarative
meaning; the latter is an imperative which would feel out of place in
a declarative language like XSLT.

C. In an attachment [1] to message 5 of the June archive, ABr
summarizes what the spec says about streaming or non-streaming
processing for various situations.  I think the core content, for me,
now (ignoring many details of importance to other people and to me at
other times), is 

 - a streaming processor evaluating guaranteed-streamable code is
   expected to stream
 - any processor facing code that's not guaranteed streamable may
   stream if it can
 - a non-streaming processor should attempt to process everything

Given these principles, the simplest way to turn streaming off seems
to me to be "tell the processor to be a non-streaming processor".
Every streaming processor has the code necessary to be a non-streaming
processor, I think, because it may encounter constructs it does not
know how to stream but which it must (or wishes to?) attempt
nevertheless.  So I continue to think that a blanket --stream=YES|NO
is the simplest solution to turning streaming off or on.

[1]
http://lists.w3.org/Archives/Public/public-xsl-wg/2016Jun/att-0005/Streamability_guarantees_and_invocation_rules_-_html.htm#

ABr says it's not completely clear whether a (streaming) processor can
complain if it sees code declared streamable that's not guaranteed
streamable.

I think checking for guaranteed streamability is a service any
processor can offer, but since conforming stylesheets are not required
to limit themselves to guaranteed-streamable constructs in code
declared streamable, I think any processor must (or should) attempt to
evaluate the code, unless the user has specified a "die when you
encounter non-guaranteed-streamable code" option.  (Again -- an
invocation option, not a declaration in the stylesheet.)

Again my apologies for the length of this mail.


-- 
****************************************************************
* C. M. Sperberg-McQueen, Black Mesa Technologies LLC
* http://www.blackmesatech.com 
* http://cmsmcq.com/mib                 
* http://balisage.net
****************************************************************

Received on Thursday, 16 June 2016 17:27:31 UTC