Re: Grounded consuming sequences, and the last() function

> On 1 Oct 2015, at 15:25, Abel Braaksma <abel.braaksma@xs4all.nl> wrote:
> 
>> 
>> We make no distinction between
>> 
>> copy-of(/a/b/c)
>> 
>> and
>> 
>> /a/b/c/copy-of()
>> 
> 
> It is true that we don't make a distinction in the *result* of the streamability rules, but they are quite different. In the first, the consuming, striding expression /a/b/c is an argument to an operand role of absorption, which grounds it. In the second, only the last part, "child::c", is an argument to an operand role of absorption.

I don’t think they differ fundamentally, only cosmetically. In both cases I think users are entitled to expect that the resulting sequence will usually be pipelined, that is, processed one “c” element at a time. Saxon’s evaluation strategy for both expressions is almost exactly the same.

> The first case requires a processor to create a copy of the whole sequence at once.

I don’t think so. For example, if the expression is filtered as copy-of(/a/b/c)[price - discount = 0], then you can evaluate the predicate on each node copy as it is constructed.
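
As a rough illustration of that pipelining, here is a minimal sketch in Python. It is not Saxon's code: the functions stream_c_elements, copy_of and zero_net_price, the dict-shaped "nodes" and the events log are all hypothetical stand-ins, chosen only to show that each copy can be tested as soon as it is built, without first collecting the whole sequence.

```python
import copy
from typing import Dict, Iterable, Iterator, List

events: List[str] = []  # records the order of work, to make the pipelining visible


def stream_c_elements() -> Iterator[Dict]:
    """Hypothetical stand-in for streaming /a/b/c: yields one element at a time."""
    for node in (
        {"id": "c1", "price": 10, "discount": 10},
        {"id": "c2", "price": 20, "discount": 5},
        {"id": "c3", "price": 7, "discount": 7},
    ):
        events.append("read " + node["id"])
        yield node


def copy_of(nodes: Iterable[Dict]) -> Iterator[Dict]:
    """Copy each node lazily as it arrives; nothing is buffered here."""
    for node in nodes:
        yield copy.deepcopy(node)


def zero_net_price(copies: Iterable[Dict]) -> Iterator[Dict]:
    """Apply the predicate [price - discount = 0] to each copy as it is built."""
    for c in copies:
        events.append("test " + c["id"])
        if c["price"] - c["discount"] == 0:
            yield c


# Analogue of copy-of(/a/b/c)[price - discount = 0]
selected = [c["id"] for c in zero_net_price(copy_of(stream_c_elements()))]
```

In this sketch the events log alternates between "read" and "test" entries, so no more than one copy needs to be held at any moment.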
> 
> As a result, using last() in the second case has a different effect than using last() in the first case.

I don’t think so.
> 
> 
> Unless we make it a dynamic error to use last() in such cases of windowed streaming, I don't see a way around processors requiring to consume the whole stream *and* making sure that the whole stream stays in memory.

Saxon has three strategies for evaluating last(), and it decides between them fairly pragmatically:

(a) some iterators over sequences know the length of the sequence they are iterating over, e.g. an iterator over a fully-evaluated sequence in a variable or an iterator over a singleton.

(b) some iterators, when last() is requested, read ahead to discover its value and retain the results in memory to return on subsequent calls of next(). This is what’s likely to happen with both of the expressions above: the sequence containing copies of all the /a/b/c nodes will be held in memory (unless memory fills).

(c) some iterators clone themselves, so the work of computing the sequence is done twice: once to compute last() and once to deliver the items in the sequence. (This is what I found was happening with xsl:merge: the entire input file was read twice.)
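
The three strategies can be sketched roughly as follows. This is an illustrative Python model only, not Saxon's implementation: the class names KnownLengthIterator, LookaheadIterator and RecomputingIterator, and the helper counting_source, are invented for the example.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, List, Sequence


class KnownLengthIterator:
    """(a) Iterates over a fully evaluated sequence, so last() is just its length."""

    def __init__(self, items: Sequence):
        self._items = items
        self._pos = 0

    def last(self) -> int:
        return len(self._items)

    def __iter__(self):
        return self

    def __next__(self):
        if self._pos >= len(self._items):
            raise StopIteration
        item = self._items[self._pos]
        self._pos += 1
        return item


class LookaheadIterator:
    """(b) On last(), reads ahead to the end and buffers the items for later next() calls."""

    def __init__(self, source: Iterable):
        self._source = iter(source)
        self._buffer: deque = deque()
        self._delivered = 0

    def last(self) -> int:
        self._buffer.extend(self._source)  # costs memory: the rest of the input is retained
        return self._delivered + len(self._buffer)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._buffer.popleft() if self._buffer else next(self._source)
        self._delivered += 1
        return item


class RecomputingIterator:
    """(c) On last(), runs a second copy of the computation just to count the items."""

    def __init__(self, make_source: Callable[[], Iterable]):
        self._make_source = make_source
        self._source = iter(make_source())

    def last(self) -> int:
        return sum(1 for _ in self._make_source())  # costs time: the input is read again

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._source)


def counting_source(reads: List[int]) -> Callable[[], Iterator[str]]:
    """Hypothetical input factory that records how many times the input is opened."""
    def make() -> Iterator[str]:
        reads.append(1)
        return iter(["c1", "c2", "c3"])
    return make


reads: List[int] = []
it = RecomputingIterator(counting_source(reads))
size = it.last()    # opens the input a second time
items = list(it)    # delivers the items from the first reading
```

Strategy (c) needs an input that can be opened again, which is why the sketch takes a factory rather than an iterator; strategy (b) works on a one-shot input but holds everything after the current position in memory.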

Clearly (b) is expensive in memory and (c) is expensive in time. Neither can be considered fully streamable in the sense of being able to process an infinite non-rereadable input source. There are therefore going to be cases that fail, if only for lack of memory. So perhaps we ought simply to be permissive: if last() is called “while processing a grounded consuming sequence”, a processor MAY report a dynamic error.

Can we define that phrase “while processing a grounded consuming sequence” a bit more precisely? I’m struggling; I can’t see how to do that. I’m inclined to fall back on two statements: (a) anything can fail at any time if you run out of resources, and (b) using last() while processing a consuming expression is a bad idea because the processor may run out of resources.

Michael Kay
Saxonica

Received on Thursday, 1 October 2015 15:25:10 UTC