RE: streaming vs p:iteration-size() from Michael Sokolov on 2009-06-04 (xproc-dev@w3.org from June 2009)

From: Michael Sokolov <sokolov@ifactory.com>
Date: Thu, 4 Jun 2009 08:27:05 -0400
To: "'Norman Walsh'" <ndw@nwalsh.com>, "'XProc Dev'" <xproc-dev@w3.org>
Message-Id: <200906041215.n54CFDlS031884@hades.falutin.net>

Thanks for the clear and patient explanations.  I haven't yet read through
the entire spec very carefully, and hadn't yet gotten to the comment about
p:split-sequence.


-Mike

> -----Original Message-----
> From: xproc-dev-request@w3.org 
> [mailto:xproc-dev-request@w3.org] On Behalf Of Norman Walsh
> Sent: Thursday, June 04, 2009 6:40 AM
> To: XProc Dev
> Subject: Re: streaming vs p:iteration-size()
> 
> "Michael Sokolov" <sokolov@ifactory.com> writes:
> > It seems as if support for streaming implementations was a major
> > consideration in the design of xproc.  I wonder if the requirement to
> > support p:iteration-size() in the context of p:for-each and p:viewport
> > isn't at odds with the ability to create a streaming implementation
> > though.  For example, wouldn't an implementation be required to count
> > all the matches, thus parsing the entire document, before processing
> > any of them?
> 
> Yes. We've tried to design XProc so that a streaming
> implementation is possible, but that doesn't mean that every
> pipeline will stream. The same problem exists with last() in
> ordinary XPath predicates.
> 
> We call this out explicitly in the spec in, for example,
> p:split-sequence:
> 
>   Note
> 
>   In principle, this component cannot stream because it must buffer
>   all of the input sequence in order to find the context size. In
>   practice, if the test expression does not use the last() function,
>   the implementation can stream and ignore the context size.
> 
> > I haven't looked through any implementations to see what's going on
> > there, but this seems designed in to the spec anyway. Am I missing
> > something?
> 
> Nope. And FWIW, XML Calabash doesn't attempt to stream.
> 
> > I probably should add that the context for my question is trying to
> > understand the best way to write a "chunker" using xproc.  This is
> > often an early step in our pipelines: we take a very large document
> > and break it into many small documents, abandoning document structure
> > that is no longer useful to us in order to gain efficiency in querying
> > and later processing.  Of course one would prefer to do this in a
> > streaming fashion: typically we would write a SAX handler in Java.  I
> > think perhaps p:viewport combined with a secondary output port is the
> > approach, but I'm not sure, and wondering if that can be (is) done in
> > a memory-efficient way.
> 
> If you need to recombine the processed chunks, then 
> p:viewport is probably the easiest way. But if you just want 
> to chunk, and you can express the chunks with an XPath, you 
> can do it directly on p:input with a select expression.
> 
>                                         Be seeing you,
>                                           norm
> 
> --
> Norman Walsh <ndw@nwalsh.com> | Nothing will ever be attempted, if all
> http://nwalsh.com/            | possible objections must be first
>                               | overcome.--Dr. Johnson
> 
>
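
The trade-off discussed in the quoted message can be illustrated outside
XProc. What follows is only a minimal Python sketch, not XProc and not how
XML Calabash or any other implementation works. The names iter_chunks and
chunks_with_size are hypothetical, and the sketch assumes the chunks are
non-nested elements selected by tag name. It uses iterparse from the
standard library to emit chunks one at a time, and shows why reporting a
total, as p:iteration-size() or last() would, forces the whole sequence to
be buffered first.

```python
import io
import xml.etree.ElementTree as ET


def iter_chunks(source, tag):
    """Lazily yield each <tag> element as a serialized string.

    Assumes matching elements are not nested inside one another.
    For namespaced documents, tag would be "{namespace-uri}local-name".
    """
    # iterparse reads the input incrementally and reports each element
    # as soon as its end tag has been parsed.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            # Drop text that follows the element so it is not serialized
            # as part of the chunk.
            elem.tail = None
            yield ET.tostring(elem, encoding="unicode")
            # Release the element's children and text once emitted, so
            # memory use stays roughly proportional to one chunk.
            elem.clear()


def chunks_with_size(source, tag):
    """Yield (position, size, chunk) for every chunk.

    The total size is unknown until the input is exhausted, so every
    chunk is buffered before the first one is reported.
    """
    buffered = list(iter_chunks(source, tag))  # reads the whole document
    for position, chunk in enumerate(buffered, start=1):
        yield position, len(buffered), chunk


if __name__ == "__main__":
    doc = b"<book><chapter>a</chapter><chapter>b</chapter></book>"
    for position, size, chunk in chunks_with_size(io.BytesIO(doc), "chapter"):
        print(position, "of", size, chunk)
```

iter_chunks can hand its first chunk downstream before the rest of the
document has been read. chunks_with_size cannot, which corresponds to the
buffering described in the p:split-sequence note quoted above.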

Received on Thursday, 4 June 2009 12:27:43 UTC