Re: streaming vs p:iteration-size() from Norman Walsh on 2009-06-04 (xproc-dev@w3.org from June 2009)

From: Norman Walsh <ndw@nwalsh.com>
Date: Thu, 04 Jun 2009 06:40:25 -0400
To: XProc Dev <xproc-dev@w3.org>
Message-ID: <m2skigp9ie.fsf@nwalsh.com>

"Michael Sokolov" <sokolov@ifactory.com> writes:
> It seems as if support for streaming implementations was a major
> consideration in the design of xproc.  I wonder if the requirement to
> support p:iteration-size() in the context of p:for-each and p:viewport isn't
> at odds with the ability to create a streaming implementation though.  For
> example, wouldn't an implementation be required to count all the matches,
> thus parsing the entire document, before processing any of them?

Yes. We've tried to design XProc so that a streaming implementation is
possible, but that doesn't mean that every pipeline will stream. The same
problem exists with last() in ordinary XPath predicates.

We call this out explicitly in the spec in, for example,
p:split-sequence:

  Note

  In principle, this component cannot stream because it must buffer
  all of the input sequence in order to find the context size. In
  practice, if the test expression does not use the last() function,
  the implementation can stream and ignore the context size.
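
As a rough illustration of the distinction the note draws, here is a
minimal sketch of a pipeline that uses p:split-sequence. It assumes the
XProc 1.0 syntax as eventually published (the version attribute in
particular), and the test expressions are made up for the example. The
first test depends only on position(), so a streaming implementation
could in principle route each document as it arrives; the alternative
shown in the comment needs the context size, so the whole sequence has
to be buffered first.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
  <p:input port="source" sequence="true"/>
  <p:output port="result" sequence="true"/>

  <!-- Uses position() only: the context size is never consulted,
       so nothing forces the step to read ahead. -->
  <p:split-sequence test="position() &lt;= 2"/>

  <!-- A test such as "position() = last()" would instead require the
       step to see the entire input sequence before emitting anything. -->
</p:declare-step>
```

Whether a given processor actually takes advantage of the first case is
up to the implementation.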

> I haven't looked through any implementations to see what's going on there,
> but this seems designed in to the spec anyway. Am I missing something?

Nope. And FWIW, XML Calabash doesn't attempt to stream.

> I probably should add that the context for my question is trying to
> understand the best way to write a "chunker" using xproc.  This is often an
> early step in our pipelines: we take a very large document and break it into
> many small documents, abandoning document structure that is no longer useful
> to us in order to gain efficiency in querying and later processing.  Of
> course one would prefer to do this in a streaming fashion: typically we
> would write a SAX handler in Java.  I think perhaps p:viewport combined with
> a secondary output port is the approach, but I'm not sure, and wondering if
> that can be (is) done in a memory-efficient way.

If you need to recombine the processed chunks, then p:viewport is probably
the easiest way. But if you just want to chunk, and you can express the
chunks with an XPath, you can do it directly on p:input with a select
expression.
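
A minimal sketch of the select approach, assuming XProc 1.0 syntax as
eventually published and a hypothetical input document of the form
/book/chapter (the element names and the step name "main" are
illustrative only). Each node matched by the select expression becomes
its own document in the sequence that reaches the step:

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                name="main" version="1.0">
  <p:input port="source"/>
  <!-- The select can match many nodes, so the output is a sequence. -->
  <p:output port="result" sequence="true"/>

  <p:identity>
    <!-- Each /book/chapter element (hypothetical names) is handed to
         the step as a separate document. -->
    <p:input port="source" select="/book/chapter">
      <p:pipe step="main" port="source"/>
    </p:input>
  </p:identity>
</p:declare-step>
```

Whether this runs in a memory-efficient way depends on the processor; a
non-streaming implementation will still load the whole source document
before evaluating the select expression.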

                                        Be seeing you,
                                          norm

-- 
Norman Walsh <ndw@nwalsh.com> | Nothing will ever be attempted, if all
http://nwalsh.com/            | possible objections must be first
                              | overcome.--Dr. Johnson

Received on Thursday, 4 June 2009 10:41:11 UTC