Re: Comments from the XSLT WG on the XProc Last Call Document from Nikolay Fiykov on 2007-10-26 (public-xml-processing-model-comments@w3.org from October 2007)

From: Nikolay Fiykov <nikolay.fiykov@nsn.com>
Date: Fri, 26 Oct 2007 17:14:09 +0300
To: jeni@jenitennison.com
CC: public-xml-processing-model-comments@w3.org, w3c-xsl-wg@w3.org
Message-ID: <4721F631.2090303@nsn.com>

Hi Jeni,

> 3. The XProc specification does not make it clear if parallel executions
> are handled. (Currently there is implicit parallelism based on connection
> between steps.)  This would be a problem for any task involving multiple
> processing steps on top of streams.
>
> I don't understand this point (probably someone else on the XProc WG
> will, but I'll ask anyway). Can you (or anyone) expand, perhaps with an
> example?
>

The XSL WG is currently working on streaming transformations: everything
related to large or infinite input XML documents and to memory and time
constraints.
Several of the use cases we have gathered so far can be addressed by
combining pipelining with transformations.

Here is one such example.

Given the input "<root> <A/> <B/> <A/> <B/> ... </root>", produce two
output documents, one containing only the A elements and one containing
only the B elements: "<root> <A/> <A/> ... </root>" and
"<root> <B/> <B/> ... </root>".

This can be solved easily with two similar stylesheets filtering out A 
and B respectively.
A pipeline can be used in conjunction with XSLT to facilitate their 
execution.
For example we can use XProc with a pipeline modeled after "Example 5 A 
Sample For-Each".

The catch is that the input is too big to fit into memory.
We also have to assume that it can be read only once, i.e. it is a
single-pass stream feed.

Technically, this can be done only if we assume that the XProc processor
represents XML documents as streams of XML events (not as a DOM) and that
both transformations receive the input events simultaneously.
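
To make that assumption concrete, here is a minimal sketch in Python
(not XProc or XSLT). The function name split_stream and the tiny sample
input are made up for illustration, and the two outputs are collected in
lists only to keep the sketch short; a real streaming setup would write
each one to its own sink. It reads the parse events once and routes each
child of the root to the matching output as it arrives:

```python
import io
import xml.etree.ElementTree as ET

def split_stream(source, names=("A", "B")):
    """Read `source` once and route each child of the root to a per-name output."""
    outputs = {name: [] for name in names}
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        depth -= 1
        # After the decrement, depth == 1 means a direct child of the root just closed.
        if depth == 1:
            if elem.tag in outputs:
                text = ET.tostring(elem, encoding="unicode").strip()
                outputs[elem.tag].append(text)
            # Release the child's content; the emptied element stays attached to the root.
            elem.clear()
    return {name: "<root>" + "".join(items) + "</root>"
            for name, items in outputs.items()}

# Hypothetical sample input standing in for the (much larger) real feed.
feed = io.BytesIO(b"<root> <A/> <B/> <A/> <B/> </root>")
result = split_stream(feed)
print(result["A"])
print(result["B"])
```

Both outputs are produced from a single pass, which is the behaviour
the use case needs from the pipeline processor.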

Now, the spec is flexible enough about what an XML document is:
"What flows between steps are exclusively XML documents. The inputs and 
outputs can be implemented as sequences of characters, sequences of 
events, object models, or any other representation that the 
implementation chooses."

There is also guidance on how (essentially linear) execution should
happen:
"The result of evaluating a pipeline is the result of evaluating the 
steps that it contains, in the order determined by the connections 
between them. A pipeline must behave as if it evaluated each step each 
time it occurs."

What the spec lacks completely, though, is how parallel branches are to
be handled.
By "parallel branches" I mean the ones defined by "connection between
ports", not the conditional ones.

I'd argue that this is not entirely for implementors to choose, because
depending on the strategy we may get different end results for one and
the same pipeline.
In this case we receive either both stylesheets' results (if events are
distributed to both simultaneously) or only one of them (if the
implementation executes each XSLT as a separate step; the second result
will be empty because the first step has already consumed the stream).
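
The difference between the two strategies can be shown with a small
Python sketch. Everything in it is a stand-in: events() plays the
read-once input feed, keep() plays one filtering stylesheet, and the two
run_* functions are hypothetical scheduling strategies, not anything
defined by the XProc spec.

```python
def events():
    """A read-once feed of element names, standing in for the single-pass input."""
    yield from ["A", "B", "A", "B"]

def keep(name, stream):
    """Stand-in for one filtering stylesheet: keep only elements called `name`."""
    return [tag for tag in stream if tag == name]

def run_sequentially(stream):
    # Each step runs to completion before the next one starts, so the
    # first step drains the stream and the second one sees nothing.
    only_a = keep("A", stream)
    only_b = keep("B", stream)
    return only_a, only_b

def run_simultaneously(stream):
    # Every event is offered to both branches as it arrives.
    only_a, only_b = [], []
    for tag in stream:
        if tag == "A":
            only_a.append(tag)
        if tag == "B":
            only_b.append(tag)
    return only_a, only_b

print(run_sequentially(events()))    # (['A', 'A'], [])
print(run_simultaneously(events()))  # (['A', 'A'], ['B', 'B'])
```

The same pipeline over the same input yields different results, which
is why I think the strategy cannot be left entirely to the implementor.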

Further, there are similar questions about how to merge results from
parallel executions.
We have a few other use cases where stream merging (combining) would be
needed.
The spec says nothing about such cases either.

So, did that long explanation help, or is more needed?

Cheers, Nikolai

Received on Friday, 26 October 2007 14:14:28 UTC