- From: Alex Milowski <alex@milowski.org>
- Date: Thu, 05 Jan 2006 09:06:12 -0800
- To: public-xml-processing-model-wg@w3.org
Requirement:
The pipeline language must allow a user to identify a subtree
of a document by an XPath or XPath subset that produces a
sequence. It should then be possible to feed this sequence to
a sub-pipeline or sequence of pipeline steps.
Use Case:
Example Problem (Personal Story):
I wrote this HMM Baum-Welch trainer and implemented logging
of the training steps in XML. In the end, I had a log file that
was a 200-300MB XML document (or larger). The required next
step was to transform that document into a data file that R or Matlab
could load (a plain text file). Just running XSLT on the whole thing
isn't realistic. All I really needed to do was transform a
particular element that is repeated over and over again in this
large XML log file. So, I wanted to scope the XSLT to that
element and produce the text-transformed result on the little
bits of the document.
Pipeline Solution Example:
<subtree select="training-scenario">
<xslt src="scenario2text-xt.xsl"/>
</subtree>
The 'subtree' step applies the XPath expression 'training-scenario'
to the input in a streaming fashion. The matching info items (i.e.
the 'training-scenario' elements) are produced as a sequence of
small XML document infosets, each with a 'training-scenario'
element as its document element. When the XSLT step runs, its
"adapter" buffers each streamed infoset into a "DOM" so that XSLT
can run on that whole small document. Since each such document is
tiny, the pipeline can process the large XML data document (of
arbitrary size) in constant memory.
As I understand it, this is very similar to the 'for-each' step
in Orbeon's pipeline language.
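
For illustration only, the streaming behaviour described for the
'subtree' step can be sketched in Python with the standard library's
incremental parser. This is not a proposed implementation. It assumes
the 'training-scenario' elements are direct children of the document
element, it matches on the element name rather than evaluating a real
XPath, and scenario_to_text is a hypothetical stand-in for the XSLT
step. The sample log and its child element names ('iteration',
'loglik') are invented for the example.

```python
import io
import xml.etree.ElementTree as ET


def iter_subtrees(source, tag):
    """Yield each element named `tag` as soon as its end tag is parsed."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the document element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            # Assumption: matches are direct children of the root, so
            # clearing the root discards everything parsed so far.
            root.clear()


def scenario_to_text(elem):
    """Hypothetical stand-in for the XSLT step: one text line per subtree."""
    return " ".join((child.text or "").strip() for child in elem)


def transform_stream(source, tag="training-scenario"):
    """Run the stand-in transform on each small subtree, one at a time."""
    for elem in iter_subtrees(source, tag):
        yield scenario_to_text(elem)


# Invented sample input; a real log would be read from a file instead.
SAMPLE = b"""<log>
  <training-scenario><iteration>1</iteration><loglik>-120.5</loglik></training-scenario>
  <note>ignored</note>
  <training-scenario><iteration>2</iteration><loglik>-98.2</loglik></training-scenario>
</log>"""

for line in transform_stream(io.BytesIO(SAMPLE)):
    print(line)
```

Only one small subtree is held in memory at a time, which is the
property the 'subtree' step is meant to give the XSLT step.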
--Alex Milowski
Received on Thursday, 5 January 2006 17:06:30 UTC