- From: Alex Milowski <alex@milowski.org>
- Date: Thu, 05 Jan 2006 09:06:12 -0800
- To: public-xml-processing-model-wg@w3.org
Requirement:

The pipeline language must allow a user to identify a subtree of a document by an XPath or XPath subset that produces a sequence. It should then be possible to feed this sequence to a sub-pipeline or a sequence of pipeline steps.

Use Case:

Example Problem (Personal Story):

I wrote an HMM Baum-Welch trainer and implemented logging of the training steps in XML. In the end, I had a log file that was a 200-300MB XML document (or larger). The required next step was to transform that document into a data file that R or Matlab could load (a plain text file).

Just running XSLT on the whole thing isn't realistic. All I really needed to do was transform a particular element that is repeated over and over again in this large XML log file. So, I wanted to scope the XSLT to that element and produce the text-transformed result from those little bits of the document.

Pipeline Solution Example:

   <subtree select="training-scenario">
      <xslt src="scenario2text-xt.xsl"/>
   </subtree>

The 'subtree' step applies the XPath expression 'training-scenario' in a streaming fashion to the input. The matching info items (i.e. the 'training-scenario' elements) are produced as a sequence of little XML document infosets, in each of which a 'training-scenario' element is the document element.

When the XSLT step runs, the "adapter" for it caches the stream for each such infoset into a "DOM" so that XSLT can run on the whole (small) document. Since each of those documents is tiny, the pipeline can process the large XML data document (of arbitrary size) in constant memory.

As I understand it, this is very similar to the 'for-each' step in Orbeon's pipeline language.

--Alex Milowski
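
For illustration only, the streaming behaviour described for the 'subtree' step can be approximated outside any pipeline language. The sketch below is not the proposed implementation: it uses Python's standard xml.etree.ElementTree.iterparse, matches on a plain element name rather than a real XPath, and replaces the XSLT step with a hypothetical function scenario_to_text. The 'id' attribute and 'likelihood' child are invented sample content, not the actual log format.

```python
import io
import xml.etree.ElementTree as ET


def iter_subtrees(source, tag):
    """Yield each outermost `tag` element from `source`, one at a time.

    Finished elements are detached from their parent, so memory use is
    bounded by the largest matching subtree, not by the whole document.
    """
    stack = []  # currently open elements, outermost first
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem)
            continue
        stack.pop()  # `elem` has just been closed
        inside_match = any(open_elem.tag == tag for open_elem in stack)
        if elem.tag == tag and not inside_match:
            yield elem  # a small, complete subtree (the "DOM" for one match)
        if not inside_match and stack:
            # Drop the finished element so the partial tree does not grow.
            stack[-1].remove(elem)


def scenario_to_text(scenario):
    # Hypothetical stand-in for scenario2text-xt.xsl; the attribute and
    # child element names here are assumptions made for this example.
    return "{}\t{}".format(scenario.get("id"), scenario.findtext("likelihood"))


def transform_log(source):
    """Lazily produce one text line per 'training-scenario' element."""
    for scenario in iter_subtrees(source, "training-scenario"):
        yield scenario_to_text(scenario)


# Invented sample input standing in for the large training log.
SAMPLE = b"""<log>
  <training-scenario id="1"><likelihood>-12.5</likelihood></training-scenario>
  <note>ignored</note>
  <training-scenario id="2"><likelihood>-9.75</likelihood></training-scenario>
</log>"""

if __name__ == "__main__":
    for line in transform_log(io.BytesIO(SAMPLE)):
        print(line)
```

Because transform_log is a generator, each line can be written out as soon as its element has been parsed, which is the property that keeps memory use independent of the size of the log.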
Received on Thursday, 5 January 2006 17:06:30 UTC