Requirement: Subtree Processing Requirement & Use Case

From: Alex Milowski <alex@milowski.org>
Date: Thu, 05 Jan 2006 09:06:12 -0800
Message-ID: <43BD5204.10500@milowski.org>
To: public-xml-processing-model-wg@w3.org

Requirement:

    The pipeline language must allow a user to identify a subtree
    of a document by an XPath or XPath subset that produces a
    sequence.  It should then be possible to feed this sequence to
    a sub-pipeline or a sequence of pipeline steps.

Use Case:

   Example Problem (Personal Story):

   I wrote an HMM Baum-Welch trainer and implemented logging
   of the training steps in XML.  In the end, I had a log file that
   was a 200-300 MB XML document (or larger).  The required next
   step was to transform that document into a data file that R or Matlab
   could load (a plain text file).  Just running XSLT on the whole thing
   isn't realistic.  All I really needed to do was transform a
   particular element that is repeated over and over again in this
   large XML log file.  So, I wanted to scope the XSLT to that
   element and produce the text-transformed result from those small
   pieces of the document.

   Pipeline Solution Example:

   <subtree select="training-scenario">
     <xslt src="scenario2text-xt.xsl"/>
   </subtree>

   The 'subtree' step applies the XPath expression 'training-scenario'
   to the input in a streaming fashion.  The matching info items (i.e.
   the 'training-scenario' elements) are produced as a sequence of
   small XML document infosets, each with a 'training-scenario'
   element as its document element.  When the XSLT step runs, its
   "adapter" buffers each streamed infoset into a "DOM" so that XSLT
   can run on that whole small document.  Since each such document
   is tiny, the pipeline can process the large XML data document (of
   arbitrary size) in constant memory.
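
   To make the streaming behaviour concrete, here is a minimal sketch
   in Python of the same idea, using the standard library's
   xml.etree.ElementTree.iterparse.  It is not the proposed pipeline
   language or any existing implementation.  The names iter_subtrees
   and scenario_to_text are hypothetical; scenario_to_text merely
   stands in for the XSLT step (the stylesheet's real output format is
   not shown in this message), and the sketch assumes that matching
   elements are not nested inside one another.

```python
import xml.etree.ElementTree as ET


def iter_subtrees(source, tag):
    """Yield each element named `tag` from `source` as a small, fully built tree.

    Hypothetical helper: everything outside the matching subtrees is
    discarded as soon as it has been parsed, so memory use is bounded by
    the largest matching subtree plus the chain of open ancestors.
    """
    open_elems = []  # stack of currently open elements (ancestors)
    inside = 0       # number of currently open matching elements
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            open_elems.append(elem)
            if elem.tag == tag:
                inside += 1
            continue

        # "end" event: the element and all of its children are complete.
        open_elems.pop()
        parent = open_elems[-1] if open_elems else None
        if elem.tag == tag:
            inside -= 1
            yield elem  # hand the small subtree to the next step
            if parent is not None:
                parent.remove(elem)  # detach it so it can be freed
        elif inside == 0 and parent is not None:
            # Not part of any matching subtree: drop it immediately.
            parent.remove(elem)


def scenario_to_text(elem):
    """Hypothetical stand-in for the XSLT step: one tab-separated line
    built from the text of the element's direct children."""
    return "\t".join((child.text or "").strip() for child in elem)
```

   A caller would loop over iter_subtrees(path, "training-scenario")
   and write scenario_to_text(elem) for each element to the output
   file, finishing with one subtree before asking for the next.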

   As I understand it, this is very similar to the 'for-each' step
   in Orbeon's pipeline language.



--Alex Milowski
Received on Thursday, 5 January 2006 17:06:30 GMT
