Re: result documents in an XSLT step? from David Birnbaum on 2020-11-01 (xproc-dev@w3.org from November 2020)

From: David Birnbaum <djbpitt@gmail.com>
Date: Sun, 1 Nov 2020 10:13:35 -0500
To: Geert Bormans <geert@gbormans.telenet.be>
Cc: XProc Dev <xproc-dev@w3.org>
Message-ID: <CAP4v81oS8dMeejwjrcHKvGOyty_vrpnWvyByLdDc2U1=MoALFw@mail.gmail.com>

Dear Geert (cc xproc-dev),

I fear that I may be expecting XProc to behave like XSLT in situations
where that's a faulty assumption, but when I try the following, it raises
an error (details below). I have revised the input XML so that the distinct
"paradigm" values are now child <paradigm> elements of <item> elements (in
my earlier posting they were attributes), and I have upgraded Morgana to
0.9.4.8 and Saxon EE to 10.1 (which is the most recent version currently
supported by Morgana).

My attempt at <p:for-each> is:

  <p:for-each name="loop">
    <p:with-input select="distinct-values(//paradigm)">
      <p:pipe step="normalize" port="result"/>
    </p:with-input>
    <p:variable name="current-paradigm" as="xs:string" select="."/>
    <p:filter name="filtering"
              select="descendant::item[paradigm eq $current-paradigm]">
      <p:with-input port="source">
        <p:pipe step="normalize" port="result"/>
      </p:with-input>
    </p:filter>
    <p:xslt name="generate">
      <p:with-input port="stylesheet" href="verb-generate.xsl"/>
    </p:xslt>
  </p:for-each>

This raises an error on the filter line: "$current-paradigm is not declared
or not visible in this context" (XD0016).

I had thought that this would work because I expected that the for-each
step would operate on each distinct value of <paradigm>, in turn; that the
variable $current-paradigm would be set to that value on each pass through
the loop; that the filter step would then have access to the variable
value; and that the XSLT step would then operate on the result of the
filter step. (At the moment I am not trying to store the result; I expected
to see it on stdout.) My assumption about how to filter inside a for-each
step is obviously wrong, but I don't understand why. The "normalize" step
is an XSLT step that outputs the XML to be filtered and transformed.
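
One possible explanation for the XD0016 error, offered as an assumption rather than something confirmed in this thread: the select option of p:filter is a string that the step evaluates on its own, where pipeline variables are not in scope. Under that assumption, writing the variable reference as a value template (braces around the variable) would let the processor expand it before the step sees the expression. A minimal sketch of the changed step, reusing the names from the pipeline above:

```xml
<!-- Sketch only: assumes option attributes are treated as value templates
     and that paradigm values contain no apostrophes. -->
<p:filter name="filtering"
          select="descendant::item[paradigm eq '{$current-paradigm}']">
  <p:with-input port="source">
    <p:pipe step="normalize" port="result"/>
  </p:with-input>
</p:filter>
```

With $current-paradigm set to 1a, the step would then receive the expression descendant::item[paradigm eq '1a'].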

As an ancillary issue, when I change the filter to hard-code a specific
paradigm, I get a Java heap error, and cranking the Java memory up to 16G
(the machine has 32G) with -Xmx16G on the "java" line in Morgana.sh doesn't
help; the same error is raised.

The XProc modification is:

<p:filter name="filtering" select="descendant::item[paradigm eq '1a']">

and the error is:

[09:58:03.375] Generating verb forms
Exception in Fiber "fiber-10000016" java.lang.OutOfMemoryError: Java heap space
at net.sf.saxon.tree.tiny.TinyTree.<init>(TinyTree.java:193)
at net.sf.saxon.tree.tiny.TinyBuilder.open(TinyBuilder.java:124)
at net.sf.saxon.event.SequenceWriter.createTree(SequenceWriter.java:103)
at net.sf.saxon.event.SequenceWriter.startDocument(SequenceWriter.java:55)
at net.sf.saxon.event.ProxyReceiver.startDocument(ProxyReceiver.java:106)
at com.xml_project.morganaxproc3.saxon10connector.Saxon10Core.treeWalk(Saxon10Core.java:297)
at com.xml_project.morganaxproc3.saxon10connector.Saxon10Core.convertToSaxon(Saxon10Core.java:282)
at com.xml_project.morganaxproc3.saxon10connector.Saxon10Core.convertToSaxon(Saxon10Core.java:199)
at com.xml_project.morganaxproc3.saxon10connector.Saxon10Stylesheet.applyTemplates(Saxon10Stylesheet.java:213)
at com.xml_project.morganaxproc3.steplibraries.standardsteps.XSLTStep$1.run(Unknown Source)
at com.xml_project.morganaxproc3.steplibraries.AtomicXProcStepImplementation.perform(Unknown Source)
at com.xml_project.mopl.steps.MoPLLibraryStep.run(Unknown Source)
at com.xml_project.mopl.runtime.LibraryStepActor.startIt(Unknown Source)
at com.xml_project.mopl.runtime.BufferingActor.checkRun(Unknown Source)
at com.xml_project.mopl.runtime.BufferingActor.doRun(Unknown Source)
at com.xml_project.mopl.runtime.BufferingActor.doRun(Unknown Source)
at co.paralleluniverse.actors.Actor.run0(Actor.java:710)
at co.paralleluniverse.actors.ActorRunner.run(ActorRunner.java:51)
at co.paralleluniverse.fibers.Fiber.run(Fiber.java:1097)

Saxon, using the default memory (that is, without any -Xmx option),
completes the transformation without error. The XSLT is non-streaming, and
I'd like to keep it that way, if possible, so that the pipeline will also
run under Saxon HE.

Best,

David

On Sat, Oct 31, 2020 at 4:30 PM David Birnbaum <djbpitt@gmail.com> wrote:

> Dear Geert (cc xproc-dev),
>
> Thank you for this suggestion! Gerrit's advice about URI expectations and
> storing result documents resolves the issues I reported, but it also feels
> more direct to do the grouping (even if indirectly, by way of filtering)
> inside XProc, since that puts the XSLT in charge only of transformation,
> and lets XProc oversee the file management details. I will try your
> filtering suggestion, as well, and report the results, probably tomorrow.
>
> Best,
>
> David
>
> On Sat, Oct 31, 2020 at 3:20 PM Geert Bormans <geert@gbormans.telenet.be>
> wrote:
>
>> Hi David,
>>
>> Have you considered doing a...
>> p:for-each on the distinct values of the item/@paradigm in the source XML
>> have a p:xslt inside the p:for-each that takes the paradigm as a filter
>> parameter (so don't group but filter)
>> and p:store the result inside the for-each
>>
>> Best regards,
>>
>> Geert Bormans
>>
>> ----- On 31 Oct 2020 at 20:02, David Birnbaum <djbpitt@gmail.com> wrote:
>>
>> Dear xproc-dev,
>> I would be grateful for advice about how best to manage a pipeline that
>> requires me to generate and then continue to process multiple output
>> documents from a single input. The input contains 110k <item> elements that
>> are distinguished by a @paradigm attribute on the <item> element; there are
>> about 150 different @paradigm values in the input. I would like to group
>> the <item> elements by their @paradigm values, process each group, and
>> write the outputs for each group separately to disk. I would also like to
>> run another transformation over those outputs and write the results of that
>> transformation to disk, as well. I have poked at the following approaches
>> and run into trouble with both of them, probably because (or, at least,
>> partially because) I am not (yet, I hope!) very adept at XProc:
>>
>> 1. Within the XProc, I run an XSLT step that uses <xsl:for-each-group>
>> and <xsl:result-document> to create separate output for each group, with
>> constructed output @href values. This errors out with:
>>
>> <c:errors xmlns:c="http://www.w3.org/ns/xproc-step">
>>   <c:error code="err:XC0121" name="generate" type="p:xslt"
>>            href="file:///Users/djb/repos/cz/pos/verb/verb.xpl"
>>            line="64" column="27"
>>            xmlns:p="http://www.w3.org/ns/xproc"
>>            xmlns:err="http://www.w3.org/ns/xproc-error">
>>     <message>URI '/Users/djb/repos/cz/output/verb-1a.xml' of secondary result is not valid or not absolute.</message>
>>   </c:error>
>> </c:errors>
>>
>> I had first tried a relative path for the @href on the
>> <xsl:result-document>, and I thought the error message meant that there was
>> no base URI within the pipeline, so I specified an absolute path
>> instead, but, as seen above, that raises the same error. I did specify a
>> secondary port in the XProc with:
>>
>> <p:output port="secondary" sequence="true"/>
>>
>> but that seems to have no effect on the outcome (perhaps I specified it
>> in the wrong place?). I think I should be able to write multiple result
>> documents, and that I have misunderstood something about how to set that
>> up. For what it's worth, I also think I may need a <p:store> step to save
>> the multiple result documents, and although I've used <p:store>
>> successfully with single outputs, I don't know what it should look like to
>> save a set of result documents. But if I've understood the error correctly,
>> I'm stalled on the XSLT step, and need to get past that first.
>>
>> 2. As an alternative to <xsl:for-each-group> inside the XSLT stylesheet,
>> I considered doing the grouping in XProc, but I don't see anything within
>> XProc comparable to <xsl:for-each-group>. If I am reading the description
>> correctly, a <p:for-each> step might let me loop over <item> elements, but
>> it does not appear to have the ability to form the <item> elements into
>> groups according to shared @paradigm values and loop over those groups. I
>> could run an XSLT pre-processing step to do the grouping, all within our
>> document, creating an intermediate hierarchical level (called, say,
>> <group>) and then use <p:for-each> to loop over those, but that extra step
>> feels to me like a hack, that is, as if there should be a more direct way
>> to do what I need. Should I ignore that feeling?
>>
>> Assuming I can get the individual result documents written to disk, I
>> think I can do the subsequent transformation with a <p:for-each> step.
>>
>> I am using MorganaXProc-IIIse 0.9.4.2-beta and Saxon EE 10.0, and running
>> from the command line under MacOS 10.15.7. Thanks in advance for any
>> pointers in The Right Direction.
>>
>> Best,
>>
>> David
>> djbpitt@gmail.com
>>
>>
Received on Sunday, 1 November 2020 15:14:01 UTC