Providing for journalling of intermediate document (streams) from Henry S. Thompson on 2007-05-03 (public-xml-processing-model-wg@w3.org from May 2007)

From: Henry S. Thompson <ht@inf.ed.ac.uk>
Date: Thu, 03 May 2007 13:32:24 +0100
To: public-xml-processing-model-wg@w3.org
Message-ID: <f5bwszqhyuf.fsf@hildegard.inf.ed.ac.uk>
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I took an action last week [1] to consider whether there was a way to
integrate the desired functionality of Norm's proposed 'tee' component [2]
more fully into the language.

Consider the first sample pipeline from our spec:

  <p:pipeline name="fig1" xmlns:p="http://www.w3.org/2007/03/xproc">
    <p:input port="source" sequence="no"/>
    <p:input port="schemaDoc" sequence="yes"/>
    <p:output port="result" sequence="no"/>

    <p:xinclude name="s1">
      <p:input port="source">
        <p:pipe step="fig1" port="source"/>
      </p:input>
    </p:xinclude>

    <p:validate-xml-schema name="s2">
      <p:input port="schema">
        <p:pipe step="fig1" port="schemaDoc"/>
      </p:input>
    </p:validate-xml-schema>
  </p:pipeline>

Suppose I want to see the intermediate document, that is, the output
of the xinclude.

Norm's proposal would mean adding the following step in the middle:

 <p:tee>
  <p:option name="href" value="inter.xml"/>
 </p:tee>

[I note in passing that as proposed p:tee doesn't handle document
sequences, and it's not obvious how putting it inside a p:for-each
would help. . .]

[A further note -- it seems likely that as defined p:tee would be
sub-optimal inside any kind of iteration (for-each or viewport),
because, presumably, each doc. through the inner pipe would overwrite
the previous one]

So, alternative proposals. . .

1) Since I at least normally think of journalling as something to capture
the _output_ of a step, we could add an optional element to the
content model for steps:

 <p:journal port="..." href="..."/>

This would give us, for the sample pipeline above

    <p:xinclude name="s1">
      <p:input port="source">
        <p:pipe step="fig1" port="source"/>
      </p:input>
      <p:journal port="result" href="inter.xml"/>
    </p:xinclude>

The issues wrt sequences still arise, but if we allowed 
p:journal at the start of a p:for-each or p:viewport, we could at
least in principle see even what's happening at the beginning:

   <p:for-each...>
    ...
    <p:journal port="current" href="inter.xml"/>
 
or the end

   <p:for-each...>
    <p:output port="result">
     ...
    </p:output>
    ...
    <p:journal port="result" href="inter.xml"/>

2) Alternatively, we could say that journalling is associated with
   pipes, and simply add an optional 'journal' attribute to p:pipe,
   e.g.

    <p:validate-xml-schema name="s2">
      <p:input port="source">
        <p:pipe step="s1" port="result" journal="inter.xml"/>
      </p:input>
      <p:input port="schema">
        <p:pipe step="fig1" port="schemaDoc"/>
      </p:input>
    </p:validate-xml-schema>

As well as adding a p:input, this would require the preceding step to
be named, if it wasn't already.

- ----------

On balance, I prefer (1), because it's lower overhead syntactically.

Whichever way we go, I think we need to bite the sequence and
iteration bullets -- I propose that we say that the semantics of
journalling include the requirement that implementations avoid
over-writing the target if at all possible, at least within a single
pipeline evaluation episode.  The way they do this is
implementation-defined (and perhaps platform-dependent) -- if they have
a versioning filesystem available, they can use it.  Otherwise, a recommended
approach might be to call the first output e.g. inter.xml, the second
inter_2.xml, the third inter_3.xml, and so on.  Or we could refer to
the widely available facility of generating unique 'temporary'
filenames with a fixed component. . .
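
A minimal sketch of the two naming strategies mentioned above, written in
Python purely for illustration.  The names JournalNamer and unique_target
are hypothetical and not part of any proposal; the first follows the
inter.xml, inter_2.xml, inter_3.xml pattern, the second leans on the
platform's temporary-file facility while keeping a fixed component of the
requested name.

```python
import os
import tempfile


class JournalNamer:
    """Hand out non-colliding target names for one journal href (hypothetical helper)."""

    def __init__(self, href):
        # Split "inter.xml" into ("inter", ".xml") so a counter can be inserted.
        self.stem, self.ext = os.path.splitext(href)
        self.count = 0

    def next_target(self):
        # The first document keeps the requested name; later ones get _2, _3, ...
        self.count += 1
        if self.count == 1:
            return self.stem + self.ext
        return f"{self.stem}_{self.count}{self.ext}"


def unique_target(href, directory):
    """Create an empty, uniquely named file that keeps the stem and extension of href."""
    stem, ext = os.path.splitext(os.path.basename(href))
    # mkstemp creates the file itself, so the returned name is not yet in use.
    fd, path = tempfile.mkstemp(prefix=stem + "_", suffix=ext, dir=directory)
    os.close(fd)
    return path


if __name__ == "__main__":
    namer = JournalNamer("inter.xml")
    for _ in range(3):
        print(namer.next_target())
```

The counter only protects against overwriting within a single pipeline
evaluation; it says nothing about several evaluations running at once.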

There's an even worse problem which is shared with 'store' -- what if
anything do we say about what happens if multiple pipeline evaluations
are happening at the same time?

An alternative approach would be to document p:for-each and p:viewport
as always binding a parameter/option whose name is p:i_[stepname] to
the index of the document passing through their subpipe, and
furthermore specifying that the 'href' attribute of p:journal is
treated as an attribute value template.  Then you could write e.g.

   <p:journal port="current" href="inter_{$p:i_chapters}.xml"/>

Having such a binding convention might be a good idea in any case.
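
To make the attribute-value-template idea concrete, here is a rough
sketch in Python of how an implementation might expand such an href.  It
assumes a for-each step named "chapters" and handles only the simple
{$name} form shown above; a real attribute value template would allow
arbitrary XPath expressions.  The function expand_href and the bindings
dictionary are hypothetical.

```python
import re

# Matches a simple variable reference such as {$p:i_chapters}.
_AVT_VAR = re.compile(r"\{\$([^{}]+)\}")


def expand_href(template, bindings):
    """Replace each {$name} in template with the value bound to name."""

    def substitute(match):
        # An unbound name raises KeyError rather than being silently dropped.
        return str(bindings[match.group(1)])

    return _AVT_VAR.sub(substitute, template)


if __name__ == "__main__":
    template = "inter_{$p:i_chapters}.xml"
    # Simulate three documents passing through the "chapters" for-each.
    for index in range(1, 4):
        print(expand_href(template, {"p:i_chapters": index}))
```

With this approach the pipeline author, rather than the implementation,
decides how the targets for successive documents are told apart.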

ht

[1] http://www.w3.org/2007/04/26-xproc-minutes.html#action01
[2] http://lists.w3.org/Archives/Public/public-xml-processing-model-wg/2007Apr/0138.html
- -- 
 Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
                     Half-time member of W3C Team
    2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
            Fax: (44) 131 650-4587, e-mail: ht@inf.ed.ac.uk
                   URL: http://www.ltg.ed.ac.uk/~ht/
[mail really from me _always_ has this .sig -- mail without it is forged spam]
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)

iD8DBQFGOdZYkjnJixAXWBoRAm4kAJ9iX2TerrXa2GH0GgVDt7rVt22EFQCcDWoz
3EoLhUz48Ir60A37PZZ4tag=
=sYXo
-----END PGP SIGNATURE-----
Received on Thursday, 3 May 2007 12:32:26 UTC