W3C home > Mailing lists > Public > public-xml-processing-model-wg@w3.org > February 2006

A "processing model" proposal

From: Norman Walsh <Norman.Walsh@Sun.COM>
Date: Thu, 16 Feb 2006 09:48:33 -0500
To: public-xml-processing-model-wg@w3.org
Message-ID: <87r763z7em.fsf@nwalsh.com>
I've been trying to think of a way to simplify our underlying
processing model. I'd like to avoid the whole notion of dependencies
and backwards/forwards chaining, etc., if possible.

Here's my current idea.

Imagine that we define a component in the system called the "pool
manager". The pool manager's job is to provide infosets. You hand it a
URI and it returns an XML infoset. We say nothing about how it builds
the infoset; if it turns CSV files into XML, more power to it. All
components get their infosets from the pool manager.
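To make the idea concrete, here's a minimal sketch of what a pool manager might look like. Everything here is hypothetical: the class name, the choice to represent infosets as parsed XML trees, and the notion that unknown URIs are simply parsed as local files are all assumptions, not part of the proposal.

```python
import xml.etree.ElementTree as ET


class PoolManager:
    """Hands out infosets by URI; how it builds them is unspecified."""

    def __init__(self):
        self._pool = {}        # URI -> infoset
        self.anonymous = None  # the distinguished anonymous infoset

    def get(self, uri):
        # Here we just parse the URI as a local XML file; a real
        # implementation could as well turn a CSV file into XML.
        if uri not in self._pool:
            self._pool[uri] = ET.parse(uri).getroot()
        return self._pool[uri]

    def register(self, uri, infoset):
        # Components notify the pool manager about infosets they create.
        self._pool[uri] = infoset
```

A component would call `get()` for its named inputs and `register()` for any named outputs it wants to make available to later stages.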

The pool manager has one distinguished infoset, the anonymous infoset.
The initial value of this anonymous infoset is implementation
dependent. Each component can consume the anonymous infoset and
produce (exactly) one new anonymous infoset, which will become the
pool manager's anonymous infoset after it finishes. The anonymous
infoset acts like stdin/stdout, basically.

When the pipeline finishes, what the pool manager does with the
anonymous infoset it's left with is implementation dependent.

Components can naturally consume and produce other things as well, but
all but one of them must be named with URIs. The components are
responsible for notifying the pool manager about any new infosets that
they create (if those infosets are expected to be available for
subsequent processing, which I imagine to be the normal case).

Steps in the pipeline are processed in document order. If we allow
sub-pipelines, we'll have to talk about the nature of processing in
that case. And if we allow iteration or conditional processing, we'll
have to outline that too. But the basic idea is document order. If a
processor is smart and can work out better arrangements, fine, but the
results must be as if the stages had been processed in document order.
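The document-order rule is simple enough to sketch in a few lines. This is only an illustration of the "as if in document order" semantics; the stage representation (plain functions threading the anonymous infoset, stdin/stdout style) is my own invention for the sketch.

```python
def run_pipeline(stages, pool):
    """Run each stage strictly in document order. Each stage consumes
    the pool's anonymous infoset and produces exactly one new one,
    which becomes the anonymous infoset for the next stage."""
    for stage in stages:
        pool.anonymous = stage(pool.anonymous)
    return pool.anonymous


class Pool:
    # The initial value of the anonymous infoset is implementation
    # dependent; a string stands in for an infoset in this sketch.
    anonymous = "input"


pool = Pool()
result = run_pipeline(
    [lambda i: i + " -> validated",
     lambda i: i + " -> xincluded"],
    pool,
)
```

A smarter processor could reorder or parallelize stages, provided the observable result is the same as this loop.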

So here's a valid pipeline:

  <p:pipeline>
    <p:stage name="validate"/>
    <p:stage name="xinclude"/>
    <p:stage name="validate"/>
  </p:pipeline>

Validation takes the anonymous infoset and produces a new one. Ditto
XInclude. So this pipeline performs validation, xinclude, and validation
on the anonymous infoset and produces an anonymous infoset.

Here's another pipeline:

  <p:pipeline>
    <p:stage name="validate">
      <p:input href="someURI"/>
    </p:stage>
    <p:stage name="xinclude"/>
    <p:stage name="validate"/>
  </p:pipeline>

It doesn't consume the initial input; it starts with someURI.

And another:

  <p:pipeline>
    <p:stage name="validate"/>
    <p:stage name="xinclude"/>
    <p:stage name="validate">
      <p:output href="someOtherURI"/>
    </p:stage>
    <p:stage name="xslt">
      <p:input href="someOtherURI"/>
      <p:param name="stylesheet" href="style.xsl"/>
    </p:stage>
  </p:pipeline>

This pipeline performs validation, xinclude, and validation, then
transforms the result. A clever processor could do the last two steps
in parallel, but it doesn't have to.

This pipeline (probably) fails:

  <p:pipeline>
    <p:stage name="validate">
      <p:output href="someURI"/>
    </p:stage>
    <p:stage name="xinclude"/>
  </p:pipeline>

When the XInclude stage begins, there's no anonymous infoset to
consume, which is probably an error. We could say that the original
anonymous infoset is still available, I suppose. That is, that
consumption isn't destructive. I dunno though.
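Sketched in code, the failure mode is just a check at the start of each stage. Whether the check fires (destructive consumption) or the previous anonymous infoset survives is exactly the open question; the names below are hypothetical.

```python
class Pool:
    # validate wrote its result to someURI instead of producing a new
    # anonymous infoset, so under destructive consumption there is
    # nothing left for the next stage.
    anonymous = None


def xinclude_stage(pool):
    if pool.anonymous is None:
        # The likely behaviour: no anonymous infoset means an error.
        raise ValueError("no anonymous infoset to consume")
    return pool.anonymous
```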

Finally, if we allowed recursion, you could do things like this:

  <p:pipeline>
    <p:stage name="validate"/>
    <p:stage name="xslt">
      <p:input href="someURI"/>
      <p:param name="stylesheet">
        <p:pipeline>
          <p:stage name="xinclude"/>
          <p:stage name="validate"/>
        </p:pipeline>
      </p:param>
    </p:stage>
  </p:pipeline>

This has the somewhat odd consequence of using the anonymous infoset,
initially validated, then xincluded and validated again, as the
stylesheet instead of the input.

Anyway, assuming I haven't overlooked 11 different things, this model
seems to result in fairly straightforward pipelines in the simple
case; it leverages the common idiom of stdin/stdout; it allows
arbitrarily complex pipelines, I think; and it can be optimized both
statically and dynamically.

Oh, and one last thing, this would also be valid:

  <p:pipeline>
    <p:stage name="validate"/>
    <p:stage name="xslt">
      <p:param name="stylesheet" href="style.xsl"/>
    </p:stage>
    <p:stage name="fo-processor">
      <p:output href="somefile.pdf"/>
    </p:stage>
  </p:pipeline>

That is, there's nothing that prevents a stage from producing non-XML.
Note, however, that it can't do this anonymously. What flows through
the pipeline is strictly XML. But there's nothing that prevents stages
from writing and reading non-XML from other URIs if they wish.

Thoughts?

                                        Be seeing you,
                                          norm

-- 
Norman.Walsh@Sun.COM / XML Standards Architect / Sun Microsystems, Inc.

Received on Thursday, 16 February 2006 14:49:06 GMT
