Streaming C14N vNext requirements

The streaming requirement is not captured very well in the transform 
note document, so let me explain it here; this also answers some of 
Chris Solc's comments from the last meeting.

The current canonicalization is defined in terms of a node-set. In Java, 
the function signature could look like this:

byte[] doCanonicalize(Set<Node> nodeSet)

i.e. it takes an unordered set of nodes, canonicalizes them, and 
produces an array of bytes. The nodes in the node-set are not ordered in 
any way, and the node-set also requires a backing DOM to know the 
parent/child relationships between the nodes. This is what makes 
node-sets inherently unstreamable.
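
For concreteness, here is a rough sketch (illustration only, no real 
canonicalization logic) of why the node-set shape ties the canonicalizer 
to a DOM: the Set hands nodes out in no particular order, so the code 
has to walk the owner Document to recover document order, and it would 
also need getParentNode() chains to work out in-scope namespaces.

import java.io.ByteArrayOutputStream;
import java.util.Set;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Illustration only: no real canonicalization is performed here.
class NodeSetCanonicalizerSketch {

    byte[] doCanonicalize(Set<Node> nodeSet) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // The Set itself has no order, so fall back on the backing DOM
        // and traverse the whole owner document in document order.
        Document doc = nodeSet.iterator().next().getOwnerDocument();
        walk(doc.getDocumentElement(), nodeSet, out);
        return out.toByteArray();
    }

    private void walk(Node node, Set<Node> nodeSet, ByteArrayOutputStream out) {
        if (nodeSet.contains(node)) {
            // Serializing this node would also require its ancestors
            // (node.getParentNode()) for inherited namespace context.
            // (serialization details omitted)
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, nodeSet, out);
        }
    }
}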


With streaming canonicalization, the function signature could look like this:

InputStream setupCanonicalizer(XMLStreamReader reader)

The input is a StAX XMLStreamReader. StAX is a streaming XML parser: it 
represents a document as a sequence of "events", e.g. startElement, 
text, endElement, etc. Attributes and namespaces are returned in the 
startElement event. The StAX event stream is ordered and doesn't need a 
backing DOM, which is why I want to use a mechanism similar to this to 
represent the input to the canonicalizer.
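
For concreteness, this is roughly what consuming a StAX event stream 
looks like with the standard javax.xml.stream API; a streaming 
canonicalizer would process each event as it arrives instead of 
building a tree:

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Standard StAX pull loop: events arrive one at a time, in document
// order, with attributes and namespaces reported on START_ELEMENT.
// No DOM is ever built.
public class StaxEventLoop {
    public static void main(String[] args) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream(args[0]));
        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    // reader.getLocalName(), reader.getAttributeCount(),
                    // reader.getNamespaceCount() are available here
                    break;
                case XMLStreamConstants.CHARACTERS:
                    // reader.getText()
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    break;
            }
        }
        reader.close();
    }
}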

The output is an InputStream, which is Java's way of representing a 
stream of bytes.

Note that this function will not actually canonicalize anything; it 
just sets the canonicalization up. The actual canonicalization happens 
when somebody reads from the returned InputStream: as the InputStream is 
read, the canonicalizer pulls from the XMLStreamReader. I.e. even if 
this function is asked to canonicalize a 1 MB document, it will not 
allocate a 1 MB array in memory; it only needs a small fixed-size buffer 
internally (assuming there is a cap on the size of a single element tag).
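
One way to realize that lazy behavior (a sketch under the assumptions 
above, not a proposed API; LazyC14nInputStream is a made-up name) is an 
InputStream whose read() pulls the next StAX event on demand, writes a 
canonicalized form of just that event into a small per-event buffer, 
and hands those bytes back. Memory use is then bounded by the largest 
single event, not by the document size.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Sketch only: a real canonicalizer must handle namespaces, attribute
// ordering, escaping, etc. The point is the pull model: nothing is
// serialized until the caller reads, and only one event is buffered.
public class LazyC14nInputStream extends InputStream {
    private final XMLStreamReader reader;
    private InputStream current = new ByteArrayInputStream(new byte[0]);

    public LazyC14nInputStream(XMLStreamReader reader) {
        this.reader = reader;
    }

    @Override
    public int read() throws IOException {
        int b = current.read();
        while (b == -1) {                 // current event's bytes exhausted
            if (!advance()) return -1;    // no more events: end of stream
            b = current.read();
        }
        return b;
    }

    // Serialize the next event into a small per-event buffer.
    private boolean advance() throws IOException {
        try {
            if (!reader.hasNext()) return false;
            String piece;
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    piece = "<" + reader.getLocalName() + ">";  // attrs/namespaces omitted
                    break;
                case XMLStreamConstants.CHARACTERS:
                    piece = reader.getText();                   // escaping omitted
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    piece = "</" + reader.getLocalName() + ">";
                    break;
                default:
                    piece = "";
            }
            current = new ByteArrayInputStream(piece.getBytes(StandardCharsets.UTF_8));
            return true;
        } catch (XMLStreamException e) {
            throw new IOException(e);
        }
    }
}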



These Java functions were just to illustrate the streaming 
requirements, which are:

    * The input to the canonicalizer is something that can be
      represented as an XML event stream.
    * The output of the canonicalizer is a byte stream.
    * The canonicalizer should be able to do chunking; it should not be
      required to keep the entire input document in memory.
    * The input to the canonicalizer should not include data that cannot
      be represented by an XML stream; e.g. attributes without their
      owner elements cannot be represented.
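
Putting those requirements together, usage might look like the sketch 
below (hypothetical wiring; it reuses the LazyC14nInputStream sketch 
from above in place of setupCanonicalizer): the document flows from the 
parser through the canonicalizer into a digest in fixed-size chunks.

import java.io.FileInputStream;
import java.io.InputStream;
import java.security.MessageDigest;
import java.util.Arrays;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

// Hypothetical wiring: parser -> streaming canonicalizer -> digest,
// using a small fixed-size buffer regardless of document size.
public class StreamingC14nExample {
    public static void main(String[] args) throws Exception {
        XMLStreamReader events = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream(args[0]));

        InputStream c14n = new LazyC14nInputStream(events);

        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        byte[] buf = new byte[4096];      // fixed-size chunk
        int n;
        while ((n = c14n.read(buf)) != -1) {
            sha256.update(buf, 0, n);
        }
        c14n.close();
        System.out.println(Arrays.toString(sha256.digest()));
    }
}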


Pratik
