Re: Transform Note Design Decisions from Pratik Datta on 2009-03-28 (public-xmlsec@w3.org from March 2009)

From: Pratik Datta <pratik.datta@oracle.com>
Date: Fri, 27 Mar 2009 19:32:39 -0700
To: Thomas Roessler <tlr@w3.org>
CC: XMLSec WG Public List <public-xmlsec@w3.org>
Message-ID: <49CD8C47.9050009@oracle.com>
I guess I hijacked your original email thread to discuss the overall 
transform issue.

The event stream part of the proposal addresses the streaming requirement,
which is completely separate from the determine-what-is-signed requirement.

In Java, StAX is a popular streaming parser; it is included in JDK 1.6
(http://java.sun.com/javase/6/docs/api/javax/xml/stream/XMLStreamReader.html),
and in C# the XmlTextReader class plays the same role
(http://msdn.microsoft.com/en-us/library/system.xml.xmltextreader.read.aspx).

From these we define an "event stream" model as follows.

    * Unlike a nodeset, the entire event stream is not available all at
      once. Instead there is an "engine", which returns the "events"
      one by one.
    * Here is an example of how an XML document is split up into
      events, with "|" marking the event boundaries:
      <foo a="23">|<bar>|Some|Text|</bar>|</foo>
    * Possible events: StartDocument, EndDocument, StartElement,
      EndElement, Text, ProcessingInstruction, Comment.
    * All the attributes and namespace declarations are read as part of
      the StartElement event.
    * Large text nodes may be split up into multiple Text events.
    * The engine only knows about the event that it is currently
      pointing to; it has no knowledge of the events before or
      after. However, it maintains a namespace context, i.e. at element
      nodes it can be queried to find out about all the namespace
      declarations in context.
    * It goes in a forward-only direction. Calling "engine.next()" will
      make the engine go to the next event.
    * At every position, the engine can be queried to get the current
      event and its details.
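
To make the model concrete, here is a minimal sketch using the StAX
XMLStreamReader as the "engine". The class name EventStreamDemo and the
describe()/listEvents() helpers are made up for illustration; only the
javax.xml.stream calls are real API. Since text may be split across several
events, the number of Text entries is up to the parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class EventStreamDemo {
    // Describes only the event the engine currently points at.
    static String describe(XMLStreamReader engine) {
        switch (engine.getEventType()) {
            case XMLStreamConstants.START_DOCUMENT:
                return "StartDocument";
            case XMLStreamConstants.END_DOCUMENT:
                return "EndDocument";
            case XMLStreamConstants.START_ELEMENT: {
                // Attributes are read as part of the start-element event.
                StringBuilder sb = new StringBuilder("StartElement " + engine.getLocalName());
                for (int i = 0; i < engine.getAttributeCount(); i++) {
                    sb.append(' ').append(engine.getAttributeLocalName(i))
                      .append('=').append(engine.getAttributeValue(i));
                }
                return sb.toString();
            }
            case XMLStreamConstants.END_ELEMENT:
                return "EndElement " + engine.getLocalName();
            case XMLStreamConstants.CHARACTERS:
                return "Text " + engine.getText();
            case XMLStreamConstants.PROCESSING_INSTRUCTION:
                return "ProcessingInstruction " + engine.getPITarget();
            case XMLStreamConstants.COMMENT:
                return "Comment";
            default:
                return "Other";
        }
    }

    // Pulls the events one by one, forward only, and records a description of each.
    static List<String> listEvents(String xml) {
        try {
            XMLStreamReader engine =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            List<String> events = new ArrayList<>();
            events.add(describe(engine));      // a new reader starts on START_DOCUMENT
            while (engine.hasNext()) {
                engine.next();                 // the "engine.next()" step of the model
                events.add(describe(engine));
            }
            return events;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        for (String event : listEvents("<foo a=\"23\"><bar>Some Text</bar></foo>")) {
            System.out.println(event);
        }
    }
}
```

Note that listEvents() collects the descriptions only so they can be printed;
a real consumer would act on each event and then discard it.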



The canonicalization algorithm needs to be defined in terms of this event 
stream. The canonicalization engine should get events one by one, and 
emit octet stream chunks for each event. This way it can work with very 
large documents, without having to keep them all in memory.
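
The shape of such an engine can be sketched as below. This is not a
conformant canonicalization: it ignores namespaces, comments and processing
instructions, and its escaping rules are simplified. The names
StreamingSerializerSketch, chunkFor, writeChunks and serialize are
hypothetical. It only illustrates the idea of producing one octet chunk per
event and writing it out immediately.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StreamingSerializerSketch {
    // Simplified escaping; the real canonicalization rules cover more characters.
    static String escape(String s, boolean inAttribute) {
        String r = s.replace("&", "&amp;").replace("<", "&lt;");
        return inAttribute ? r.replace("\"", "&quot;") : r.replace(">", "&gt;");
    }

    // Turns the event the engine currently points at into its output text.
    static String chunkFor(XMLStreamReader engine) {
        switch (engine.getEventType()) {
            case XMLStreamConstants.START_ELEMENT: {
                // Sort attributes by name so the output does not depend on source order.
                Map<String, String> attrs = new TreeMap<>();
                for (int i = 0; i < engine.getAttributeCount(); i++) {
                    attrs.put(engine.getAttributeLocalName(i), engine.getAttributeValue(i));
                }
                StringBuilder sb = new StringBuilder("<").append(engine.getLocalName());
                for (Map.Entry<String, String> a : attrs.entrySet()) {
                    sb.append(' ').append(a.getKey()).append("=\"")
                      .append(escape(a.getValue(), true)).append('"');
                }
                return sb.append('>').toString();
            }
            case XMLStreamConstants.END_ELEMENT:
                return "</" + engine.getLocalName() + ">";
            case XMLStreamConstants.CHARACTERS:
            case XMLStreamConstants.CDATA:
                return escape(engine.getText(), false);
            default:
                return ""; // document start/end, comments, PIs: ignored in this sketch
        }
    }

    // Emits one octet chunk per event; nothing but the current event is held in memory.
    static void writeChunks(XMLStreamReader engine, OutputStream out)
            throws XMLStreamException, IOException {
        while (true) {
            out.write(chunkFor(engine).getBytes(StandardCharsets.UTF_8));
            if (!engine.hasNext()) {
                break;
            }
            engine.next();
        }
    }

    // Convenience wrapper for small inputs; large inputs would stream to a file or digest.
    static String serialize(String xml) {
        try {
            XMLStreamReader engine =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            writeChunks(engine, out);
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (XMLStreamException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize("<foo b='2' a='1'><bar>Some Text</bar></foo>"));
    }
}
```

In a signature implementation the OutputStream passed to writeChunks() would
typically feed a digest rather than a buffer, so the output never needs to be
held in memory either.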

This event stream can be used to represent a complete document or a 
document subset. But there are some extra considerations for document 
subsets:

    * A document subset can have multiple subtrees, which translates to
      multiple root elements. This is not well-formed XML, but it is
      possible in this model.
    * Attributes are only valid in the context of their element, so this
      model does not allow attributes in the document subset whose
      parent element is missing from the subset.
    * A nodeset that represents a document subset always has a reference
      to the whole document. This is not the case with an event stream
      representing a document subset: in this case only the events of
      the document subset are present. So we need a way to find the
      namespaces and xml: attributes of missing ancestors. The
      namespaces can be obtained from the namespace context.
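
For the namespace part, a small StAX sketch shows the idea. The document, the
class name NamespaceContextSketch and its helper methods are all hypothetical.
The prefix "p" is declared on <root>, which would be a missing ancestor if
only the <p:item> subtree were in the subset, yet the engine can still resolve
it when positioned on <p:item>.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class NamespaceContextSketch {
    static final String DOC =
        "<root xmlns:p=\"urn:example\"><skip/><p:item>x</p:item></root>";

    // Moves forward until the engine points at the first start tag with this local name.
    static XMLStreamReader advanceTo(String xml, String localName) throws XMLStreamException {
        XMLStreamReader engine =
            XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        while (engine.hasNext()) {
            if (engine.next() == XMLStreamConstants.START_ELEMENT
                    && engine.getLocalName().equals(localName)) {
                return engine;
            }
        }
        throw new IllegalArgumentException("no element named " + localName);
    }

    // Number of namespace declarations written on the element itself.
    static int declarationsOn(String xml, String localName) {
        try {
            return advanceTo(xml, localName).getNamespaceCount();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    // Resolves a prefix through the namespace context, which includes ancestors.
    static String resolvePrefixAt(String xml, String localName, String prefix) {
        try {
            return advanceTo(xml, localName).getNamespaceContext().getNamespaceURI(prefix);
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("declared on item: " + declarationsOn(DOC, "item"));
        System.out.println("p resolves to:    " + resolvePrefixAt(DOC, "item", "p"));
    }
}
```

This covers only namespaces. The sketch does not address inherited xml:
attributes, which the namespace context does not carry.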


All the other transforms also need to be defined on top of this model. 
E.g. XPath selection needs to work on this event stream too.
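
As a rough illustration of selection over an event stream, the sketch below
matches a very restricted path form (absolute, child steps only, such as
/foo/bar) by tracking the path of currently open elements. It is not XPath,
and the names PathSelectSketch and selectText are made up. It only shows that
this kind of selection needs no more state than the current position.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class PathSelectSketch {
    // Returns the text content of every element whose path from the root equals 'path'.
    static List<String> selectText(String xml, String path) {
        List<String> results = new ArrayList<>();
        StringBuilder currentPath = new StringBuilder(); // path of the open elements
        StringBuilder text = null;                       // non-null while inside a match
        try {
            XMLStreamReader engine =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (engine.hasNext()) {
                switch (engine.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        currentPath.append('/').append(engine.getLocalName());
                        if (currentPath.toString().equals(path)) {
                            text = new StringBuilder();
                        }
                        break;
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                        if (text != null) {
                            text.append(engine.getText()); // text may arrive in pieces
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if (text != null && currentPath.toString().equals(path)) {
                            results.add(text.toString());
                            text = null;
                        }
                        // Drop the last step: the element has just been closed.
                        currentPath.setLength(currentPath.lastIndexOf("/"));
                        break;
                    default:
                        break;
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return results;
    }

    public static void main(String[] args) {
        String xml = "<foo><bar>Some Text</bar><baz><bar>no</bar></baz><bar>More</bar></foo>";
        System.out.println(selectText(xml, "/foo/bar"));
    }
}
```

Expressions that look backwards or sideways in the document cannot be handled
this way without buffering, since the engine only moves forward.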

Pratik

Thomas Roessler wrote:
> Hi Pratik,
>
> I agree with most of your high-level points, therefore I don't repeat 
> them here. ;-)
>
> On 25 Mar 2009, at 18:23, Pratik Datta wrote:
>
>> Thomas, regarding your nodeset question, I have also been trying to 
>> think of a different model to represent a document subset - the 
>> event stream is a popular model in streaming parsers, but maybe we 
>> need to define our own model. 
>
> I'd like to understand whether we can use an event stream (as 
> specified where?) or whether we'd need to define a separate model.  My 
> sense is that having that framework will go a long way toward 
> understanding what your proposal means in terms of analysis and 
> implementation complexity.
>
> Therefore, if you could shed some more light on that point, that would 
> be most welcome.
>
> Thanks,
> --
> Thomas Roessler, W3C  <tlr@w3.org>
>
Received on Saturday, 28 March 2009 02:33:26 UTC