C14N 2.0 review comments from Scott, Meiko and Ed

I have incorporated most of the editorial comments from Scott, Meiko and Ed. These are the ones that need further discussion. Related actions: ACTION-550 and ACTION-554.

* Scott's comments 1-14
* Ed's comments 15-19
* Meiko's comments 20-25



Scott's comments
================

1. >> Abstract: Suggest rephrasing "incorporates an update to Exclusive..." to explain that it's now a single algorithm for both. Perhaps reword first sentence to say that it's a major rewrite of both Canonical XML 1.1 and Exclusive Canonical XML 1.0.

How about this:

Original: 
Canonical XML Version 2.0 is a major rewrite of Canonical XML Version 1.1 to address issues around performance, streaming, hardware implementation, robustness, minimizing attack surface, determining what is signed and more. It also incorporates an update to Exclusive Canonicalization, effectively a 2.0 version, as well.

Modified:
Canonical XML Version 2.0 is a major rewrite of Canonical XML Version 1.1 and Exclusive Canonical XML 1.0 to address issues around performance, streaming, hardware implementation, robustness, minimizing attack surface, determining what is signed and more. It combines the inclusive and exclusive algorithms into a single algorithm that takes the canonicalization mode as a parameter.
--------------------------------------------


2. >> Sec 1.3:  Is the last sentence still operational? It suggests XML-INFOSET is "under development" but the reference seems to be to a W3C Rec.

Original:
While these limitations are not severe, it would be possible to resolve them in a future version of XML canonicalization if, for example, a new version of XPath were created based on the XML Information Set [XML-INFOSET] currently under development at the W3C.

Modified:
Remove entire paragraph.
-----------------------------------------------

3. >> The text says what the DOM Model XML subset is, but not the streaming case. I'm actually somewhat unclear on what they formally are in any case. Is En intended to connote an actual DOM node in a DOM tree? If so, I would clarify that. If this isn't DOM, what is the equivalent for SAX? Is it essentially an XPath that you have to dynamically evaluate as you go?

Don't the terms "Element node" and "Document node" signify DOM? When I say DOM, I don't strictly mean the W3C DOM, because there are other lightweight DOM implementations too. Maybe we should use a different term instead of DOM.

For SAX, you are right: you will have an XPath that you need to evaluate dynamically.
-----------------------------------------------
	

4. >> Would it be clean enough (and simpler) to collapse the xmlXAncestors parameters into a single parameter and just apply "combine" to only xml:base? Is there a need to use different rules for different attributes?
Seems like the various "modes" sort of go together given how the earlier algorithms work.

How about a new parameter "xmlAncestors" whose values can be:
"inheritAll" : Simulate the Canonical XML 1.0 behavior, which inherits all the xml: attributes
"inherit" : Simulate the Canonical XML 1.1 behavior, where you inherit the inheritable attributes and combine the xml:base
"none" : Simulate the Exclusive Canonical XML 1.0 behavior
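
A minimal sketch of how the three proposed values might differ, just to make the proposal concrete. The function name and data shapes are hypothetical and not from the draft. It only looks at ancestors of the apex element, and it uses Python's urljoin as a stand-in for the xml:base combination that Canonical XML 1.1 defines.

```python
from urllib.parse import urljoin

# xml:lang and xml:space are the simply inheritable attributes in C14N 1.1;
# xml:id is not inherited and xml:base is combined rather than copied.
INHERITABLE = {"lang", "space"}

def inherited_xml_attrs(ancestor_attrs, mode):
    """Return the xml:* attributes an apex element picks up from its ancestors.

    ancestor_attrs: list of dicts, outermost ancestor first, mapping the
    local name of each xml:* attribute to its value (hypothetical shape).
    mode: one of the proposed xmlAncestors values.
    """
    if mode not in ("inheritAll", "inherit", "none"):
        raise ValueError("unknown xmlAncestors value: %r" % mode)
    result = {}
    if mode == "none":
        return result  # Exclusive C14N 1.0 style: nothing is inherited
    for attrs in ancestor_attrs:
        for name, value in attrs.items():
            if mode == "inheritAll":
                result[name] = value  # nearest ancestor wins, no combining
            elif name in INHERITABLE:
                result[name] = value
            elif name == "base":
                # Stand-in for the C14N 1.1 join of xml:base values.
                result["base"] = urljoin(result.get("base", ""), value)
    return result
```

With ancestors carrying xml:base="http://example.com/a/" and then xml:base="b/", "inherit" yields the combined base, "inheritAll" yields only the nearest value, and "none" yields nothing.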
------------------------------------------------


5. >> Regarding xsiTypeAware, I would still like to see this expanded to something at least a little more generic and just allow a list of qualified node names to treat as QName-valued. Or perhaps leave xsiTypeAware and just add a separate parameter for this, if it's important for conformance to make this one MTI but not the other. Speaking for myself, I don't know that I would want to implement prefix rewriting, but I really could use the ability to handle QNames in other places.

Separate thread discussing this.
--------------------------------------------------------


6. >> I assume the "named parameter sets" part is TBD, and we need to decide what the sets are and what the MTI options are. Do we have somebody willing to make a proposal on that? I guess I would be willing to define something I could see using in profiles I'm involved with.
--------------------------------------------------------


7. >> Forgive my ignorance (I haven't ever implemented c14n), maybe I'm overlooking the obvious...but is it necessary or even desirable to sort the inclusion list or detect children of other nodes up front? Can't that be derived on the fly to avoid more than one tree walk? e.g. do a traversal and switch "on" when an element is a hit in the hash list, pull out descendants in the list as you find them, etc.

I know we want to be abstract about implementation, but at the same time we may be getting back into the problem of naïve implementations.


Separate thread discussing this.
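
For reference in that thread, here is a rough sketch of the single-walk idea Scott describes, assuming an in-memory tree (Python's ElementTree here) and an inclusion list held as a set of element objects. The function name is made up, and exclusions and serialization are left out.

```python
import xml.etree.ElementTree as ET

def walk_selected(root, included):
    """Return the tags of all selected elements in document order.

    included: set of Element objects (the inclusion list). An element is
    selected if it, or any ancestor, is in the set. No up-front sorting or
    nesting detection is done; nested entries are absorbed during the walk.
    """
    out = []
    stack = [(root, False)]
    while stack:
        elem, on = stack.pop()
        on = on or elem in included  # switch "on" at a hit, stay on below it
        if on:
            out.append(elem.tag)
        # Push children in reverse so they are popped in document order.
        for child in reversed(list(elem)):
            stack.append((child, on))
    return out

doc = ET.fromstring("<a><b><c/></b><d><e/></d><f/></a>")
picked = {doc.find("b"), doc.find("b/c"), doc.find("d/e")}
print(walk_selected(doc, picked))
```

Here "c" is in the list but is also a descendant of "b"; the walk emits it once without any prior analysis of the list.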
--------------------------------------------------------

8. >> Sec 2.4: I would consider moving this section up into section 1. It seems like motivating material for the overall package of features, and could even be supplemented by additional sections that motivate some of the other options if we're so inclined.


This section is copied almost verbatim from the Exclusive Canonicalization 1.0 spec, so it is not really motivating anything specific to 2.0.
--------------------------------------------------------


9.  >> Sec 2.5: Per my earlier comment, I think we need a reference to this section in the main processing rules to provide context.
As a general comment, I'm not sure it's helpful to distinguish Explicit/implicit here, but if we did, I think the key point is not that some DOM serializers will add them for you but that the DOM itself will not include them when the node is created. I think you're trying to say that implementations need to account for this, but if that's the case, we probably would need to reference the distinction somewhere in the processing rules, and I don't see that now. Maybe you just need to add language referring to "both explicit and implicit" in some of the later text.

Original Text:
Namespace Nodes- Process the namespace node N in the same way as an attribute node.

Modified Text:
Namespace Nodes- Take the ordered list of namespace nodes resulting from <a href="#sec-Namespace-Processing">namespace processing</a>, and process each namespace node <code>N</code> in the same way as an attribute node.

New text in beginning of 2.5 Namespace processing:

As part of the canonicalization process, while traversing the subtree, use the following algorithm to look at all the namespace declarations in an element, and decide which ones to output.

Original Text:
Find a list of namespace declarations that are in scope for this element <code>E</code> by looking at namespace declarations in this element and its ancestors.

Modified Text:
Find a list of namespace declarations that are in scope for this element <code>E</code> by looking at both implicit and explicit namespace declarations in this element and its ancestors.
--------------------------------------------------------


10. >>  In the sequential text, is sorting the URIs well-defined? Do we need a formal reference on that? Cue rathole...3,2,1

The sorting is independent of prefix rewriting: even if prefixRewrite=none, the namespaces need to be sorted by URI. Sorting was already there in Canonical XML 1.1.
In sequential mode, the prefix sequence numbers are assigned in sorted order.
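
A small sketch of what "assigned in sorted order" could look like. It assumes plain code point ordering of the URI strings (which is what Python's sorted() does for str) and an "n" + number prefix format; both are assumptions for illustration, not text from the draft.

```python
def sequential_prefixes(namespace_uris, first=1):
    """Map each distinct namespace URI to a sequential prefix.

    The URIs are sorted first (code point order, an assumption), then
    numbered in that order, so the result does not depend on the order
    in which the declarations appeared in the input.
    """
    ordered = sorted(set(namespace_uris))
    return {uri: "n%d" % i for i, uri in enumerate(ordered, start=first)}

# Placeholder URIs; input order does not matter.
print(sequential_prefixes(["urn:example:b", "urn:example:a", "urn:example:b"]))
```

Whether code point order is the right formal definition for the sort is exactly the question Scott raises; the sketch just picks one.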
-------------------------------------------------------


11. >> Silly question...do we really need the complexity of digest-based rewriting?
If we do, is there a simpler way? Maybe just hex-encode the digest octets?
Yes, it's a bit longer, but it's also faster and easier...

Separate thread on this topic
--------------------------------------------------------

12. >> Didn't exactly follow the second note about exclusive c14n and the rewriting. That doesn't seem likely given that exclusive doesn't change the fact that you only output it once for a given subtree...what's the case being worried about here?

Here is an example.

Original XML
<wsse:Security xmlns:wsu="..." xmlns:wsse="...">
   <wsse:UserName wsu:Id="i1">
   ...
   </wsse:UserName>
   <wsse:Timestamp wsu:Id="i2">
    ...
   </wsse:Timestamp>
</wsse:Security>


Note how the "wsu" prefix declaration is present on the Security element but is not used there. So exclusive canonicalization 1.0 will push the declaration down into UserName and Timestamp, where it is really used, i.e. the wsu declaration will be output twice, once in UserName and again in Timestamp, as shown below.
<wsse:Security xmlns:wsse="...">
   <wsse:UserName xmlns:wsu="..." wsu:Id="i1">
   ...
   </wsse:UserName>
   <wsse:Timestamp xmlns:wsu="..." wsu:Id="i2">
    ...
   </wsse:Timestamp>
</wsse:Security>




Now observe what happens with prefix rewriting: the wsu namespace is emitted twice, but each time with a different prefix ("n2" and "n3"), as shown below.

<n1:Security xmlns:n1="...">
   <n1:UserName xmlns:n2="..." n2:Id="i1">
   ...
   </n1:UserName>
   <n1:Timestamp xmlns:n3="..." n3:Id="i2">
    ...
   </n1:Timestamp>
</n1:Security>
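
The effect above can be reproduced with a toy model, assuming the rewriter hands out a fresh sequential prefix for every declaration that exclusive canonicalization emits. The function name and the urn:example:* URIs are placeholders (the real URIs are elided above), and this is only a sketch of the behavior the note warns about.

```python
from itertools import count

def rewrite_emitted_declarations(emitted):
    """Assign a new sequential prefix to each emitted namespace declaration.

    emitted: list of (element_name, [namespace URIs declared on that
    element in the exclusive-canonical output]), in document order.
    """
    counter = count(1)
    return [
        (name, {uri: "n%d" % next(counter) for uri in uris})
        for name, uris in emitted
    ]

# Declarations as exclusive c14n emits them for the example above.
emitted = [
    ("Security", ["urn:example:wsse"]),
    ("UserName", ["urn:example:wsu"]),
    ("Timestamp", ["urn:example:wsu"]),
]
for name, decls in rewrite_emitted_declarations(emitted):
    print(name, decls)
```

The same wsu URI ends up bound to two different prefixes because it is declared twice in the output, even though it was declared only once in the input.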

--------------------------------------------------------


13. >> In section 4.1, the sort process is somewhat unclear to me. It seems like it would take a full tree walk, and since I can't think how the inputs in the DOM case could be other than logical pointers to actual DOM nodes, I don't see why would need to sort them ahead of time. SAX is different, but the sorting is clearly implicit there, right?

Separate thread on sorting.

--------------------------------------------------------


14. >> Section B.1 
Is C14N 1.x a normative reference? Probably informative, no? Same for the XPath Filter?
Section B.2?
Are URI and XMLBASE normative?


What are the criteria for normative vs informative references?

--------------------------------------------------------

Ed's comments
=============



15. >> 1. Should we use the term "textual representation" rather than "physical representation" when describing XML documents?

This paragraph was copied from C14N 1.0; "physical representation" was already there. Personally I like "physical representation" better, because we are really talking about the regular XML representation as opposed to a different kind of physical representation like EXI or Fast Infoset.
--------------------------------------------------------

16. >> 3. Re "ignoreDTD". We probably discussed this before, but would  someone remind me why we are not covering XML Schema (and other schema languages) even though XML Schema is used to define Canonical XML Version 2.0.

>> Frederick:  Perhaps we should rename this parameter to "noDocumentPreProcessing"  
to indicate no default attribute processing, no attribute normalization and no processing of entities, in a definition that is not DTD specific.


We discussed this a bit on the call. The problem is that DTDs are often embedded in the XML document, whereas XML schemas are always external. This option was meant only for embedded DTDs. If we want it to apply to schemas, we need to think it through carefully. For example, entities cannot be declared in XML Schema; they can only be declared in a DTD. Also, XML Schema defines normalization for element values too, whereas DTDs define normalization only for attributes. "noDocumentPreProcessing" is not completely accurate either, because at least one preprocessing step still needs to be done: converting CR-LF to LF.
--------------------------------------------------------



17. >> 4. The document talks about XML "parser(s)" but I believe XML processor is the preferred term in W3C documentation.

I see that we are using "DOM parser" and "Streaming parser". These are not very precise terms. By DOM parser I mean an XML parser that keeps the XML as a tree structure in memory and lets you traverse it. I do not mean that one has to use the W3C DOM API; there are other APIs for this tree structure too. Maybe we could use a term like "XML infoset processor", although that may be more confusing.

The term "XML processor" is sometimes used to mean software that can read/parse XML as well as write XML. That is why I used "parser".


--------------------------------------------------------

18. >> 6. The first use of "DOM" needs to include a definition.

As I explained in the previous comment, this is not a direct reference to the W3C DOM API. So maybe we need to come up with a different term instead of DOM.


--------------------------------------------------------

19. >> 7. In section 2.5, explain for "prefixRewrite="digest"" that the value of using this option is that namespace prefixes will be identical  across documents and contexts whereas the "sequential" option may  result in different namespace prefixes in different contexts.

>> Frederick: why is this valuable, if the output of canonicalization in signature is fed into the digesting operation? I believe we may have decided against using the canonicalized form as a interchange format - so I think we need to be clear of the requirements here.


I think it is good to illustrate the pros and cons of sequential vs digest. Scott also had a comment around it, so I have added an example to illustrate it.
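
Separately from the example in the draft, here is a rough sketch of the property Ed describes. The digest scheme shown ("n" followed by the first hex digits of a SHA-256 over the URI) is purely an assumption for illustration, in the spirit of Scott's "just hex-encode the digest octets" suggestion; it is not the encoding in the draft. The sequential scheme is likewise simplified to sorting and numbering.

```python
import hashlib

def digest_prefix(namespace_uri, length=8):
    """Hypothetical digest-based prefix: depends only on the URI itself."""
    hexdigest = hashlib.sha256(namespace_uri.encode("utf-8")).hexdigest()
    return "n" + hexdigest[:length]

def sequential_prefix(namespace_uri, all_uris):
    """Simplified sequential prefix: depends on which other URIs are present."""
    return "n%d" % (sorted(set(all_uris)).index(namespace_uri) + 1)

wsu = "urn:example:wsu"  # placeholder URI
context_a = [wsu]
context_b = ["urn:example:other", wsu]

# Sequential: the prefix for the same URI shifts with the context.
print(sequential_prefix(wsu, context_a), sequential_prefix(wsu, context_b))
# Digest: the prefix is identical in every document and context.
print(digest_prefix(wsu))
```

The cost on the digest side is longer prefixes and a hash computation per namespace, which is the trade-off Scott and Frederick are asking about.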


--------------------------------------------------------

Meiko's comments
================


20. >> Sec 2.2, "serialization": is it useful to restrict that value to two predefined Strings ("XML", "EXI")? I'd think there are other serializations out there (binaryXML, compressed or the like), maybe use an URI identificator instead? Just for being able to extend this in future versions...

That is a good idea. Can someone propose the URIs? I will add them in.
--------------------------------------------------------

21. >> Sec 2.2 (end): Are these three named parameter sets going to be the only ones? How are they referred to? Are they just listing the deviations from the defaults or must they be completed? However, I'd add one profile for "streaming-based canonicalization", setting all options to best support for these.

We need to define the named parameter sets, and yes, we need to define the syntax for referring to them too.
--------------------------------------------------------

22. >> 2.4.2 2nd example: what is the "xml:lang="fr"" good for? the example is complex enough, I'd vote for removing as much as possible here...
Is there a rationale behind the use of empty lines in the examples? They tend to move...

I copied these examples unchanged from Exc C14N 1.0
--------------------------------------------------------


23. >> 2.5 "default namespace": does this imply the canonicalized XML to contain sth. like ## xmlns:="http://default" ## ?? That colon would irritate me!

There is no colon after xmlns
--------------------------------------------------------


24. >> 2.5 Regarding the TBD item here: if used correctly this would fend the namespace injection issue. However, it requires very careful configuration on signature creation, and the responsible configurator must be aware of the issue. This is not something we could address with reasonable default values. For all types of XPath use we know about (i.e. IncludedXPath and ExcludedXPath from XMLDSIG2.0) I'd strongly recommend requiring this explicitly.


It is not trivial to parse an XPath and determine the prefixes being used, so maybe we should not make this a MUST.
--------------------------------------------------------


25. >> A.Remove Dot segments: this is about the algorithm of Section 2.7, not 2.4. Right? Do we really need this in such volume/complexity here?

I copied this verbatim from C14N 1.1
--------------------------------------------------------



Pratik

Received on Tuesday, 27 April 2010 01:55:30 UTC