RE: Sorting of DOM input to c14n 2.0

One of the important goals of C14N 2.0 is performance, even in DOM. And for that I made sure that we don't traverse any of the nodes that we don't have to. Say you are only signing 10% of the document, it would be very inefficient if you need to traverse the whole document just to sign a small part of it.  So I didn't include depth first traversal of the whole document anywhere in the algorithm - in fact this is the reason that I did not use the algorithm mentioned in 3.1 of exclusive canonicalization http://www.w3.org/TR/xml-exc-c14n/#sec-Implementation because that  traverses the whole tree. And note that C14N 2.0 prunes the tree traversal for exclusions, so there too we don't unnecessarily iterate through nodes that are excluded.

 

 

Comparing the document order of two nodes also does not require traversing the whole document.

E.g. suppose you have two nodes E--1 and E2

.         Walk up the ancestor of each of the nodes, all the way upto the document root node R. i.e. you have { P0, P1, P2,... E1 }  as the ancestors for E1 and { Q0, Q1, Q2,... E2 } as the ancestors for E2 where P0 = Q0 = R the document root. The list of ancestors is limited by the depth of the tree, and usually XML documents have very shallow depths.

.         Now these two elements must have a common ancestor - which maybe the document root node. To find the common ancestor, compare the first part of the lists  i.e. in the list find the last Qn which is same as Pn, n may be 0 indicating that the only common ancestor is the document root.

.         Iterate through all the children of this common ancestor (Pn ) to find out which comes first Pn+1 or Qn+1 . XML documents are often very long and this common ancestor may have a thousand children and you need to iterate through all of them, but still you are not going through the entire document - just all the children of one node.

 

 

For sorting n  nodes, you need to do O(n lg n) comparisons, i.e. you would have to this above sequence of steps that many times, So if n is large it would be better to simply do a traversal of the whole tree. But I have seen in practice 99% of times one Reference signs one subtree only, it would be very strange for someone to have large number of subtrees. (One reason for signing multiple subtrees is to prevent wrapping attacks. E.g. in the SOAP case instead of using an id based reference to the soap:Body, it would be recommended to use an XPath to refer to the soap:Body, so that if someone maliciously inserts another soap:Body somewhere that would also be included in the selection thereby thwarting the attack. In this case too the number of subtrees to be signed is very small)

 

Based on this I recommend that we don't do a whole tree traversal, but instead do a sorting by document order.

 

 

 

Things are very different in streaming.

 

In DOM at first the document is parsed into a DOM tree.  Then Xpaths are evaluated to get a list of element nodes. Then canonicalization is applied to these nodes.

 

But with streaming, all these steps are "pipelined". I.e. while the document is parsed, each node is immediately sent to the XPath engine which decides whether it is in the subset or not, and sends it immediately to the c14n engine, which sends canonicalizes and sends to the digestor. All these operations - document parsing, xpath evaluation, canonicalization, digesting are happening in parallel and no point you have the complete intermediate result.  The whole goal of streaming is to be amenable like XML gateways which need to very quickly inspect and process documents as they are passing by.

 

 

 

Pratik

 

 

 

 

-----Original Message-----
From: Scott Cantor [mailto:cantor.2@osu.edu] 
Sent: Tuesday, April 13, 2010 9:15 AM
To: public-xmlsec@w3.org
Subject: Sorting of DOM input to c14n 2.0

 

A couple more thoughts on this:

 

I was mistaken about there not being a DOM function to do sorting

(compareDocumentPosition) though it's DOM L3.

 

But I also still wonder why this is needed, given that it's possible to

impose the order by doing a depth-first traversal of the document anyway

(visit each node, if it's the root of an input subset, activate c14n and

start the algorithm). No need for extra processing.

 

So I'm still not convinced we need to include sorting as an explicit part of

the DOM processing rules, rather than just requiring that the output in fact

be in document order.

 

I didn't understand the algorithm for fast sorting that Pratik was

suggesting, but as a matter of avoiding over-specifying anyway, I would

personally suggest avoiding sorting as an explicit step.

 

-- Scott

 

 

 

 

Received on Tuesday, 13 April 2010 17:38:35 UTC