Proposal for more efficient C14N from Pratik Datta on 2009-05-05 (public-xmlsec@w3.org from May 2009)

From: Pratik Datta <pratik.datta@oracle.com>
Date: Mon, 04 May 2009 18:17:35 -0700
To: XMLSec WG Public List <public-xmlsec@w3.org>
Message-ID: <49FF93AF.7030003@oracle.com>
Canonicalization for subtrees can be done using a simple and efficient 
algorithm whose performance almost same as plain xml serialization, and 
it produces the same output as the regular canonicalization algorithm. 
This is not really a new proposal, many implementations already do this; 
implementations go to great length to determine if this signature can be 
done in an optimized way, otherwise they fall back to the generic 
nodeset based approach. Do we really need to support the generic 
nodeset? In another email chain I listed all the possible cases 
http://lists.w3.org/Archives/Public/public-xmlsec/2009Apr/0045.html , 
cases 1-4 can be done using the subtree algorithm.

Description of the algorithm

*Data model*
Section 2.1 of the C14n spec talks about the Data Model being based on 
the XPath Nodeset.
For the subtree algorithm the data model is different -  the input is
  a) either a complete document, or a list of elements. Each element 
represents a complete subtrees, the elements are ordered in document 
order, and no element is a descendant of another element in the list.
 b) optionally a set of exclusion elements or attributes.  The exclusion 
elements are not ordered in any way

The algorithm efficiency depends on these lists being small, usually <10

*Processing Model*
The processing model is fairly simple, just do a traversal of each of 
the subtrees, prune traversal when you reach an exclusion node. For each 
node type follow the rules mentioned in Section 2.3 processing model.

For namespace nodes do this
 - At the root of the subtree look combine all the namespace 
declarations of that subtree's ancestors plus that subtree's root node, 
and emit them. Also put them into a hashmap. Now as you traverse the 
tree, whenever you come across a namespace declaration, check if it 
already in the hashmap, if it is not there then emit it, and add the 
namespace declaration to the hashmap. Also if you have added the 
declaration, remember to remove it, once you have finished traversing 
the node.

-------------------------

Why is this faster? This is because it avoids a lot of expensive things 
that the current algorithm does

   1. Does not create a large array to store a nodeset. Remember the
      Nodeset can have tens of thousands of elements, whereas the
      inclusion / exclusion list only has a handful of elements
   2. Does not sort the nodeset  - Sorting the nodeset by document order
      is expensive
   3. Does not traverse the entire document - only traverses the
      required subtrees.
   4. Does not use XPath at all.  Xpath's namespace axis causes an
      explosion of nodes.
   5. Does not use the XPath namespace axis or the ancestor axis at all,
      instead it maintains a current set of namespaces, and adds/removes
      to it, as it is traversing the document. This is much more
      efficient then looking at the ancestor axis for every element.

Pratik
Received on Tuesday, 5 May 2009 01:18:51 UTC