- From: Pratik Datta <pratik.datta@oracle.com>
- Date: Mon, 04 May 2009 18:17:35 -0700
- To: XMLSec WG Public List <public-xmlsec@w3.org>
- Message-ID: <49FF93AF.7030003@oracle.com>
Canonicalization for subtrees can be done with a simple and efficient algorithm whose performance is almost the same as plain XML serialization, yet it produces the same output as the regular canonicalization algorithm. This is not really a new proposal - many implementations already do this; they go to great lengths to determine whether a signature can be processed in this optimized way, and only fall back to the generic nodeset-based approach otherwise. Do we really need to support the generic nodeset? In another email thread I listed all the possible cases (http://lists.w3.org/Archives/Public/public-xmlsec/2009Apr/0045.html); cases 1-4 can be handled by the subtree algorithm.

Description of the algorithm

*Data model*

Section 2.1 of the C14N spec describes a data model based on the XPath nodeset. For the subtree algorithm the data model is different. The input is:

a) either a complete document, or a list of elements. Each element represents a complete subtree; the elements are ordered in document order, and no element in the list is a descendant of another.

b) optionally, a set of exclusion elements or attributes. The exclusion elements are not ordered in any way.

The algorithm's efficiency depends on these lists being small, usually < 10 entries.

*Processing Model*

The processing model is fairly simple: traverse each of the subtrees, pruning the traversal whenever you reach an exclusion node. For each node type, follow the rules in the Section 2.3 processing model. For namespace nodes, do this: at the root of the subtree, combine all the namespace declarations of the subtree's ancestors plus those on the subtree's root node, and emit them; also put them into a hashmap. Then, as you traverse the tree, whenever you come across a namespace declaration, check whether it is already in the hashmap; if it is not, emit it and add it to the hashmap. If you added a declaration, remember to remove it once you have finished traversing that node. (A sketch of this traversal appears below.)

-------------------------

Why is this faster? Because it avoids several expensive things that the current algorithm does:

1. It does not create a large array to store a nodeset. Remember that the nodeset can have tens of thousands of elements, whereas the inclusion/exclusion lists have only a handful.

2. It does not sort the nodeset - sorting the nodeset into document order is expensive.

3. It does not traverse the entire document - only the required subtrees.

4. It does not use XPath at all. XPath's namespace axis causes an explosion of nodes.

5. It does not use the XPath namespace axis or the ancestor axis; instead it maintains a current set of namespaces, adding and removing entries as it traverses the document. This is much more efficient than looking at the ancestor axis for every element.

Pratik
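To make the data and processing models concrete, here is a rough Java/DOM sketch of the traversal described in the message. The class and method names are illustrative, not from any spec or implementation, and it cuts corners that real C14N does not (full attribute and namespace sorting, character escaping, comments, PIs, and the other section 2.3 node types); it is only meant to show the inclusion-list input, the exclusion-based pruning, and the hashmap-based namespace tracking.

```java
import org.w3c.dom.*;
import java.util.*;

// Minimal sketch of the subtree canonicalization traversal; names and
// structure are hypothetical, for illustration only.
public class SubtreeC14nSketch {

    // Canonicalize each subtree root (given in document order),
    // pruning any node found in the exclusion set.
    public static String canonicalize(List<Element> roots, Set<Node> exclusions) {
        StringBuilder out = new StringBuilder();
        for (Element root : roots) {
            // Namespace declarations visible at the root, gathered from
            // its ancestors; these are emitted on the root's start tag.
            Map<String, String> inScope = collectAncestorNamespaces(root);
            emit(root, inScope, exclusions, true, out);
        }
        return out.toString();
    }

    // Walk up from the subtree root, gathering xmlns declarations
    // (the nearest ancestor's declaration wins).
    private static Map<String, String> collectAncestorNamespaces(Element root) {
        Map<String, String> ns = new HashMap<>();
        for (Node p = root.getParentNode(); p instanceof Element; p = p.getParentNode()) {
            NamedNodeMap attrs = p.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Attr a = (Attr) attrs.item(i);
                if (isNamespaceDecl(a)) {
                    ns.putIfAbsent(a.getNodeName(), a.getValue());
                }
            }
        }
        return ns;
    }

    private static boolean isNamespaceDecl(Attr a) {
        return "xmlns".equals(a.getNodeName()) || a.getNodeName().startsWith("xmlns:");
    }

    private static void emit(Element e, Map<String, String> inScope,
                             Set<Node> exclusions, boolean isRoot, StringBuilder out) {
        if (exclusions.contains(e)) return;            // prune excluded subtrees

        out.append('<').append(e.getNodeName());
        if (isRoot) {
            // Emit the inherited declarations on the apex element,
            // sorted by prefix.
            for (Map.Entry<String, String> d : new TreeMap<>(inScope).entrySet()) {
                out.append(' ').append(d.getKey())
                   .append("=\"").append(d.getValue()).append('"');
            }
        }

        // Declarations this element introduces or shadows, so the hashmap
        // can be restored after its subtree has been traversed.
        Map<String, String> shadowed = new HashMap<>();
        NamedNodeMap attrs = e.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Attr a = (Attr) attrs.item(i);
            if (exclusions.contains(a)) continue;      // skip excluded attributes
            if (isNamespaceDecl(a)) {
                // Emit only if not already in scope with the same value.
                if (!a.getValue().equals(inScope.get(a.getNodeName()))) {
                    out.append(' ').append(a.getNodeName())
                       .append("=\"").append(a.getValue()).append('"');
                    shadowed.put(a.getNodeName(),
                                 inScope.put(a.getNodeName(), a.getValue()));
                }
            } else {
                out.append(' ').append(a.getNodeName())
                   .append("=\"").append(a.getValue()).append('"');
            }
        }
        out.append('>');

        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) {
                emit((Element) c, inScope, exclusions, false, out);
            } else if (c.getNodeType() == Node.TEXT_NODE) {
                out.append(c.getNodeValue());          // real C14N escapes &, <, > and CR
            }
        }
        out.append("</").append(e.getNodeName()).append('>');

        // Restore the hashmap before returning to the parent; a null
        // previous value means the declaration was new at this element.
        for (Map.Entry<String, String> s : shadowed.entrySet()) {
            if (s.getValue() == null) inScope.remove(s.getKey());
            else inScope.put(s.getKey(), s.getValue());
        }
    }
}
```

One detail worth noting in the restore step at the end of emit(): rather than simply removing declarations added at an element, the sketch restores any value they shadowed, which matters when a descendant re-declares a prefix that an ancestor had already declared with a different value.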
Received on Tuesday, 5 May 2009 01:18:51 UTC