- From: John Boyer <JBoyer@PureEdge.com>
- Date: Thu, 24 Jan 2002 10:58:12 -0800
- To: "merlin" <merlin@baltimore.ie>
- Cc: <reagle@w3.org>, <w3c-ietf-xmldsig@w3.org>
Hi Merlin, Your point on the complexity of the expression to be evaluated is well-taken. One point of interest however, is that the bound is not, strictly speaking, O(log n). In the worst case it can be very much worse, but cases in practice it is likely to be very much better than O(log n). The reason is that XML documents tend to lack depth relative to the number of nodes. The O(log n) rating would only apply in cases where a constant branching factor is guaranteed in the parse tree, whereas nodes in real XML documents tend to have lots more siblings than descendants. Perhaps more to the point, Donald Eastlake's recent communications with me are most instructive. In them, he pointed out alternatives to some of my earlier expressions, which were a bit kludgy because I knew a lot less about XPath back when they were created. The predicate you mentioned: count(id('to-be-signed') | ancestor-or-self::node()) = count(ancestor-or-self::node()) permits only those nodes whose ancestor count is equal to the ancestor count plus a particular node identified by id. In other words, the predicate permits a node identified by id and the node's descendants. I came up with this before I really understood the issues in capturing namespace and attribute nodes (in the ancestor direction, namespace and attribute nodes take element parents, but these types of nodes do not register as 'children' of that element so the descendant-or-self::node() test does not match them). Only later did I discover the //@* and //namespace::* notation when trying to find some way of expressing the default expression for canonicalization, which should be default include all nodes. Donald combined this with the notion of using the id function to obtain a subtree root, resulting in the logically equivalent expression: (id('to-be-signed')//. | id('to-be-signed')//@* | id('to-be-signed')//namespace::* ) Presumably, the id function will be backed by a hash table, providing constant expected time lookup of the subtree root. The next step after each id() function invocation calls for a linear time traversal of the subtree, performing a constant time test at each node. Finally, as long as the union operations are done intelligently as merge operations, the whole thing comes out as worst case O(n) where n is the number of nodes in the subtree (not in the document!). This should be remarkably faster. As for the other larger expression you included below, I could benefit from a thorough description of its context as there *may* be a way to do it more efficiently, but I'm a little short of time and it is not immediately obvious exactly what it is doing. Give me a bit more context, and I'll bang around in the XPath spec a bit more and see if something pops out. It looks like you are trying to exclude a subtree rather than include a subtree, in which case it may be hard to come up with something because XPath has no native support for set subtract or intersect. We may have to get creative (add a function or something). Cheers, John Boyer PureEdge Solutions Inc. r/JBoyer@pureedge.com/2002.01.23/15:15:35 >Finally, as to the speed issue that the originator is experiencing, I'd >say he is suffering from a non-optimized implementation. The reason is >that an O(n) traversal of the nodes of a parse tree is sufficient for >including all such nodes in a node-set. Since all members of the >node-set will be rendered in a serialization, the cost is O(n). Thus, >the cost of putting all nodes in a node-set should (or can at least) be >commensurate with serialization. If it is taking progressively longer >to calculate the node-set as the XML document grows, then the XPath >implementation needs to be optimized. Alternately, we only require >behavior equivalent to the evaluation of the (//. | //@* | >//namespace::*) expression. The cost is more like O(n log(n)) because our use of XPath as a per-node filter often requires evaluating, for each node in the document, something along the lines of: count(id('to-be-signed') | ancestor-or-self::node()) = count(ancestor-or-self::node()) Here, ancestor-or-self is going to be an O(log(n)) operation. In the originator's case he is trying to do an XPointer and winding up with: URI="", XPath= count((here()/ancestor::aida:eDocument[1]/child::aida:signedContent[1]// . | here()/ancestor::aida:eDocument[1]/child::aida:signedContent[1]//@* | here()/ancestor::aida:eDocument[1]/child::aida:signedContent[1]//namespa ce::*) | self::node()) = count((here()/ancestor::aida:eDocument[1]/child::aida:signedContent[1]// . | here()/ancestor::aida:eDocument[1]/child::aida:signedContent[1]//@* | ere()/ancestor::aida:eDocument[1]/child::aida:signedContent[1]//namespac e:: *)) This is O(log(n)) per node in the document (and you have to admit that there is going to be a hefty constant in front of that O), which is painful as the document size grows. One option (off the top of my head, so I've not given it mich thought) would be to introduce a new XPath intersection transform: <Transform Algorithm="http://www.w3.org/TR/1999/REC-xpath-19991116#intersection"> <XPath> here()/ancestor::aida:eDocument[1]/child::aida:signedContent[1]//. | here()/ancestor::aida:eDocument[1]/child::aida:signedContent[1]//@* | here()/ancestor::aida:eDocument[1]/child::aida:signedContent[1]//namespa ce::* </XPath> </Transform> The result of this would be the intersection of the input node set and the node set that results from evaluating the transform. This would be entirely XPath based (no need for XPtr), and would drop things to O(n) (with a significantly-reduced constant) for doing signature-relative operations. Merlin ------------------------------------------------------------------------ ----- Baltimore Technologies plc will not be liable for direct, special, indirect or consequential damages arising from alteration of the contents of this message by a third party or as a result of any virus being passed on. This footnote confirms that this email message has been swept by Baltimore MIMEsweeper for Content Security threats, including computer viruses. http://www.baltimore.com
Received on Thursday, 24 January 2002 13:58:44 UTC