Re: New Version of XPath Filter from merlin on 2002-04-08 (w3c-ietf-xmldsig@w3.org from April to June 2002)

From: merlin <merlin@baltimore.ie>
Date: Tue, 09 Apr 2002 00:05:12 +0100
To: aleksey@aleksey.com
Cc: John Boyer <JBoyer@PureEdge.com>, reagle@w3.org, w3c-ietf-xmldsig@w3.org
Message-Id: <20020408230512.1AA9143BEA@yog-sothoth.ie.baltimore.com>
Hi Aleksey,

In terms of performance, it is my experience that selecting nodes by XPath
and performing the subtree expansion externally gives between 1.5x and 2x
speedup (in two nonrepresentative tests). Not huge, but bear in mind that
performance _was_ the reason we designed this transform. It's a free win,
and it would seem a practical impossibility for an implementation to do
this more slowly than a general XPath processor.

In terms of the 'user experience' and 'developer experience', the idea of
selecting subtrees is not new. The reference '#foo' evaluates to a single
element which we then expand to include the entire subtree rooted at that
element. The reference '#xpointer(XXX)' effectively evaluates the XPath
expression (see the Expr production) XXX and expands the resulting node
set to be the union of the subtrees rooted at each selected node. The user
must be familiar with this operation to use same-document references, and
the developer must have this code to implement same-document references.

I agree that there is a fine line between using the XPath node set
directly and automatically expanding the subtrees, however the goal
of this transform is to *maximize the performance of common signature
transforms*. We don't have a requirements document; but if we did,
then that would be its essence. And, from experience, I can tell you
that common signature transforms apply to subtrees, they do not apply
to nodes in isolation.

As a side-effect of automatically expanding subtrees, it also happens
that we maximize the simplicity of expressing these common transforms.
To select an element and its children, just use id('foo'). That's
familiar and convenient. I would argue that the alternative,
id('foo')/descendant-or-self::node() is less 'user friendly'. Sure,
you have to read the spec to know what's going on, but that happens
whatever choices we make.

Yes, there are arguments either way, but by the choices we have made, I
think that we are effectively meeting our requirements. If you can
demonstrate that this choice has a negative impact on performance, then
you'll have demonstrated that we are mistaken.

Merlin

r/aleksey@aleksey.com/2002.04.08/15:00:28
>John,
>
>Thanks for your message. After reading it I think I understand your 
>position and
>why this S' scares me. You are absolutelly right that the current XPath 
>1.0 transform
>requires checking every node in the input nodes set and excluding a node 
>if the XPath
>expression is evaluated to "false". However, you can describe the same 
>operation in
>a different way:
>    apply XPath expression to the input document and calculate the 
>result nodes set and
>    input nodes set intersection
>(of course, XPath expression is applied after adding "(//. | //@* | 
>//namespace::*)[XXX]" )
>In this interpretation, the XPath 1.0 transform performance is *the 
>same* as the
>proposed XPath 2.0 transform performance and there is no complicated 
>optimization!
>
>Also if you think about the current XPath transform in this way then you 
>can imagine
>that the only thing you want to add to this transform is to make it 
>general and have
>"intersect" , "union" and "subtract" operations instead of "intersect" 
>only. It seems natural
>and simple to me and I can explain the transform in one phrase (someone 
>said that if
>you could not explain your theory using one phrase then probably you 
>have to think
>more about it :) ).
>
>However, the proposed XPath 2.0 transform is different because of this 
>additional step.
>I could not really evaluate the performance difference of adding this 
>step. As you've noted
>the XPath expressions are different and implementations are different. 
>For some implementations
>it could be faster to add all childs in the XPath expression itself, for 
>some implementations
>it could be faster to add childs using this additional step. And real 
>performance will depend on
>what will be most common situation: application wants to include/exclude 
>only the node or
>the node plus all childs. From my point of view, in such complicated 
>situations a nice old
>"Occam's Razor" rule should work: if the XPath expression evaluator can 
>do the job why
>we need to add this functionality to XPath transform?
>
>
>Aleksey.
>
>
>
>John Boyer wrote:
>
>>Hi Aleksey,
>>
>>-----Original Message-----
>>From: Aleksey Sanin [mailto:aleksey@aleksey.com]
>>Sent: Monday, April 08, 2002 12:03 PM
>>To: John Boyer
>>Cc: merlin; reagle@w3.org; w3c-ietf-xmldsig@w3.org
>>Subject: Re: New Version of XPath Filter
>>
>>
>>I need to think about profiling this scenarios. Probably you are right 
>>about performance.
>>However, personally I don't think that it's a good idea to write general
>>
>>standard with
>>performance as a top priority. My expirience shows that performance 
>>improvements are
>>usually very complicated, not simple to explain and very difficult to 
>>maintain.
>>
>><jb>
>>Actually, Xpath filter v1.0 is what resulted from trying to write the
>>standard without concern for speed.  We tried it this way and found that
>>the right idea has a performance profile that is quite unacceptable.
>>So, we have no other goal than to achieve roughly the same functionality
>>in a more efficient manner.  My experience also shows that performance
>>improvements are complicated and difficult to maintain **in an
>>implementation**.  XPath filter v1.0 could be optimized, possibly to an
>>acceptable level of performance, but only by means that are very
>>difficult to implement and therefore unlikely to be implemented
>>uniformly.  Because the performance profile can become an
>>interoperability issue in this case, it has become necessary to create
>>an architecture in which the default non-optimized version achieves
>>acceptable performance.
>></jb>
>>
>>Calculating S' is one additional step and this step adds complexety to 
>>the proposed scheme.
>>
>><jb>
>>No, it is not an additional step in the sense of performance.  It only
>>appears that way because we separate it out as an operation in the
>>processing model.  However, any XPath evaluation that results in a
>>node-set containing S' plus all descendants of nodes in S' has by
>>definition performed at least as much work as it takes to compute S'.
>></jb>
>>
>>I like the initial idea: describe XPath transform in terms of sets 
>>operations. It really simplifies
>>understanding and makes everything clear. I suggest you to take a look 
>>at this from a "user"
>>point of view: "I've constructed XPath expression but transform added 
>>some nodes to it. Why???"
>>And remember that users never read docs :)
>>
>><jb>
>>We add nodes because that is the definition of what we do.  It is
>>unreasonable to expect that a person will successfully construct a
>>signature filter without knowing something as basic as what nodesets
>>will result.  This is why we *require* familiarity with companion specs.
>>This is also why every spec example shows only the subtree root being
>>selected.
>>
>>For what it's worth, though, even a user like the one you describe would
>>have difficulty messing up this transform.  Suppose the transform
>>expression identifies all subtree nodes, not just the subtree roots.
>>The behavior of the transform would not change.
>>
>>The user would have to write an XPath which tries to include some nodes
>>then exclude some of their children in the same expression.  The idea
>>that the user would try to do this in a text editor rather than a
>>helpful design environment AND that the user would not read a single
>>example-- even if it is after the first transform does not do what
>>he/she wants-- seems to be creating quite the hypothetical user indeed.
>></jb>
>>
>>And I think you agree with me that your solution for the second scenario
>>
>>I described (when only
>>node "b" is required) has worse performance then a solution w/o S'.
>>
>><jb>
>>I'm afraid I wouldn't agree with this either.  My current belief is that
>>running one XPath of the type you describe will run far slower than the
>>two transforms.  XPath has a problem:  it is easy to select nodes, it is
>>hard to exclude nodes.  We are trying to overcome that problem by using
>>Xpath's natural ability for identifying nodes to identify nodes we want
>>to get rid of, then let the semantics of the host language (DSig in this
>>case) do the exclusion because it can be done efficiently.
>>
>>Rather than considering the evaluation of every XPath expression as
>>equal, consider the fact that some Xpath expressions can be far more
>>than twice as costly as other ones.
>>
>>Cheers, 
>>John Boyer
>></jb>
>>
>
>


-----------------------------------------------------------------------------
Baltimore Technologies plc will not be liable for direct,  special,  indirect 
or consequential  damages  arising  from  alteration of  the contents of this
message by a third party or as a result of any virus being passed on.

This footnote confirms that this email message has been swept by
Baltimore MIMEsweeper for Content Security threats, including
computer viruses.
   http://www.baltimore.com
Received on Monday, 8 April 2002 19:05:51 UTC