RE: Alternative XPath Filter AND RE: Performace data and comparison from John Boyer on 2002-05-22 (w3c-ietf-xmldsig@w3.org from April to June 2002)

From: John Boyer <JBoyer@PureEdge.com>
Date: Tue, 21 May 2002 17:20:42 -0700
To: "Christian Geuer-Pollmann" <geuer-pollmann@nue.et-inf.uni-siegen.de>, "merlin" <merlin@baltimore.ie>
Cc: <reagle@w3.org>, <w3c-ietf-xmldsig@w3.org>
Message-ID: <7874BFCCD289A645B5CE3935769F0B5232874D@tigger.PureEdge.com>
Hi Christian,

I apologize that I could not get around to this sooner, though I can
hardly believe the lucky set of circumstances that freed up time today.

Out of respect for the amount of effort that you have put into proposing
the alternative subtree-labelling scheme, it seemed only fair to have a
closer look at it.  In particular, I found compelling your statements
about the opportunity to optimize.

Firstly, I think there is an easily correctable and small error in the
processing model you proposed some time ago.  You proposed an 'include'
variable that would track the default behavior to be used on unlabelled
nodes.  This needs to be more like a stack, or alternately the include
variable needs to be reset at a node before visiting each of its
children, not just before visiting the first child.  The reason is that
the default include setting for the nodes of a subtree is corrupted if
one descends into a subsubtree that must use the include variable to set
the default for its descendant nodes.

Other than that, the summative thought that I have on your method is
that most of the power to optimize seems to be if one is doing multiple
transformations *under the assumption* of a processign data model that
we unfortunately do not have.  The problem is that the processing model
allows for octet streams or node-sets to be passed from transform to
transform.  To take advantage of optimized tree labeling, one would have
to augment the node-set with the labels. 

One alternative, which appears in your early proposals, is to include
multiple XPath elements in the same transform.  Within the transform, we
can do whatever we like to augment the node-set.  This is where I see
your proposal deriving the most benefit, but I also believe that subtree
set operations could be similarly optimized.  In other words, the tree
labeling concept seems to be a clever way to optimize multiple set
operations performed in the same transform.  Naturally, the processing
details and the semantics of the labels would differ, particularly to
account for the order in which the labels were affixed.  For example, a
union of a subtree appearing after a subtract means something different
than a union before a subtract.  Intersection is perhaps the most
difficult, but offhand I don't see why the details cannot be properly
worked out if each subtree label is augmented with an 'order' integer.

This leaves us with an interesting thought.  The DSig processing model
states that the application must behave 'as if' node-sets are being
passed from transform to transform, not that this is what must be done
in fact.  Thus, whether your suggested semantics are used or whether we
stick to subtree set operations, it seems to me that an implementation
can be created which is based solely on a labelling method, at least
until a non-XPath transform is reached.

So, what does your proposed method do that we cannot achieve?  The
answer appears to be that it is easier to work out the rules for a
labelling optimization for multiple transformations because the
semantics are simpler.  However, with particular regard to the case of
one or two xpath filter transforms, esp. if it is a subtract on the
whole document, the optimization for set operations is not as complex.

In turn, I think this answers the highly compelling question you raised,
which is whether the existing set-based operations could be used to
achieve time(transform+c14n) = time(1--3 
xpath evals)+time(c14n) if there is c14n directly after the current
xfilter transform.

As far as I can tell, the most important transformation sequence to
optimize is a subtract followed by a possible union indicating
subsubtrees of the subtrees that were subtracted from a full document.
In this case, the subtract nodes are labelled exclude-but-search and the
union nodes are labelled 'include' (if there is no following union, the
subtract nodes can simply be labelled 'exclude').  The second most
important case is an intersect followed by a possible subtract.  The
intersect nodes are marked 'include-but-search' and the subtract nodes
are marked 'exclude' (again, if there is no following subtract, then the
intersect nodes can be marked as 'include').  In the first case, the
default global 'include' is true and in the second case the global
default is false.

Regardless of the case, a c14n could utilize the labels in lieu of a
node-set, with the result meeting the time bound you specified.

Finally, due to the fact that we are still not getting quite the
performance I had hoped for out of all implementations, I think that we
should mandate some of these optimizations.

Thanks,
John Boyer



-----Original Message-----
From: Christian Geuer-Pollmann
[mailto:geuer-pollmann@nue.et-inf.uni-siegen.de]
Sent: Wednesday, May 15, 2002 2:00 AM
To: merlin
Cc: reagle@w3.org; John Boyer; w3c-ietf-xmldsig@w3.org
Subject: Re: Performace data and comparison 


Hi Merlin,

I must disagree that the proposed XPath filter is THAT intuitive: my 
biggest problem was to call that transform a "filter". To me, a filter
is a 
transform which subsequently "removes" something (the filtered-out 
material) from the to-be-filtered material. So the result is always 
"smaller" than the input. The xfilter2/union operation is very strange
with 
respect to that model [1]. People must be VERY careful that the do not 
construct steps which re-include material the previously filtered out
and 
which must not be signed. A e.g. union(/) would destroy the results of
all 
previously done transforms. The same goes for comments. I use a URI=""
but 
the output contains comments. This does not really add clarity.

But xfilter2 must have the union selection, otherwise it wouldn't be 
possible to select arbitrary content.

About grouping xfilter2 transforms: It's a nice to have, but I've seen
that 
there is no way to use this for optimization. (OK, a stuupid union(/) at

the end would allow 'optimization', but if the steps are written good, 
that's not possible).

My labeling proposal is NOT that complicated:

1: Clarity: I did not write a fully-flavoured spec for that idea like
you 
did, so well, OK, it lacks clarity. Would clarity bee added if I write a

full w3c-like document? I did not do till now because I've seen great 
resistance to even have a closer look at this babe.

2: Intuitiveness: The output result node set can only contain nodes
which 
have also been in the input node set. That's what I understand a
transform 
(and a filter) works like.

3: Simplicity: I directly included the processing model for the
implementor 
in my proposal. Maybe that's too complicated for users. But the model 
itself is totally simple: For a given node C in the input node set, it
must 
be checked whether he's in the output or not. Label the tree. Walk an
the 
ancestor-or-self axis of the context node C. The first label decides: if

it's exclude or excludeButSearch, omit that node; if it's include,
include 
it. That's it. End of magic. Not simple?

4: *real-world* problems: Sort of killer argument. Personally, I'm at
the 
university, I do research on that topic, I don't deploy business
solutions 
-> I can't supply ANY real-world problems which you require.

5: Preformance: What costs time in my implementation is the number of
XPath 
evaluations which have to be done. This number ranges (for my approach) 
between a minimum of one (1) and a maximum of three (3), while the 
third-one is a tradeoff between document size (time to traverse a 
fully-excluded subtree) and and the time required to evaluate one XPath.
So 
REQUIRED are a maximum of two (2) XPath evaluations.

5a: The super-duper chance for optimization is in the combination with 
c14n: If there is c14n directly after my transform, both can be done in
a 
single step, with the following timing: time(transform+c14n) = time(1--3

xpath evals)+time(c14n). NO additional costs. Can xfilter2 do that?.

5b: The xfilter2 transform enforces that the number of XPath evaluations
is 
the same as the number of transforms.

Regards,
Christian

[1] 
http://lists.w3.org/Archives/Public/w3c-ietf-xmldsig/2002JanMar/0250.htm
l

--On Montag, 13. Mai 2002 20:10 +0100 merlin <merlin@baltimore.ie>
wrote:

> Hi Christian,
>
> The proposed XPath filter is clear and intuitive and it
> effectively and efficiently solves the problems that it sets
> out to address. I think it might be a good idea to allow
> multiple XPath children of the filter transform; that would
> allow a sequence of steps to be logically grouped, and might
> help implementers perform optimizations.
>
> In my opinion, your labeling proposal lacks the clarity and
> simplicity of the XPath filter and is not addressing a problem
> that needs solving, based on my understanding of the needs
> of current XML signature applications.
>
> If the purpose of this transform was to solve the types of
> problem that you are posing, then I would grant that a pure
> set-based solution is not as efficient as possible, and that
> your proposal would merit consideration.
>
> However, that is not the purpose of this transform. Your
> own data show that the current proposal is more efficient
> than yours for *real-world* problems, which are the types of
> problem that we need to solve.
>
> Even if the performance tipped slightly the other way, I
> think that set-based operations on subtrees are a much more
> intuitive and accessible solution than yours, and so I would
> still advocate them. If you can, as I have asked before,
> produce a real-world problem that is badly solved by our
> transform, then please present it. I do not count the
> example below (yes, I've looked at it) as a real-world
> problem. Otherwise, I think your own data show that we have
> a good solution.
>
> Merlin
>
> r/geuer-pollmann@nue.et-inf.uni-siegen.de/2002.05.13/16:04:48
>> --On Donnerstag, 9. Mai 2002 23:50 +0200 Christian Geuer-Pollmann
>> <geuer-pollmann@nue.et-inf.uni-siegen.de> wrote:
>>
>>> When you look at the results below, you see that each step in the
>>> xfilter2spec_xfilter2_(1/2/3) adds more processing time, while this
is
>>> not the case for 'my' transform: I can't tell why they are as they
are.
>>
>> I constructed an obfuscated example in which I select some subtrees
for
>> inclusion/exclusion. To get the final result, I must perform 7
xfilter2
>> transforms. Each transform adds more overhead. To test how much each
>> transform requires, I ran the test suite 10 times, while the first
test
>> only did the 1st transform, the second test included the 1st and the
2nd
>> transform while the 7th test included all seven steps. The results
are
>> shown below
>>
>> 10 * apachesample_xfilter2_1     took  5,097 seconds
>> 10 * apachesample_xfilter2_2     took  5,879 seconds
>> 10 * apachesample_xfilter2_3     took  6,790 seconds
>> 10 * apachesample_xfilter2_4     took  7,290 seconds
>> 10 * apachesample_xfilter2_5     took  8,442 seconds
>> 10 * apachesample_xfilter2_6     took  9,503 seconds
>> 10 * apachesample_xfilter2_7     took 10,475 seconds
>>
>> You can see that the required time is strictly monotonic increasing
for
>> each added transform step. To check that against my own algorithm, I
>> write  7 transforms which achieve the same results as the
corresponding
>> xfilter2  ones:
>>
>> 10 * apachesample_apachefilter_1 took  5,207 seconds
>> 10 * apachesample_apachefilter_2 took  5,638 seconds
>> 10 * apachesample_apachefilter_3 took  5,408 seconds
>> 10 * apachesample_apachefilter_4 took  5,478 seconds
>> 10 * apachesample_apachefilter_5 took  5,528 seconds
>> 10 * apachesample_apachefilter_6 took  5,508 seconds
>> 10 * apachesample_apachefilter_7 took  5,358 seconds
>>
>> You can  see that the time is -- well, not constant, but does not
show
>> this  heavy dependency to take longer for complexer selections.
>>
>> BTW, if you wanna check this yourself, the implementation of the
>> transform  [1] is in the Apache CVS, the test suite also [2].
>>
>> Kind regards,
>> Christian
>>
>>
>> [1]
>>
<http://cvs.apache.org/viewcvs.cgi/xml-security/src/org/apache/xml/secur
>> ity
>>
/transforms/implementations/TransformXPathFilterCHGP.java?rev=1.2&conten
>> t-t ype=text/vnd.viewcvs-markup>
>>
>> [2]
>>
<http://cvs.apache.org/viewcvs.cgi/xml-security/src_samples/org/apache/x
>> ml/
>>
security/samples/TransformPerformanceTester.java?rev=1.3&content-type=te
>> xt/ vnd.viewcvs-markup>
Received on Tuesday, 21 May 2002 20:21:29 UTC