Re: streamable xpath subsets from pratik.datta@oracle.com on 2009-09-29 (public-xmlsec@w3.org from September 2009)

From: <pratik.datta@oracle.com>
Date: Tue, 29 Sep 2009 10:31:49 -0700
To: edsimon@xmlsec.com
CC: Frederick Hirsch <frederick.hirsch@nokia.com>, XMLSec WG Public List <public-xmlsec@w3.org>, Scott Cantor <cantor.2@osu.edu>
Message-ID: <4AC24485.8020506@oracle.com>
Continuing this morning's discussion on streamable XPath.

There are definitely certain Xpaths that cannot be evaluated by a 
streaming XPath implementation, no matter how advanced an algorithm it 
has. For example suppose you have a document like this with a 1000 <b> 
elements.

<a>
  <b/>
  <b/>
  <b/>
  ...
  <b/>
</a>

And you want to evaluate the expression that wants to find out <b> that 
is at the 100th  position counting backwards.
i.e.
 /a/b[position() = last() - 100]

How can you evaluate this in streaming parser where you can go only one 
pass? You do not know how many <b> elements there are, so you must reach 
the end, and then go back.  Even if Xpath has some kind of lookahead 
cache, that cache cannot hold 900 <b> elements. (Or if it can, then you 
can change the problem to have 10,000 <b> elements)

My point is that there definitely are XPaths that no streaming XPath 
implementation can evaluate, and these are too obvious to need a 
mathematical proof.. So we need to define a subset of XPaths that are 
streamable. We can be either be conservative and include a smaller 
subset, or try to find out the subset that a most advanced streaming 
XPath implementation can execute. My opinion is that we should go for a 
subset that includes common usage patterns.

Pratik

On 9/9/2009 6:22 PM, pratik.datta@oracle.com wrote:
> For the binary data as base64, I had suggested that we do support 
> XPath, but this XPath selects the element that is the parent of the 
> text node, not the text node itself.
>
> The problem is that text nodes are often very large. Streaming parsers 
> like StAX often split up text nodes into many adjacent text nodes 
> (StaX has isCoalescing property which controls this) . E.g. if there 
> is a text node that is 1MB in size it might be split up into a 
> thousand nodes of 1KB each.  So if we allow an XPath to select text 
> nodes, the XPath implementation would need to coalesce adjacent text 
> nodes into one which is very inefficient.  Many academic XPath 
> implementation do not consider this issue, but this is a very big 
> performance issue - we absolutely cannot assume that the whole text 
> node can be loaded at once - it needs to be processed chunk by chunk. 
> Note XML encryption is itself a big culprit for generating large text 
> nodes - the <CipherValue> node is often extremely large.
>
>
> This text node is one of the big reasons for many of the limitations 
> that I had put. For example the "string-value" function for element 
> nodes uses text node, so I had disallow that. 
>
> Another source of limitations is the result of the XPath has to be 
> subtree which has to be canonicalized and digested. That means you 
> cannot have already entered the subtree during the XPath evaluation, 
> otherwise you would have already read part of the subtree, and you are 
> not allowed to reread it, when you are canonicalizing it.  E.g. 
> suppose you have an expression like this  
> /body/section[para/line] 
> This XPath can be easily resolved by these Streaming XPath parsers 
> that you mentioned, however once you have resolved it and found the 
> section element, which has a para/line child, you have to go back to 
> the section element and canonicalize it all, and going backwards is 
> not allowed in streaming parser. This is why I have limited the 
> predicate expression to only include attributes.
>
> Let me try to figure out the way to put in the id function. id() 
> either takes in a nodeset or a string and maybe we can only support 
> the string.
>
> I believe with all the feedback that we get, we can arrive precisely 
> defined XPath subset, which cuts out all the features affecting 
> streamability which still retaining all useful features.
>
> Pratik
>
>
>
>
>
>
> On 9/9/2009 4:36 PM, Ed Simon wrote:
>> Pratik's document discussing limiting XPath productions allows only
>> element nodes to be in the output node set. As I mentioned in this
>> week's meeting: "It seems to me that a transform that results in a
>> single text node should be supported. For example, an app stores binary
>> data as base64 in an XML element and wants to hash (on signing and
>> validation) the original raw binary. On validation, use XPath to select
>> the text, then feed that to the base64-decoding before hashing.". I
>> would suggest adding the ability to return a text node.
>>
>> More comments...
>>
>> Why isn't the id() function allowed? Could you please explain why that
>> is not streamable?
>>
>> I've also been looking on the Web for research about efficient XPath
>> streaming and have found a number of works including some describing
>> very efficient streaming XPath algorithms that do not require many of
>> the limitations proposed in Pratik's document. Here are some links:
>>
>> http://www.pms.informatik.uni-muenchen.de/publikationen/PMS-FB/PMS-FB-2001-16/slides_dagstuhl_2002.pdf
>>
>> http://cs.nyu.edu/~deepak/publications/icde.pdf
>>
>> http://xml.coverpages.org/BartonPLANX2002.pdf
>>
>> To me, the above articles suggest that many of the XPath limitations
>> proposed in Pratik's document are unnecessary wrt streaming and
>> efficiency. Comments?
>>
>> Ed
>>
>>
>> On Tue, 2009-09-08 at 12:07 -0400, Frederick Hirsch wrote:
>>   
>>> I uploaded  a version of the Streamable XPath Subset document Pratik  
>>> sent, so that colors are preserved on the web
>>>
>>> It is available at
>>> http://www.w3.org/2008/xmlsec/Drafts/proposals/Streamable-XPath-subset.html
>>>
>>> I created a "proposals" directory in CVS for this sort of thing.
>>>
>>> Scott, perhaps you can include this link in the minutes so readers  
>>> know what we were discussing.
>>>
>>> regards, Frederick
>>>
>>> Frederick Hirsch
>>> Nokia
>>>
>>>
>>>
>>>
>>>
>>>     
>>
>>
>>
Received on Tuesday, 29 September 2009 17:32:59 UTC