Re: streamable xpath subsets

For the binary data as base64, I had suggested that we do support XPath, 
but this XPath selects the element that is the parent of the text node, 
not the text node itself.

The problem is that text nodes are often very large. Streaming parsers 
like StAX often split up text nodes into many adjacent text nodes (StaX 
has isCoalescing property which controls this) . E.g. if there is a text 
node that is 1MB in size it might be split up into a thousand nodes of 
1KB each.  So if we allow an XPath to select text nodes, the XPath 
implementation would need to coalesce adjacent text nodes into one which 
is very inefficient.  Many academic XPath implementation do not consider 
this issue, but this is a very big performance issue - we absolutely 
cannot assume that the whole text node can be loaded at once - it needs 
to be processed chunk by chunk. Note XML encryption is itself a big 
culprit for generating large text nodes - the <CipherValue> node is 
often extremely large.


This text node is one of the big reasons for many of the limitations 
that I had put. For example the "string-value" function for element 
nodes uses text node, so I had disallow that. 

Another source of limitations is the result of the XPath has to be 
subtree which has to be canonicalized and digested. That means you 
cannot have already entered the subtree during the XPath evaluation, 
otherwise you would have already read part of the subtree, and you are 
not allowed to reread it, when you are canonicalizing it.  E.g. suppose 
you have an expression like this  
/body/section[para/line] 
This XPath can be easily resolved by these Streaming XPath parsers that 
you mentioned, however once you have resolved it and found the section 
element, which has a para/line child, you have to go back to the section 
element and canonicalize it all, and going backwards is not allowed in 
streaming parser. This is why I have limited the predicate expression to 
only include attributes.

Let me try to figure out the way to put in the id function. id() either 
takes in a nodeset or a string and maybe we can only support the string.

I believe with all the feedback that we get, we can arrive precisely 
defined XPath subset, which cuts out all the features affecting 
streamability which still retaining all useful features.

Pratik






On 9/9/2009 4:36 PM, Ed Simon wrote:
> Pratik's document discussing limiting XPath productions allows only
> element nodes to be in the output node set. As I mentioned in this
> week's meeting: "It seems to me that a transform that results in a
> single text node should be supported. For example, an app stores binary
> data as base64 in an XML element and wants to hash (on signing and
> validation) the original raw binary. On validation, use XPath to select
> the text, then feed that to the base64-decoding before hashing.". I
> would suggest adding the ability to return a text node.
>
> More comments...
>
> Why isn't the id() function allowed? Could you please explain why that
> is not streamable?
>
> I've also been looking on the Web for research about efficient XPath
> streaming and have found a number of works including some describing
> very efficient streaming XPath algorithms that do not require many of
> the limitations proposed in Pratik's document. Here are some links:
>
> http://www.pms.informatik.uni-muenchen.de/publikationen/PMS-FB/PMS-FB-2001-16/slides_dagstuhl_2002.pdf
>
> http://cs.nyu.edu/~deepak/publications/icde.pdf
>
> http://xml.coverpages.org/BartonPLANX2002.pdf
>
> To me, the above articles suggest that many of the XPath limitations
> proposed in Pratik's document are unnecessary wrt streaming and
> efficiency. Comments?
>
> Ed
>
>
> On Tue, 2009-09-08 at 12:07 -0400, Frederick Hirsch wrote:
>   
>> I uploaded  a version of the Streamable XPath Subset document Pratik  
>> sent, so that colors are preserved on the web
>>
>> It is available at
>> http://www.w3.org/2008/xmlsec/Drafts/proposals/Streamable-XPath-subset.html
>>
>> I created a "proposals" directory in CVS for this sort of thing.
>>
>> Scott, perhaps you can include this link in the minutes so readers  
>> know what we were discussing.
>>
>> regards, Frederick
>>
>> Frederick Hirsch
>> Nokia
>>
>>
>>
>>
>>
>>     
>
>
>   

Received on Thursday, 10 September 2009 01:24:23 UTC