W3C home > Mailing lists > Public > public-xmlsec@w3.org > October 2009

Re: streamable xpath subsets

From: <pratik.datta@oracle.com>
Date: Thu, 01 Oct 2009 06:22:58 -0700
Message-ID: <4AC4AD32.2050600@oracle.com>
To: edsimon@xmlsec.com
CC: Frederick Hirsch <frederick.hirsch@nokia.com>, XMLSec WG Public List <public-xmlsec@w3.org>, Scott Cantor <cantor.2@osu.edu>
All the problem XPaths pointed out by Jeni are not included n the subset 
that I defined

   1.  |//section
      | Jeni says that this one is definitely streamable and it is in
      our subset.
   2. |//section[@type = 'summary']
      |As Jeni points out this is streamble in SAX and StAX which reads
      the entire start tag including attributes in one event.  It is
      also supported in .NET parsers. It is in our subset.
   3. |//section[title = 'Summary']
      |This is not in our subset. Our predicates allow only expressions
      using attributes.
   4. |//section[title[1] = 'Summary']
      |Again this not in our subset either.
   5. |/database[dummy]/record|
      This is also not in our subset. We allow predicates only on the
      last step, and also predicates cannot use elements, only attributes.

Pratik

On 9/30/2009 6:26 PM, Ed Simon wrote:
> Thanks Pratik,
>
> Found some more references to ponder:
>
> http://www.jenitennison.com/blog/node/61
>
> http://lists.xml.org/archives/xml-dev/200906/msg00160.html
>
> It appears quite a number of people have looked at streaming XPath and
> there is not necessarily agreement as to what the streamable subset is.
> Definitely an interesting topic!
>
> One of Jeni's points is that streamability may depend upon how much the
> XPath processor may know about the markup. I know that does not sound
> generic, but suppose one could dynamically provide the schema to the
> XPath processor before processing so that it could dynamically determine
> streamability.
>
> Ed
>
>
> On Tue, 2009-09-29 at 10:31 -0700, pratik.datta@oracle.com wrote:
>   
>> Continuing this morning's discussion on streamable XPath.
>>
>> There are definitely certain Xpaths that cannot be evaluated by a
>> streaming XPath implementation, no matter how advanced an algorithm it
>> has. For example suppose you have a document like this with a 1000 <b>
>> elements.
>>
>> <a>
>>   <b/>
>>   <b/>
>>   <b/>
>>   ...
>>   <b/>
>> </a>
>>
>> And you want to evaluate the expression that wants to find out <b>
>> that is at the 100th  position counting backwards.
>> i.e.
>>  /a/b[position() = last() - 100]
>>
>> How can you evaluate this in streaming parser where you can go only
>> one pass? You do not know how many <b> elements there are, so you must
>> reach the end, and then go back.  Even if Xpath has some kind of
>> lookahead cache, that cache cannot hold 900 <b> elements. (Or if it
>> can, then you can change the problem to have 10,000 <b> elements)
>>
>> My point is that there definitely are XPaths that no streaming XPath
>> implementation can evaluate, and these are too obvious to need a
>> mathematical proof.. So we need to define a subset of XPaths that are
>> streamable. We can be either be conservative and include a smaller
>> subset, or try to find out the subset that a most advanced streaming
>> XPath implementation can execute. My opinion is that we should go for
>> a subset that includes common usage patterns.
>>
>> Pratik
>>
>> On 9/9/2009 6:22 PM, pratik.datta@oracle.com wrote: 
>>     
>>> For the binary data as base64, I had suggested that we do support
>>> XPath, but this XPath selects the element that is the parent of the
>>> text node, not the text node itself.
>>>
>>> The problem is that text nodes are often very large. Streaming
>>> parsers like StAX often split up text nodes into many adjacent text
>>> nodes (StaX has isCoalescing property which controls this) . E.g. if
>>> there is a text node that is 1MB in size it might be split up into a
>>> thousand nodes of 1KB each.  So if we allow an XPath to select text
>>> nodes, the XPath implementation would need to coalesce adjacent text
>>> nodes into one which is very inefficient.  Many academic XPath
>>> implementation do not consider this issue, but this is a very big
>>> performance issue - we absolutely cannot assume that the whole text
>>> node can be loaded at once - it needs to be processed chunk by
>>> chunk. Note XML encryption is itself a big culprit for generating
>>> large text nodes - the <CipherValue> node is often extremely large. 
>>>
>>>
>>> This text node is one of the big reasons for many of the limitations
>>> that I had put. For example the "string-value" function for element
>>> nodes uses text node, so I had disallow that.  
>>>
>>> Another source of limitations is the result of the XPath has to be
>>> subtree which has to be canonicalized and digested. That means you
>>> cannot have already entered the subtree during the XPath evaluation,
>>> otherwise you would have already read part of the subtree, and you
>>> are not allowed to reread it, when you are canonicalizing it.  E.g.
>>> suppose you have an expression like this   
>>> /body/section[para/line]  
>>> This XPath can be easily resolved by these Streaming XPath parsers
>>> that you mentioned, however once you have resolved it and found the
>>> section element, which has a para/line child, you have to go back to
>>> the section element and canonicalize it all, and going backwards is
>>> not allowed in streaming parser. This is why I have limited the
>>> predicate expression to only include attributes.
>>>
>>> Let me try to figure out the way to put in the id function. id()
>>> either takes in a nodeset or a string and maybe we can only support
>>> the string.
>>>
>>> I believe with all the feedback that we get, we can arrive precisely
>>> defined XPath subset, which cuts out all the features affecting
>>> streamability which still retaining all useful features.
>>>
>>> Pratik
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 9/9/2009 4:36 PM, Ed Simon wrote: 
>>>       
>>>> Pratik's document discussing limiting XPath productions allows only
>>>> element nodes to be in the output node set. As I mentioned in this
>>>> week's meeting: "It seems to me that a transform that results in a
>>>> single text node should be supported. For example, an app stores binary
>>>> data as base64 in an XML element and wants to hash (on signing and
>>>> validation) the original raw binary. On validation, use XPath to select
>>>> the text, then feed that to the base64-decoding before hashing.". I
>>>> would suggest adding the ability to return a text node.
>>>>
>>>> More comments...
>>>>
>>>> Why isn't the id() function allowed? Could you please explain why that
>>>> is not streamable?
>>>>
>>>> I've also been looking on the Web for research about efficient XPath
>>>> streaming and have found a number of works including some describing
>>>> very efficient streaming XPath algorithms that do not require many of
>>>> the limitations proposed in Pratik's document. Here are some links:
>>>>
>>>> http://www.pms.informatik.uni-muenchen.de/publikationen/PMS-FB/PMS-FB-2001-16/slides_dagstuhl_2002.pdf
>>>>
>>>> http://cs.nyu.edu/~deepak/publications/icde.pdf
>>>>
>>>> http://xml.coverpages.org/BartonPLANX2002.pdf
>>>>
>>>> To me, the above articles suggest that many of the XPath limitations
>>>> proposed in Pratik's document are unnecessary wrt streaming and
>>>> efficiency. Comments?
>>>>
>>>> Ed
>>>>
>>>>
>>>> On Tue, 2009-09-08 at 12:07 -0400, Frederick Hirsch wrote:
>>>>   
>>>>         
>>>>> I uploaded  a version of the Streamable XPath Subset document Pratik  
>>>>> sent, so that colors are preserved on the web
>>>>>
>>>>> It is available at
>>>>> http://www.w3.org/2008/xmlsec/Drafts/proposals/Streamable-XPath-subset.html
>>>>>
>>>>> I created a "proposals" directory in CVS for this sort of thing.
>>>>>
>>>>> Scott, perhaps you can include this link in the minutes so readers  
>>>>> know what we were discussing.
>>>>>
>>>>> regards, Frederick
>>>>>
>>>>> Frederick Hirsch
>>>>> Nokia
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>     
>>>>>           
>>>>   
>>>>         
>
>
>   
Received on Thursday, 1 October 2009 13:23:48 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Monday, 7 December 2009 10:44:00 GMT