- From: <pratik.datta@oracle.com>
- Date: Tue, 29 Sep 2009 10:31:49 -0700
- To: edsimon@xmlsec.com
- CC: Frederick Hirsch <frederick.hirsch@nokia.com>, XMLSec WG Public List <public-xmlsec@w3.org>, Scott Cantor <cantor.2@osu.edu>
- Message-ID: <4AC24485.8020506@oracle.com>
Continuing this morning's discussion on streamable XPath. There are definitely certain Xpaths that cannot be evaluated by a streaming XPath implementation, no matter how advanced an algorithm it has. For example suppose you have a document like this with a 1000 <b> elements. <a> <b/> <b/> <b/> ... <b/> </a> And you want to evaluate the expression that wants to find out <b> that is at the 100th position counting backwards. i.e. /a/b[position() = last() - 100] How can you evaluate this in streaming parser where you can go only one pass? You do not know how many <b> elements there are, so you must reach the end, and then go back. Even if Xpath has some kind of lookahead cache, that cache cannot hold 900 <b> elements. (Or if it can, then you can change the problem to have 10,000 <b> elements) My point is that there definitely are XPaths that no streaming XPath implementation can evaluate, and these are too obvious to need a mathematical proof.. So we need to define a subset of XPaths that are streamable. We can be either be conservative and include a smaller subset, or try to find out the subset that a most advanced streaming XPath implementation can execute. My opinion is that we should go for a subset that includes common usage patterns. Pratik On 9/9/2009 6:22 PM, pratik.datta@oracle.com wrote: > For the binary data as base64, I had suggested that we do support > XPath, but this XPath selects the element that is the parent of the > text node, not the text node itself. > > The problem is that text nodes are often very large. Streaming parsers > like StAX often split up text nodes into many adjacent text nodes > (StaX has isCoalescing property which controls this) . E.g. if there > is a text node that is 1MB in size it might be split up into a > thousand nodes of 1KB each. So if we allow an XPath to select text > nodes, the XPath implementation would need to coalesce adjacent text > nodes into one which is very inefficient. Many academic XPath > implementation do not consider this issue, but this is a very big > performance issue - we absolutely cannot assume that the whole text > node can be loaded at once - it needs to be processed chunk by chunk. > Note XML encryption is itself a big culprit for generating large text > nodes - the <CipherValue> node is often extremely large. > > > This text node is one of the big reasons for many of the limitations > that I had put. For example the "string-value" function for element > nodes uses text node, so I had disallow that. > > Another source of limitations is the result of the XPath has to be > subtree which has to be canonicalized and digested. That means you > cannot have already entered the subtree during the XPath evaluation, > otherwise you would have already read part of the subtree, and you are > not allowed to reread it, when you are canonicalizing it. E.g. > suppose you have an expression like this > /body/section[para/line] > This XPath can be easily resolved by these Streaming XPath parsers > that you mentioned, however once you have resolved it and found the > section element, which has a para/line child, you have to go back to > the section element and canonicalize it all, and going backwards is > not allowed in streaming parser. This is why I have limited the > predicate expression to only include attributes. > > Let me try to figure out the way to put in the id function. id() > either takes in a nodeset or a string and maybe we can only support > the string. > > I believe with all the feedback that we get, we can arrive precisely > defined XPath subset, which cuts out all the features affecting > streamability which still retaining all useful features. > > Pratik > > > > > > > On 9/9/2009 4:36 PM, Ed Simon wrote: >> Pratik's document discussing limiting XPath productions allows only >> element nodes to be in the output node set. As I mentioned in this >> week's meeting: "It seems to me that a transform that results in a >> single text node should be supported. For example, an app stores binary >> data as base64 in an XML element and wants to hash (on signing and >> validation) the original raw binary. On validation, use XPath to select >> the text, then feed that to the base64-decoding before hashing.". I >> would suggest adding the ability to return a text node. >> >> More comments... >> >> Why isn't the id() function allowed? Could you please explain why that >> is not streamable? >> >> I've also been looking on the Web for research about efficient XPath >> streaming and have found a number of works including some describing >> very efficient streaming XPath algorithms that do not require many of >> the limitations proposed in Pratik's document. Here are some links: >> >> http://www.pms.informatik.uni-muenchen.de/publikationen/PMS-FB/PMS-FB-2001-16/slides_dagstuhl_2002.pdf >> >> http://cs.nyu.edu/~deepak/publications/icde.pdf >> >> http://xml.coverpages.org/BartonPLANX2002.pdf >> >> To me, the above articles suggest that many of the XPath limitations >> proposed in Pratik's document are unnecessary wrt streaming and >> efficiency. Comments? >> >> Ed >> >> >> On Tue, 2009-09-08 at 12:07 -0400, Frederick Hirsch wrote: >> >>> I uploaded a version of the Streamable XPath Subset document Pratik >>> sent, so that colors are preserved on the web >>> >>> It is available at >>> http://www.w3.org/2008/xmlsec/Drafts/proposals/Streamable-XPath-subset.html >>> >>> I created a "proposals" directory in CVS for this sort of thing. >>> >>> Scott, perhaps you can include this link in the minutes so readers >>> know what we were discussing. >>> >>> regards, Frederick >>> >>> Frederick Hirsch >>> Nokia >>> >>> >>> >>> >>> >>> >> >> >>
Received on Tuesday, 29 September 2009 17:32:59 UTC