Re: streamable xpath subsets - and additional 2.0 requirements? from Frederick Hirsch on 2009-12-29 (public-xmlsec@w3.org from December 2009)

From: Frederick Hirsch <Frederick.Hirsch@nokia.com>
Date: Tue, 29 Dec 2009 13:51:27 -0500
To: "edsimon@xmlsec.com" <edsimon@xmlsec.com>
Cc: Frederick Hirsch <Frederick.Hirsch@nokia.com>, "pratik.datta@oracle.com" <pratik.datta@oracle.com>, XMLSec WG Public List <public-xmlsec@w3.org>, Scott Cantor <cantor.2@osu.edu>
Message-Id: <C0A9F972-6EE3-4377-ACCA-3D436192F2A9@nokia.com>
I have some belated comments on this earlier message with links to  
material on streaming.

(1) Explicit streamability requirements

The papers suggest that it is hard to define streamability since it  
is  a continuum based on amount of buffering required in various  
circumstances, and degree of knowledge of schema and so on. I think we  
have neatly side-stepped this issue in the 2.0 requirements by not  
defining streaming, but rather stating specific requirements that  
enable streaming, such as not requiring the DOM and nodeset.

However we do not list these as specific requirements.

Should we add the following requirements to the 2.0 Requirements draft?

Requirements:
Do not require DOM implementation.
Do not specify XML Signature 2.0 processing in terms of nodesets.

Are there any other specific requirements we should list/

(2) no awareness of 2.0 patent issues

I am not aware of any patent issues related to 2.0 and streamability,  
especially since we talk in very general terms in the specification  
and avoid requiring nodesets and full XPath. Thus the concern of the  
second reference (which I have not investigated) is probably not an  
issue.
If anyone is aware of any issue, please raise it with the WG.

(3) Other ideas, out of scope even if useful.

The papers suggest some interesting ideas, such as preprocessing XPath  
for streamability. This might be useful but I suggest it is not in the  
scope of our immediate deliverables. Likewise, using schemas and  
evaluating streamability dynamically are likely out of our WG scope as  
well....

Thanks

regards, Frederick

Frederick Hirsch
Nokia



On Sep 30, 2009, at 9:26 PM, ext Ed Simon wrote:

> Thanks Pratik,
>
> Found some more references to ponder:
>
> http://www.jenitennison.com/blog/node/61
>
> http://lists.xml.org/archives/xml-dev/200906/msg00160.html
>
> It appears quite a number of people have looked at streaming XPath and
> there is not necessarily agreement as to what the streamable subset  
> is.
> Definitely an interesting topic!
>
> One of Jeni's points is that streamability may depend upon how much  
> the
> XPath processor may know about the markup. I know that does not sound
> generic, but suppose one could dynamically provide the schema to the
> XPath processor before processing so that it could dynamically  
> determine
> streamability.
>
> Ed
>
>
> On Tue, 2009-09-29 at 10:31 -0700, pratik.datta@oracle.com wrote:
>> Continuing this morning's discussion on streamable XPath.
>>
>> There are definitely certain Xpaths that cannot be evaluated by a
>> streaming XPath implementation, no matter how advanced an algorithm  
>> it
>> has. For example suppose you have a document like this with a 1000  
>> <b>
>> elements.
>>
>> <a>
>>  <b/>
>>  <b/>
>>  <b/>
>>  ...
>>  <b/>
>> </a>
>>
>> And you want to evaluate the expression that wants to find out <b>
>> that is at the 100th  position counting backwards.
>> i.e.
>> /a/b[position() = last() - 100]
>>
>> How can you evaluate this in streaming parser where you can go only
>> one pass? You do not know how many <b> elements there are, so you  
>> must
>> reach the end, and then go back.  Even if Xpath has some kind of
>> lookahead cache, that cache cannot hold 900 <b> elements. (Or if it
>> can, then you can change the problem to have 10,000 <b> elements)
>>
>> My point is that there definitely are XPaths that no streaming XPath
>> implementation can evaluate, and these are too obvious to need a
>> mathematical proof.. So we need to define a subset of XPaths that are
>> streamable. We can be either be conservative and include a smaller
>> subset, or try to find out the subset that a most advanced streaming
>> XPath implementation can execute. My opinion is that we should go for
>> a subset that includes common usage patterns.
>>
>> Pratik
>>
>> On 9/9/2009 6:22 PM, pratik.datta@oracle.com wrote:
>>> For the binary data as base64, I had suggested that we do support
>>> XPath, but this XPath selects the element that is the parent of the
>>> text node, not the text node itself.
>>>
>>> The problem is that text nodes are often very large. Streaming
>>> parsers like StAX often split up text nodes into many adjacent text
>>> nodes (StaX has isCoalescing property which controls this) . E.g. if
>>> there is a text node that is 1MB in size it might be split up into a
>>> thousand nodes of 1KB each.  So if we allow an XPath to select text
>>> nodes, the XPath implementation would need to coalesce adjacent text
>>> nodes into one which is very inefficient.  Many academic XPath
>>> implementation do not consider this issue, but this is a very big
>>> performance issue - we absolutely cannot assume that the whole text
>>> node can be loaded at once - it needs to be processed chunk by
>>> chunk. Note XML encryption is itself a big culprit for generating
>>> large text nodes - the <CipherValue> node is often extremely large.
>>>
>>>
>>> This text node is one of the big reasons for many of the limitations
>>> that I had put. For example the "string-value" function for element
>>> nodes uses text node, so I had disallow that.
>>>
>>> Another source of limitations is the result of the XPath has to be
>>> subtree which has to be canonicalized and digested. That means you
>>> cannot have already entered the subtree during the XPath evaluation,
>>> otherwise you would have already read part of the subtree, and you
>>> are not allowed to reread it, when you are canonicalizing it.  E.g.
>>> suppose you have an expression like this
>>> /body/section[para/line]
>>> This XPath can be easily resolved by these Streaming XPath parsers
>>> that you mentioned, however once you have resolved it and found the
>>> section element, which has a para/line child, you have to go back to
>>> the section element and canonicalize it all, and going backwards is
>>> not allowed in streaming parser. This is why I have limited the
>>> predicate expression to only include attributes.
>>>
>>> Let me try to figure out the way to put in the id function. id()
>>> either takes in a nodeset or a string and maybe we can only support
>>> the string.
>>>
>>> I believe with all the feedback that we get, we can arrive precisely
>>> defined XPath subset, which cuts out all the features affecting
>>> streamability which still retaining all useful features.
>>>
>>> Pratik
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 9/9/2009 4:36 PM, Ed Simon wrote:
>>>> Pratik's document discussing limiting XPath productions allows only
>>>> element nodes to be in the output node set. As I mentioned in this
>>>> week's meeting: "It seems to me that a transform that results in a
>>>> single text node should be supported. For example, an app stores  
>>>> binary
>>>> data as base64 in an XML element and wants to hash (on signing and
>>>> validation) the original raw binary. On validation, use XPath to  
>>>> select
>>>> the text, then feed that to the base64-decoding before hashing.". I
>>>> would suggest adding the ability to return a text node.
>>>>
>>>> More comments...
>>>>
>>>> Why isn't the id() function allowed? Could you please explain why  
>>>> that
>>>> is not streamable?
>>>>
>>>> I've also been looking on the Web for research about efficient  
>>>> XPath
>>>> streaming and have found a number of works including some  
>>>> describing
>>>> very efficient streaming XPath algorithms that do not require  
>>>> many of
>>>> the limitations proposed in Pratik's document. Here are some links:
>>>>
>>>> http://www.pms.informatik.uni-muenchen.de/publikationen/PMS-FB/PMS-FB-2001-16/slides_dagstuhl_2002.pdf
>>>>
>>>> http://cs.nyu.edu/~deepak/publications/icde.pdf
>>>>
>>>> http://xml.coverpages.org/BartonPLANX2002.pdf
>>>>
>>>> To me, the above articles suggest that many of the XPath  
>>>> limitations
>>>> proposed in Pratik's document are unnecessary wrt streaming and
>>>> efficiency. Comments?
>>>>
>>>> Ed
>>>>
>>>>
>>>> On Tue, 2009-09-08 at 12:07 -0400, Frederick Hirsch wrote:
>>>>
>>>>> I uploaded  a version of the Streamable XPath Subset document  
>>>>> Pratik
>>>>> sent, so that colors are preserved on the web
>>>>>
>>>>> It is available at
>>>>> http://www.w3.org/2008/xmlsec/Drafts/proposals/Streamable-XPath-subset.html
>>>>>
>>>>> I created a "proposals" directory in CVS for this sort of thing.
>>>>>
>>>>> Scott, perhaps you can include this link in the minutes so readers
>>>>> know what we were discussing.
>>>>>
>>>>> regards, Frederick
>>>>>
>>>>> Frederick Hirsch
>>>>> Nokia
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>
Received on Tuesday, 29 December 2009 18:53:11 UTC