Re: Tree construction: Coalescing text nodes from Geoffrey Sneddon on 2009-11-11 (public-html@w3.org from November 2009)

From: Geoffrey Sneddon <gsneddon@opera.com>
Date: Wed, 11 Nov 2009 13:12:58 +0100
To: Ian Hickson <ian@hixie.ch>
CC: public-html@w3.org, pjt47@cam.ac.uk
Message-ID: <4AFAAA4A.7080000@opera.com>
On Fri, 29 Aug 2008, Ian Hickson wrote:

> On Sat, 31 May 2008, Philip Taylor wrote:
>> 
>> Consider a document like:
>> 
>> a</li>b</li>c</li>d...
>> 
>> The spec says that should be a single text node "abcd...". IE 6,
>> Firefox 3 and Safari 3 do that; Opera 9.5 instead splits it into
>> lots of text nodes.
>> 
>> Also consider:
>> 
>> a<table>b<td></td>c<td></td>d...
>> 
>> The spec says that should be a single text node "abcd...", then a
>> table. Firefox 3 and Safari 3 don't do that, and instead split it
>> into lots of adjacent text nodes. (IE and Opera don't do table
>> foster parenting like this at all.)
>> 
>> In many implementations (particularly when strings are immutable,
>> e.g. in Python), constructing the string "abcd..." by concatenating
>> lots of individual characters one at a time has cost O(n^2) in the
>> length of the string.
>> 
>> That problem can (I think) be avoided in the first case without too
>> much trouble: the parser's "append text node to current node"
>> method can build up a list of strings, and then they can be flushed
>> (concatenate all strings, append result to current node) once the
>> parser calls a different method ("append element to current node"
>> etc). But I can't see an adequately efficient way to avoid the
>> problem in the table case, particularly if scripts are allowed to 
>> observe and modify the DOM as it is being parsed.
>> 
>> So this is a DOS vulnerability, since an attacker can do 'n' work
>> sending a document to someone and cause them to do O(n^2) work
>> parsing it (even without scripting).
>> 
>> Rather than relying on implementors to notice and fix this problem
>> in incompatible ways, the spec should be changed to require
>> coalescence only in the cases where the parser is consecutively
>> appending text nodes to the current node and not in any other
>> cases.
> 
> I've tried to fix this. Please let me know if the fix (r2124) works
> for you. I'm not sure exactly what the implications of my fix are.

The change made to the spec makes the relevant text say:

> When the steps below require the UA to insert a character into a
> node, if that node has a child immediately before where the character
> is to be inserted, and that child is a Text node, and that Text node
> was the last node that the parser inserted into the document, then
> the character must be appended to that Text node; otherwise, a new
> Text node whose data is just that character must be inserted in the
> appropriate place.

This leads, as far as I can tell, to having adjacent text nodes only in 
the case where text from within a table is foster parented (ignoring 
cases where text nodes are explicitly inserted into the DOM via 
scripting). As such, this will only solve the problem for simple cases 
such as (as was the intention, as as Philip says it is perfectly 
possible to use a list of strings):

a<table>a<table>a<table>a<table>a<table>a<table>a<table>a<table>a etc.

However, given an implementation like that of all three ports of 
html5lib (Python, PHP, Ruby), the following would equally lead to O(n^2) 
behaviour given the above conditions (i.e., a immutable string type like 
that of Python):

a</x>a</x>a</x>a</x>a</x>a</x>a</x>a</x>a</x>a</x>a</x>a</x>a etc.

All that the above change to the spec did was require this issue to not 
exist in the case of using a DOM with text foster parented from tables. 
This issue still exists in plenty of other cases where there are tokens 
outputted that are not added to the document.

Also note that fixing it for pure implementations of the spec algorithm 
does not fix all cases, as html5lib still outputs multiple character 
tokens and concatenates them in cases such as:

a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a etc.

In short, I think no change to the spec can actually fix the possibility 
of an O(n^2) algorithm being used in an implementation: implementations, 
as ever, have to take their own care to make sure they are not 
vulnerable to any possible DOS attack.

As reported in <http://www.w3.org/Bugs/Public/show_bug.cgi?id=8239>, it 
is actually impossible for html5lib to implement the spec in all cases, 
as it is possible to use it with tree models that have no concept of 
adjacent text nodes (such as ones that are widely used in Python like 
ElementTree).

Even with tree models that do support adjacent text nodes it adds a lot 
of complexity to fix this single case of string concatenation (as in 
each method (there is one method per node type) that adds a node to the 
tree we need to store that was the last node added to the document, and 
check that every time add a text node to the document.

Furthermore, I don't believe this to be the only way to solve the 
problem in the table case: a better solution would be to just always 
concatenate text nodes by creating a list of strings and concatenating 
them when the next non-text non-table node is inserted, or when the 
table is moved off the stack of open elements if a table follows. This 
would solve the problem for all of the above issues, not just for the 
table foster-parenting issue.

The relative complexity between the solutions is negligible, and the 
latter, as Philip said, is mostly required anyway to avoid issues caused 
by the above examples that do not use table.

Overall, I think the spec gains nothing by requiring adjacent text nodes 
in the table foster parenting case (as it is still very easy for impls 
to use O(n^2) algorithms elsewhere), and that it adds needless 
complexity to implementations (as in some impls there are better ways of 
avoiding O(n^2) string concatenation in all cases, not just in one 
case), as well as making it impossible to implement with some tree 
models. I'd rather the spec went back to the old text, always requiring 
text nodes to be concatenated, though possibly with a note warning that 
implementers should be careful not to use O(n^2) concatenation algorithms.

-- 
Geoffrey Sneddon — Opera Software
<http://gsnedders.com/>
<http://www.opera.com/>
Received on Wednesday, 11 November 2009 12:13:51 UTC