- From: Geoffrey Sneddon <gsneddon@opera.com>
- Date: Wed, 11 Nov 2009 13:12:58 +0100
- To: Ian Hickson <ian@hixie.ch>
- CC: public-html@w3.org, pjt47@cam.ac.uk
On Fri, 29 Aug 2008, Ian Hickson wrote:
> On Sat, 31 May 2008, Philip Taylor wrote:
>>
>> Consider a document like:
>>
>> a</li>b</li>c</li>d...
>>
>> The spec says that should be a single text node "abcd...". IE 6,
>> Firefox 3 and Safari 3 do that; Opera 9.5 instead splits it into
>> lots of text nodes.
>>
>> Also consider:
>>
>> a<table>b<td></td>c<td></td>d...
>>
>> The spec says that should be a single text node "abcd...", then a
>> table. Firefox 3 and Safari 3 don't do that, and instead split it
>> into lots of adjacent text nodes. (IE and Opera don't do table
>> foster parenting like this at all.)
>>
>> In many implementations (particularly when strings are immutable,
>> e.g. in Python), constructing the string "abcd..." by concatenating
>> lots of individual characters one at a time has cost O(n^2) in the
>> length of the string.
>>
>> That problem can (I think) be avoided in the first case without too
>> much trouble: the parser's "append text node to current node"
>> method can build up a list of strings, and then they can be flushed
>> (concatenate all strings, append result to current node) once the
>> parser calls a different method ("append element to current node"
>> etc). But I can't see an adequately efficient way to avoid the
>> problem in the table case, particularly if scripts are allowed to
>> observe and modify the DOM as it is being parsed.
>>
>> So this is a DOS vulnerability, since an attacker can do 'n' work
>> sending a document to someone and cause them to do O(n^2) work
>> parsing it (even without scripting).
>>
>> Rather than relying on implementors to notice and fix this problem
>> in incompatible ways, the spec should be changed to require
>> coalescence only in the cases where the parser is consecutively
>> appending text nodes to the current node and not in any other
>> cases.
>
> I've tried to fix this. Please let me know if the fix (r2124) works
> for you. I'm not sure exactly what the implications of my fix are.
The change made to the spec makes the relevant text say:
> When the steps below require the UA to insert a character into a
> node, if that node has a child immediately before where the character
> is to be inserted, and that child is a Text node, and that Text node
> was the last node that the parser inserted into the document, then
> the character must be appended to that Text node; otherwise, a new
> Text node whose data is just that character must be inserted in the
> appropriate place.
This leads, as far as I can tell, to having adjacent text nodes only in
the case where text from within a table is foster parented (ignoring
cases where text nodes are explicitly inserted into the DOM via
scripting). As such, this will only solve the problem for simple cases
such as (as was the intention, as as Philip says it is perfectly
possible to use a list of strings):
a<table>a<table>a<table>a<table>a<table>a<table>a<table>a<table>a etc.
However, given an implementation like that of all three ports of
html5lib (Python, PHP, Ruby), the following would equally lead to O(n^2)
behaviour given the above conditions (i.e., a immutable string type like
that of Python):
a</x>a</x>a</x>a</x>a</x>a</x>a</x>a</x>a</x>a</x>a</x>a</x>a etc.
All that the above change to the spec did was require this issue to not
exist in the case of using a DOM with text foster parented from tables.
This issue still exists in plenty of other cases where there are tokens
outputted that are not added to the document.
Also note that fixing it for pure implementations of the spec algorithm
does not fix all cases, as html5lib still outputs multiple character
tokens and concatenates them in cases such as:
a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a etc.
In short, I think no change to the spec can actually fix the possibility
of an O(n^2) algorithm being used in an implementation: implementations,
as ever, have to take their own care to make sure they are not
vulnerable to any possible DOS attack.
As reported in <http://www.w3.org/Bugs/Public/show_bug.cgi?id=8239>, it
is actually impossible for html5lib to implement the spec in all cases,
as it is possible to use it with tree models that have no concept of
adjacent text nodes (such as ones that are widely used in Python like
ElementTree).
Even with tree models that do support adjacent text nodes it adds a lot
of complexity to fix this single case of string concatenation (as in
each method (there is one method per node type) that adds a node to the
tree we need to store that was the last node added to the document, and
check that every time add a text node to the document.
Furthermore, I don't believe this to be the only way to solve the
problem in the table case: a better solution would be to just always
concatenate text nodes by creating a list of strings and concatenating
them when the next non-text non-table node is inserted, or when the
table is moved off the stack of open elements if a table follows. This
would solve the problem for all of the above issues, not just for the
table foster-parenting issue.
The relative complexity between the solutions is negligible, and the
latter, as Philip said, is mostly required anyway to avoid issues caused
by the above examples that do not use table.
Overall, I think the spec gains nothing by requiring adjacent text nodes
in the table foster parenting case (as it is still very easy for impls
to use O(n^2) algorithms elsewhere), and that it adds needless
complexity to implementations (as in some impls there are better ways of
avoiding O(n^2) string concatenation in all cases, not just in one
case), as well as making it impossible to implement with some tree
models. I'd rather the spec went back to the old text, always requiring
text nodes to be concatenated, though possibly with a note warning that
implementers should be careful not to use O(n^2) concatenation algorithms.
--
Geoffrey Sneddon — Opera Software
<http://gsnedders.com/>
<http://www.opera.com/>
Received on Wednesday, 11 November 2009 12:13:51 UTC