- From: Geoffrey Sneddon <gsneddon@opera.com>
- Date: Wed, 11 Nov 2009 13:12:58 +0100
- To: Ian Hickson <ian@hixie.ch>
- CC: public-html@w3.org, pjt47@cam.ac.uk
On Fri, 29 Aug 2008, Ian Hickson wrote: > On Sat, 31 May 2008, Philip Taylor wrote: >> >> Consider a document like: >> >> a</li>b</li>c</li>d... >> >> The spec says that should be a single text node "abcd...". IE 6, >> Firefox 3 and Safari 3 do that; Opera 9.5 instead splits it into >> lots of text nodes. >> >> Also consider: >> >> a<table>b<td></td>c<td></td>d... >> >> The spec says that should be a single text node "abcd...", then a >> table. Firefox 3 and Safari 3 don't do that, and instead split it >> into lots of adjacent text nodes. (IE and Opera don't do table >> foster parenting like this at all.) >> >> In many implementations (particularly when strings are immutable, >> e.g. in Python), constructing the string "abcd..." by concatenating >> lots of individual characters one at a time has cost O(n^2) in the >> length of the string. >> >> That problem can (I think) be avoided in the first case without too >> much trouble: the parser's "append text node to current node" >> method can build up a list of strings, and then they can be flushed >> (concatenate all strings, append result to current node) once the >> parser calls a different method ("append element to current node" >> etc). But I can't see an adequately efficient way to avoid the >> problem in the table case, particularly if scripts are allowed to >> observe and modify the DOM as it is being parsed. >> >> So this is a DOS vulnerability, since an attacker can do 'n' work >> sending a document to someone and cause them to do O(n^2) work >> parsing it (even without scripting). >> >> Rather than relying on implementors to notice and fix this problem >> in incompatible ways, the spec should be changed to require >> coalescence only in the cases where the parser is consecutively >> appending text nodes to the current node and not in any other >> cases. > > I've tried to fix this. Please let me know if the fix (r2124) works > for you. I'm not sure exactly what the implications of my fix are. The change made to the spec makes the relevant text say: > When the steps below require the UA to insert a character into a > node, if that node has a child immediately before where the character > is to be inserted, and that child is a Text node, and that Text node > was the last node that the parser inserted into the document, then > the character must be appended to that Text node; otherwise, a new > Text node whose data is just that character must be inserted in the > appropriate place. This leads, as far as I can tell, to having adjacent text nodes only in the case where text from within a table is foster parented (ignoring cases where text nodes are explicitly inserted into the DOM via scripting). As such, this will only solve the problem for simple cases such as (as was the intention, as as Philip says it is perfectly possible to use a list of strings): a<table>a<table>a<table>a<table>a<table>a<table>a<table>a<table>a etc. However, given an implementation like that of all three ports of html5lib (Python, PHP, Ruby), the following would equally lead to O(n^2) behaviour given the above conditions (i.e., a immutable string type like that of Python): a</x>a</x>a</x>a</x>a</x>a</x>a</x>a</x>a</x>a</x>a</x>a</x>a etc. All that the above change to the spec did was require this issue to not exist in the case of using a DOM with text foster parented from tables. This issue still exists in plenty of other cases where there are tokens outputted that are not added to the document. Also note that fixing it for pure implementations of the spec algorithm does not fix all cases, as html5lib still outputs multiple character tokens and concatenates them in cases such as: a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a etc. In short, I think no change to the spec can actually fix the possibility of an O(n^2) algorithm being used in an implementation: implementations, as ever, have to take their own care to make sure they are not vulnerable to any possible DOS attack. As reported in <http://www.w3.org/Bugs/Public/show_bug.cgi?id=8239>, it is actually impossible for html5lib to implement the spec in all cases, as it is possible to use it with tree models that have no concept of adjacent text nodes (such as ones that are widely used in Python like ElementTree). Even with tree models that do support adjacent text nodes it adds a lot of complexity to fix this single case of string concatenation (as in each method (there is one method per node type) that adds a node to the tree we need to store that was the last node added to the document, and check that every time add a text node to the document. Furthermore, I don't believe this to be the only way to solve the problem in the table case: a better solution would be to just always concatenate text nodes by creating a list of strings and concatenating them when the next non-text non-table node is inserted, or when the table is moved off the stack of open elements if a table follows. This would solve the problem for all of the above issues, not just for the table foster-parenting issue. The relative complexity between the solutions is negligible, and the latter, as Philip said, is mostly required anyway to avoid issues caused by the above examples that do not use table. Overall, I think the spec gains nothing by requiring adjacent text nodes in the table foster parenting case (as it is still very easy for impls to use O(n^2) algorithms elsewhere), and that it adds needless complexity to implementations (as in some impls there are better ways of avoiding O(n^2) string concatenation in all cases, not just in one case), as well as making it impossible to implement with some tree models. I'd rather the spec went back to the old text, always requiring text nodes to be concatenated, though possibly with a note warning that implementers should be careful not to use O(n^2) concatenation algorithms. -- Geoffrey Sneddon — Opera Software <http://gsnedders.com/> <http://www.opera.com/>
Received on Wednesday, 11 November 2009 12:13:51 UTC