Tree construction: Coalescing text nodes

Consider a document like:

   a</li>b</li>c</li>d...

The spec says that should be a single text node "abcd...". IE 6, Firefox 
3 and Safari 3 do that; Opera 9.5 instead splits it into lots of text nodes.

Also consider:

   a<table>b<td></td>c<td></td>d...

The spec says that should be a single text node "abcd...", then a table. 
  Firefox 3 and Safari 3 don't do that, and instead split it into lots 
of adjacent text nodes. (IE and Opera don't do table foster parenting 
like this at all.)


In many implementations (particularly when strings are immutable, e.g. 
in Python), constructing the string "abcd..." by concatenating lots of 
individual characters one at a time has cost O(n^2) in the length of the 
string.

That problem can (I think) be avoided in the first case without too much 
trouble: the parser's "append text node to current node" method can 
build up a list of strings, and then they can be flushed (concatenate 
all strings, append result to current node) once the parser calls a 
different method ("append element to current node" etc). But I can't see 
an adequately efficient way to avoid the problem in the table case, 
particularly if scripts are allowed to observe and modify the DOM as it 
is being parsed.

So this is a DOS vulnerability, since an attacker can do 'n' work 
sending a document to someone and cause them to do O(n^2) work parsing 
it (even without scripting). Rather than relying on implementors to 
notice and fix this problem in incompatible ways, the spec should be 
changed to require coalescence only in the cases where the parser is 
consecutively appending text nodes to the current node and not in any 
other cases.

Hence you would get parse trees like:

  a</li>b</li>c</li>d...
  | <html>
  |   <head>
  |   <body>
  |     "abcd..."

  a<table>b<td></td>c<td></td>d...
  | <html>
  |   <head>
  |   <body>
  |     "a"
  |     "b"
  |     "c"
  |     "d..."
  |     <table> etc

  a<script id=s>
   var s = document.getElementById('s');
   s.parentNode.removeChild(s);
  </script>b
  | <html>
  |   <head>
  |   <body>
  |     "a"
  |     "b"

-- 
Philip Taylor
pjt47@cam.ac.uk

Received on Saturday, 31 May 2008 15:46:36 UTC