Re: Tree construction: Coalescing text nodes

Henri Sivonen wrote:
> On Nov 11, 2009, at 14:12, Geoffrey Sneddon wrote:
> 
>> However, given an implementation like that of all three ports of html5lib (Python, PHP, Ruby), the following would equally lead to O(n^2) behaviour given the above conditions (i.e., a immutable string type like that of Python):
>>
>> a</x>a</x>a</x>a</x>a</x>a</x>a</x>a</x>a</x>a</x>a</x>a</x>a etc.
> 
> Because you flush text upon token as opposed to flushing text upon append? Or because the accumulation buffer in the tree builder is an immutable string?
> 
> Surely the issue of a mutable accumulation buffer has to be a solved problem in these languages.

This is a problem in html5lib because we flush text upon token, and this 
is what a naïve implementation of the current spec will probably lead to 
(I expect the majority of people, if they were to try and write a HTML 5 
parser, who had no experience in writing parsers before, as is the case 
with html5lib, would make a similar mistake), so although the spec 
explicitly does handling to avoid one case of string concatenation, 
unless you carefully think about possible attacks, it's still trivial to 
make it possible to attack in other ways, such as the above.

> 
>> All that the above change to the spec did was require this issue to not exist in the case of using a DOM with text foster parented from tables. This issue still exists in plenty of other cases where there are tokens outputted that are not added to the document.
> 
> Only if the tree builder flushes lazily upon non-text token instead of flushing lazily upon non-text insertion. (Well, you are still left with uselessly split text nodes in the AAA if you are not careful. Apparently I haven't been careful enough there.)

Which is likely would will happen with a naïve implementation of the spec.

> 
>> Also note that fixing it for pure implementations of the spec algorithm does not fix all cases, as html5lib still outputs multiple character tokens and concatenates them in cases such as:
>>
>> a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a etc.
> 
> You can fix this by tokenizing the input in chunks and using the knowledge that you never emit more code units per chunk (at least in UTF-16) than the size of the chunk. You can then expand your mutable buffers for the worst case at the start of the tokenizer chunk by resizing them to have as much free space as the tokenizer chunk is long. This way you resize buffers at most once per chunk.

Indeed, or in the above case, if you re-write the algorithm to always 
move forward in the input stream (and never move back and reprocess a 
character), that can easily become a single string token.

>> As reported in <http://www.w3.org/Bugs/Public/show_bug.cgi?id=8239>, it is actually impossible for html5lib to implement the spec in all cases, as it is possible to use it with tree models that have no concept of adjacent text nodes (such as ones that are widely used in Python like ElementTree).
> 
> The failure of putting adjacent text nodes in ElementTree probably isn't interop-sensitive even if text nodes in browsers that support scripting may be. On the flip side, the V.nu streaming SAX mode splits text nodes into arbitrary calls to the characters() callback (in accordance with the SAX API contract).

But is splitting arbitrarily even conforming?

-- 
Geoffrey Sneddon — Opera Software
<http://gsnedders.com/>
<http://www.opera.com/>

Received on Friday, 13 November 2009 09:35:45 UTC