Re: Tree construction: Coalescing text nodes from Henri Sivonen on 2009-11-11 (public-html@w3.org from November 2009)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Wed, 11 Nov 2009 14:31:33 +0200
To: Geoffrey Sneddon <gsneddon@opera.com>
Cc: Ian Hickson <ian@hixie.ch>, public-html@w3.org, pjt47@cam.ac.uk
Message-Id: <04051A0C-F24F-433A-9C3A-AC85408C8CCC@iki.fi>

On Nov 11, 2009, at 14:12, Geoffrey Sneddon wrote:

> However, given an implementation like that of all three ports of html5lib (Python, PHP, Ruby), the following would equally lead to O(n^2) behaviour given the above conditions (i.e., a immutable string type like that of Python):
> 
> a</x>a</x>a</x>a</x>a</x>a</x>a</x>a</x>a</x>a</x>a</x>a</x>a etc.

Because you flush text upon token as opposed to flushing text upon append? Or because the accumulation buffer in the tree builder is an immutable string?

Surely the issue of a mutable accumulation buffer has to be a solved problem in these languages.

> All that the above change to the spec did was require this issue to not exist in the case of using a DOM with text foster parented from tables. This issue still exists in plenty of other cases where there are tokens outputted that are not added to the document.

Only if the tree builder flushes lazily upon non-text token instead of flushing lazily upon non-text insertion. (Well, you are still left with uselessly split text nodes in the AAA if you are not careful. Apparently I haven't been careful enough there.)

> Also note that fixing it for pure implementations of the spec algorithm does not fix all cases, as html5lib still outputs multiple character tokens and concatenates them in cases such as:
> 
> a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a<>a etc.

You can fix this by tokenizing the input in chunks and using the knowledge that you never emit more code units per chunk (at least in UTF-16) than the size of the chunk. You can then expand your mutable buffers for the worst case at the start of the tokenizer chunk by resizing them to have as much free space as the tokenizer chunk is long. This way you resize buffers at most once per chunk.

> As reported in <http://www.w3.org/Bugs/Public/show_bug.cgi?id=8239>, it is actually impossible for html5lib to implement the spec in all cases, as it is possible to use it with tree models that have no concept of adjacent text nodes (such as ones that are widely used in Python like ElementTree).

The failure of putting adjacent text nodes in ElementTree probably isn't interop-sensitive even if text nodes in browsers that support scripting may be. On the flip side, the V.nu streaming SAX mode splits text nodes into arbitrary calls to the characters() callback (in accordance with the SAX API contract).

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Wednesday, 11 November 2009 12:32:15 UTC