Re: Conditional branch in tree builder based on DOM state from Jonas Sicking on 2009-01-02 (public-html@w3.org from January 2009)

From: Jonas Sicking <jonas@sicking.cc>
Date: Fri, 2 Jan 2009 14:09:55 +0100
To: "Ian Hickson" <ian@hixie.ch>
Cc: "Henri Sivonen" <hsivonen@iki.fi>, "HTMLWG WG" <public-html@w3.org>
Message-ID: <63df84f0901020509t2366ffafp5f5f85cac5cefc80@mail.gmail.com>

On Thu, Dec 25, 2008 at 11:59 AM, Ian Hickson <ian@hixie.ch> wrote:
> On Tue, 9 Dec 2008, Henri Sivonen wrote:
>>
>> We had a meeting about moving parsing off the main thread in Gecko. The
>> idea is to make the tree builder output a sequence of tree mutations on
>> a dedicated thread and to actuate the mutations on the main thread.
>>
>> It is then a problem if the sequence of mutations is conditional on
>> information read synchronously from the DOM. There is one place in the
>> HTML5 parsing algorithm where this problem arises and cannot be remedied
>> by deferring the conditional check itself to the mutation actuation
>> time: step 7.5. of the adoption agency algorithm. The changes to the
>> internal state of the tree builder depend on whether a node has
>> children.
>>
>> The problem would go away if instead of checking whether the node has
>> children in the DOM, there were a 'has children' flag on the tree
>> builder stack tracking whether the node has *parser-inserted* children.
>>
>> The spec change suggestion, therefore, is making step 7.5. of the AAA
>> read "If node has any *parser-inserted* children, perform a shallow
>> clone of node——".
>
> I'm vaguely convinced that this wouldn't introduce the possibility of
> crashers in weird cases with hostile scripts, but as I was looking at this
> a second concern came up.
>
> While what you describe would make your particular implementation strategy
> easier, it would in fact complicate implementations that support script
> and operate on the DOM directly (i.e. all script-aware implementations
> today).
>
> Instead, I recommend passing along an _assertion_ along with the
> mutations, along the lines of "these elements have children" and "these
> elements don't have children", which the main thread can quickly check. If
> the assertions turn out to be false, which will only happen in rare caes,
> then the thread can synchronise with the main thread, getting itself
> updated on what nodes have children, and then try again from that point.

I agree that this would complicate a parser implementation that
doesn't do speculative parsing while blocked for <script>s. However
the extra complication is very small compared to the complication
added to speculative parsers. Additionally, it does simplify the
interface between the parser and the DOM since with this and the other
change that has been proposed there is no need to read data back from
the DOM. This interface simplification is likely about the same as
adding an extra flag to each node in the tag stack.

Also, any competitive browser is going to have to do speculative
parsing. The performance gains from doing so is so substantial that
not doing it is not really an option. So while you'd have slightly
simpler HTML parsing in consumers other than browsers, you will make
all browsers more complex. The alternative for browsers is likely to
not use a real tokenizer when doing speculative parsing which means
wasting CPU cycles. Also, something that isn't a browser is likely to
not run scripts, so such a parser could optimize to use the algorithm
in the current spec if it really is an advantage (see above regarding
complexities of feeding data back from the DOM).

Finally, using the algorithm that Henri is suggesting will yield
somewhat more logical results IMHO. Currently if you are modifying the
DOM as it is being loaded you can have race conditions where the
resulting DOM depends on when you make a seemingly unrelated DOM
mutation. If you insert your content before the parser reads from the
DOM you get one result, if you do it after you get another.

/ Jonas

Received on Friday, 2 January 2009 13:10:32 UTC