Re: Conditional branch in tree builder based on DOM state from Jonas Sicking on 2009-01-03 (public-html@w3.org from January 2009)

From: Jonas Sicking <jonas@sicking.cc>
Date: Sat, 3 Jan 2009 18:47:40 +0100
To: "Maciej Stachowiak" <mjs@apple.com>
Cc: "Ian Hickson" <ian@hixie.ch>, "Henri Sivonen" <hsivonen@iki.fi>, "HTMLWG WG" <public-html@w3.org>
Message-ID: <63df84f0901030947k2d57dac2ued4806cb8da399dc@mail.gmail.com>
On Sat, Jan 3, 2009 at 6:13 PM, Maciej Stachowiak <mjs@apple.com> wrote:
>
> On Jan 2, 2009, at 6:31 PM, Jonas Sicking wrote:
>
>> On Fri, Jan 2, 2009 at 6:21 PM, Maciej Stachowiak <mjs@apple.com> wrote:
>>>
>>> On Jan 2, 2009, at 5:09 AM, Jonas Sicking wrote:
>>>
>>>>
>>>> I agree that this would complicate a parser implementation that
>>>> doesn't do speculative parsing while blocked for <script>s. However
>>>> the extra complication is very small compared to the complication
>>>> added to speculative parsers. Additionally, it does simplify the
>>>> interface between the parser and the DOM since with this and the other
>>>> change that has been proposed there is no need to read data back from
>>>> the DOM. This interface simplification is likely about the same as
>>>> adding an extra flag to each node in the tag stack.
>>>>
>>>> Also, any competitive browser is going to have to do speculative
>>>> parsing. The performance gains from doing so is so substantial that
>>>> not doing it is not really an option.
>>>
>>> WebKit trunk does speculative parsing but we don't require anything like
>>> this (since we don't care about reusing the tokens - tokenizing is cheap
>>> -
>>> and we haven't cared so far about multithreaded parsing). In fact it
>>> seems
>>> to me it would make things more complicated if we were required to
>>> distinguish between parser-added and script-added children at parse time.
>>
>> Sorry, I got mixed up regarding why we made this request.
>>
>> Do note though that you don't need to distinguish between parser-added
>> children and script-added children in the node though. All you need to
>> do is add information to the stack of open elements regarding if the
>> parser has added children to the node or not. It's likely that the
>> stack already contains information other than the node itself, such as
>> if the node is a formatting node or not, so adding an extra bit should
>> be no work.
>
> I'll tentatively agree that iit doesn't sound like a big complication (we do
> indeed have info besides the node in the stack of open elements). I would
> have to ask one of our parsing experts to be sure. One thing I wonder about:
> would this require some new behavior to be implemented for the case where a
> node has children, but not parser-added children?

Looking at the spec I tried to figure out what this whole 'check if
current node has children' thing is used for but couldn't immediately
figure it out, so I'll let Henri reply to this one.

>> There is a small amount of work required to set this extra bit to true
>> though any time the parser adds a child to the node. However this
>> amount of work hardly seems enough to sacrifice the ability to do
>> off-main-thread parsing.
>
> It does sound like a good goal to support doing significant pieces of work
> on a separate thread, if the changes required are minor. In this case
> though, it sounds like off-the-main-thread parsing can at least in theory be
> done without any changes to the algorithm, though in a slightly roundabout
> way.

I think the cost in both code complexity and performance are fairly
substantial. You'd end up paying some performance cost even in the
(presumably very rare case) when DOM nodes are inserted since you have
to record rewind information just in case. You'd have to somehow be
able to rewind to the state exactly where the decision is made, maybe
by cloning the whole parser as well as start recording all the data
that is tokenized, or recording the stream of tokens, so that it can
be replayed in case of a rewind.

All this would have to be done any time you hit the situation where
you are reading from the DOM, whether you end up needing to rewind or
not.

There is a big risk that this would be too expensive in a single-core
case where the benefits of multi threaded parsing is much smaller. So
potentially you'd end up having to run the parser in two modes, one
for multi-core machines and one for single core, adding yet more code.
I would expect that to be the case.

> One thing I am uncertain of here is the actual benefit of parsing on a
> separate thread. Do you have any performance results from your prototoype
> work so far?

We don't have any running code yet. We're currently in the process of
getting our existing parser to run off a separate thread, once that is
done we'll have some data. I'll keep you posted.

/ Jonas
Received on Saturday, 3 January 2009 18:01:42 UTC