Re: Speculative tokenization and foreign content from Sam Ruby on 2008-12-10 (public-html@w3.org from December 2008)

From: Sam Ruby <rubys@us.ibm.com>
Date: Wed, 10 Dec 2008 18:13:40 -0500
To: Henri Sivonen <hsivonen@iki.fi>
Cc: HTML WG <public-html@w3.org>
Message-ID: <OF32845C46.C248B376-ON8525751B.007EAA37-8525751B.007F981A@us.ibm.com>

Henri Sivonen wrote on 12/10/2008 02:40:48 PM:
>
> When the HTML parser blocks on a script, new and upcoming browsers
> speculatively scan the stream forward for other script (or potentially
> image and style) src URLs to load.
>
> Going forward, we'd like to do the speculation using the real HTML5
> tokenizer in Gecko and keep the tokens. That is, the speculatively
> created tokens would only be thrown away if a script does a
> document.write() that a) leaves the tokenizer in a state that
> indicates that document.write didn't finish at a token boundary or b)
> document.write() returns before its argument has been fully tokenized.
> (The current speculative parsing thread on Gecko trunk wastes the
> tokens it creates.)
>
> We've identified a problem with speculating past <math> or <svg>: When
> the tree builder is in foreign content, the tokenization phase works
> differently--specifically, the content after <![CDATA[, <script>,
> <style>, <title>, <textarea>, <xmp>, <iframe>, <noembed>, <noframes>,
> <noscript> and <plaintext> is tokenized differently. However, the
> tokenizer can't know when foreign content ends without letting the
> tree builder run synchronously. Making <xmp>, <iframe>, <noembed>,
> <noframes>, <noscript> and <plaintext> break out of foreign content
> wouldn't make the issue go away for the particularly probable cases of
> <![CDATA[, <script>, <style>, <title> and <textarea>.
>
> Letting the tree builder run speculatively isn't a good solution,
> because it would raise the stakes to throwing away speculative tree
> builder output on *any* document.write(). Making the tree builder exit
> foreign content based on simpler rules doesn't appear to be a good
> idea due to the very reason why the exiting is now specified the way
> it is (for resilience against cargo cult junk).
>
> We haven't come up with a good way to speculate past <![CDATA[,
> <script>, <style>, <title>, <textarea>, <xmp>, <iframe>, <noembed>,
> <noframes>, <noscript> and <plaintext> when <math> or <svg> has been
> seen.
>
> This email doesn't contain a spec change suggestion. Instead, the
> purpose of this email is to ask for ideas on how to solve this issue.
> Does anyone have ideas on how to deal with this?

Just to check that I'm not misunderstanding: the purpose of speculative
evaluation is to optimize for the normal case. As long as it doesn't impose
a severe performance penalty on the non-normal case, on balance, all is
good. Right?

You are already willing to take the risk that document.write completes on a
token boundary.  Initially, couldn't you make the conservative assumption
that the document.write doesn't introduce any unmatched math or svg tags?
In other words, if the document.write causes a new math context to be
opened, or a pre-existing svg context to be closed, you would treat this
condition the same way you treat a document.write that does not finish on a
token boundary.
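
A minimal sketch of that check, to make the idea concrete. Everything here
is hypothetical and not taken from Gecko or the spec: the Token type, the
function names, and the simplification that only literal math/svg start and
end tags change the foreign content context (the real tree builder exits
foreign content under more involved rules).

```python
from dataclasses import dataclass

# Assumption: only these two elements open a foreign content context.
FOREIGN_ROOTS = {"math", "svg"}


@dataclass(frozen=True)
class Token:
    """Hypothetical token shape; a real tokenizer carries much more."""
    kind: str            # "start", "end", or "text"
    name: str = ""       # tag name, empty for text
    self_closing: bool = False


def foreign_balance(tokens):
    """Return (net, lowest) for the math/svg tags in `tokens`.

    net    > 0 means document.write left a math/svg element open.
    lowest < 0 means it closed a math/svg element it did not open.
    """
    depth = 0
    lowest = 0
    for tok in tokens:
        if tok.name not in FOREIGN_ROOTS:
            continue
        if tok.kind == "start" and not tok.self_closing:
            depth += 1
        elif tok.kind == "end":
            depth -= 1
            lowest = min(lowest, depth)
    return depth, lowest


def keep_speculation(written_tokens, ended_on_token_boundary):
    """Decide whether speculatively created tokens can be kept.

    Discard them if document.write stopped mid-token, or if it opened or
    closed a math/svg context that the speculation did not know about.
    """
    if not ended_on_token_boundary:
        return False
    net, lowest = foreign_balance(written_tokens)
    return net == 0 and lowest == 0
```

With this, a document.write of "<p>hi</p>" or of a complete "<svg></svg>"
keeps the speculative tokens, while one that writes a lone "<math>" or a
lone "</svg>" throws them away, exactly as if it had ended mid-token.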

- Sam Ruby


> --
> Henri Sivonen
> hsivonen@iki.fi
> http://hsivonen.iki.fi/

Received on Wednesday, 10 December 2008 23:14:43 UTC