RE: Speculative tokenization and foreign content from Travis Leithead on 2008-12-10 (public-html@w3.org from December 2008)

From: Travis Leithead <Travis.Leithead@microsoft.com>
Date: Wed, 10 Dec 2008 12:03:37 -0800
To: Henri Sivonen <hsivonen@iki.fi>, HTML WG <public-html@w3.org>
Message-ID: <0003CB8B8FE2154EB50431DB2B8F69C019483FDEDA@NA-EXMSG-W601.wingroup.windeploy.ntd>

From implementation experience in IE8, we don't tokenize the speculative stream--though this would be another interesting performance option.

-----Original Message-----
From: public-html-request@w3.org [mailto:public-html-request@w3.org] On Behalf Of Henri Sivonen
Sent: Wednesday, December 10, 2008 11:41 AM
To: HTML WG
Subject: Speculative tokenization and foreign content

When the HTML parser blocks on a script, new and upcoming browsers
speculatively scan the stream forward for other script (or potentially
image and style) src URLs to load.
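
As a rough illustration of such a forward scan (a toy sketch, not how any
browser implements it), the following uses Python's standard html.parser
module. The names PreloadScanner and scan_ahead are hypothetical, and the
choice of which elements to collect from is an assumption.

```python
from html.parser import HTMLParser


class PreloadScanner(HTMLParser):
    """Toy stand-in for a speculative scanner (hypothetical name)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # Collect script and image URLs from src attributes.
        if tag in ("script", "img") and attr.get("src"):
            self.urls.append(attr["src"])
        # Collect stylesheet URLs from <link rel="stylesheet" href="...">.
        elif tag == "link" and attr.get("href"):
            rel = (attr.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self.urls.append(attr["href"])


def scan_ahead(remaining_stream):
    """Return candidate URLs found in the not-yet-parsed part of the stream."""
    scanner = PreloadScanner()
    scanner.feed(remaining_stream)
    return scanner.urls
```

Calling scan_ahead() on the markup that follows the blocking script yields
the URLs a browser could start fetching while the script runs.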

Going forward, we'd like to do the speculation using the real HTML5
tokenizer in Gecko and keep the tokens. That is, the speculatively
created tokens would only be thrown away if a) a script's
document.write() leaves the tokenizer in a state that indicates that
the write didn't finish at a token boundary or b) document.write()
returns before its argument has been fully tokenized.
(The current speculative parsing thread on Gecko trunk wastes the
tokens it creates.)
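
The keep-or-discard rule described above could be sketched as follows. This
is a minimal sketch, not Gecko code: the State enum and the function name are
hypothetical, and treating the data state as the only "token boundary" state
is a simplifying assumption.

```python
from enum import Enum, auto


class State(Enum):
    """A few illustrative tokenizer states (hypothetical subset)."""
    DATA = auto()
    TAG_OPEN = auto()
    ATTRIBUTE_VALUE = auto()
    COMMENT = auto()


def can_keep_speculation(state_after_write, write_fully_tokenized):
    """Decide whether speculatively created tokens may be kept.

    Assumption: the tokenizer is at a token boundary only in the data state.
    """
    at_token_boundary = state_after_write is State.DATA
    # Condition a) in the text: the write must end at a token boundary.
    # Condition b) in the text: the argument must be fully tokenized.
    return at_token_boundary and write_fully_tokenized
```

For example, a document.write("<div") would leave the tokenizer mid-tag, so
the speculative tokens would have to be thrown away.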

We've identified a problem with speculating past <math> or <svg>: When
the tree builder is in foreign content, the tokenization phase works
differently--specifically, the content after <![CDATA[, <script>,
<style>, <title>, <textarea>, <xmp>, <iframe>, <noembed>, <noframes>,
<noscript> and <plaintext> is tokenized differently. However, the
tokenizer can't know when foreign content ends without letting the
tree builder run synchronously. Making <xmp>, <iframe>, <noembed>,
<noframes>, <noscript> and <plaintext> break out of foreign content
wouldn't make the issue go away for the particularly probable cases of
<![CDATA[, <script>, <style>, <title> and <textarea>.
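
To make the ambiguity concrete, here is a toy tokenizer (an assumption-laden
sketch, far simpler than the HTML5 tokenizer) that takes a hypothetical
in_foreign_content flag. It leaves out <![CDATA[ and <plaintext> and treats
every other listed element the same way, which is a simplification.

```python
import re

# Elements whose content this toy treats as text outside foreign content.
RAW_TEXT_ELEMENTS = {
    "script", "style", "title", "textarea", "xmp",
    "iframe", "noembed", "noframes", "noscript",
}

TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def tokenize(text, in_foreign_content):
    """Return a list of (kind, value) tokens for a markup fragment."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TAG.search(text, pos)
        if m is None:
            tokens.append(("text", text[pos:]))
            break
        if m.start() > pos:
            tokens.append(("text", text[pos:m.start()]))
        is_end = m.group(1) == "/"
        name = m.group(2).lower()
        tokens.append(("end" if is_end else "start", name))
        pos = m.end()
        # Outside foreign content, swallow everything up to the matching
        # end tag as text; inside foreign content, keep tokenizing normally.
        if not is_end and not in_foreign_content and name in RAW_TEXT_ELEMENTS:
            close = text.lower().find("</" + name, pos)
            end = len(text) if close == -1 else close
            if end > pos:
                tokens.append(("text", text[pos:end]))
            pos = end
    return tokens
```

The same bytes produce different token streams depending on the flag, and
only the tree builder knows the flag's value at any given point, which is
why a tokenizer running ahead on its own cannot pick the right behavior.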

Letting the tree builder run speculatively isn't a good solution,
because it would raise the stakes: speculative tree builder output
would have to be thrown away on *any* document.write(). Making the
tree builder exit foreign content based on simpler rules doesn't
appear to be a good idea either, for the very reason the exiting is
now specified the way it is (resilience against cargo cult junk).

We haven't come up with a good way to speculate past <![CDATA[,
<script>, <style>, <title>, <textarea>, <xmp>, <iframe>, <noembed>,
<noframes>, <noscript> and <plaintext> when <math> or <svg> has been
seen.

This email doesn't contain a spec change suggestion. Instead, the
purpose of this email is to ask for ideas on how to solve this issue.
Does anyone have ideas on how to deal with this?

--
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Wednesday, 10 December 2008 20:03:43 UTC