Speculative tokenization and foreign content

From: Henri Sivonen <hsivonen@iki.fi>
Date: Wed, 10 Dec 2008 11:40:48 -0800
Message-Id: <2D7A9E91-0CAE-4C21-B800-A9DBECC2542E@iki.fi>
To: HTML WG <public-html@w3.org>

When the HTML parser blocks on a script, new and upcoming browsers  
speculatively scan the stream forward for other script (or potentially  
image and style) src URLs to load.

Going forward, we'd like to do the speculation using the real HTML5  
tokenizer in Gecko and keep the tokens. That is, the speculatively  
created tokens would only be thrown away if a script does a
document.write() that a) leaves the tokenizer in a state that
indicates that the write didn't finish at a token boundary or b)
returns before its argument has been fully tokenized.
(The current speculative parsing thread on Gecko trunk wastes the  
tokens it creates.)
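
The discard rule above can be sketched as a small predicate. This is an illustration under assumptions, not Gecko code: WriteOutcome, must_discard_speculation and the state name "data" are hypothetical, and treating a single state as the only token boundary is a simplification.

```python
from dataclasses import dataclass

# Hypothetical name for the tokenizer state that sits between tokens.
DATA_STATE = "data"


@dataclass
class WriteOutcome:
    tokenizer_state: str   # state the tokenizer is left in after the write
    fully_tokenized: bool  # False if document.write() returned early


def must_discard_speculation(outcome: WriteOutcome) -> bool:
    """Return True if the speculatively created tokens can't be reused."""
    # Case a): the write stopped in the middle of a token.
    ended_mid_token = outcome.tokenizer_state != DATA_STATE
    # Case b): the write returned before its argument was fully tokenized.
    return ended_mid_token or not outcome.fully_tokenized
```

In every other case the tokens produced ahead of the script stay valid and can be fed to the tree builder as they are.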

We've identified a problem with speculating past <math> or <svg>: When  
the tree builder is in foreign content, the tokenization phase works  
differently--specifically, the content after <![CDATA[, <script>,  
<style>, <title>, <textarea>, <xmp>, <iframe>, <noembed>, <noframes>,  
<noscript> and <plaintext> is tokenized differently. However, the  
tokenizer can't know when foreign content ends without letting the  
tree builder run synchronously. Making <xmp>, <iframe>, <noembed>,  
<noframes>, <noscript> and <plaintext> break out of foreign content  
wouldn't make the issue go away for the particularly probable cases of  
<![CDATA[, <script>, <style>, <title> and <textarea>.
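
The dependency on tree builder state can be illustrated with a toy tokenizer. This is a sketch, not the HTML5 tokenization algorithm. The function tokenize and its in_foreign_content flag are hypothetical, and the sketch assumes, as a simplification, that in foreign content the content of these elements is tokenized as ordinary markup. It leaves out <![CDATA[ and <plaintext>.

```python
import re

# Elements whose content is taken as text rather than markup when the
# tree builder is not in foreign content (subset of the list above).
TEXT_ELEMENTS = {"script", "style", "title", "textarea", "xmp",
                 "iframe", "noembed", "noframes", "noscript"}

TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def tokenize(source, in_foreign_content):
    """Toy tokenizer whose output depends on a flag only the tree builder knows."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TAG.match(source, pos)
        if not match:
            # Plain text runs up to the next "<" (or the end of input).
            next_lt = source.find("<", pos + 1)
            end = len(source) if next_lt == -1 else next_lt
            tokens.append(("text", source[pos:end]))
            pos = end
            continue
        closing, name = match.group(1), match.group(2).lower()
        tokens.append(("end" if closing else "start", name))
        pos = match.end()
        if not closing and not in_foreign_content and name in TEXT_ELEMENTS:
            # Outside foreign content, swallow everything up to the
            # matching end tag as a single text token.
            end_tag = source.lower().find("</" + name, pos)
            end = len(source) if end_tag == -1 else end_tag
            if end > pos:
                tokens.append(("text", source[pos:end]))
            pos = end
    return tokens
```

With the input `<title>a<b>c</title>`, the flag decides whether `<b>` comes out as a start tag token or as part of the text, so a speculative tokenizer that guesses the flag wrong produces tokens that can't be kept.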

Letting the tree builder run speculatively isn't a good solution,
because *any* document.write() would then force us to throw away the
speculative tree builder output. Making the tree builder exit
foreign content based on simpler rules doesn't appear to be a good
idea either, for the very reason the exiting is now specified the way
it is (for resilience against cargo cult junk).

We haven't come up with a good way to speculate past <![CDATA[,  
<script>, <style>, <title>, <textarea>, <xmp>, <iframe>, <noembed>,  
<noframes>, <noscript> and <plaintext> when <math> or <svg> has been  
seen.

This email doesn't contain a spec change suggestion. Instead, its
purpose is to ask for ideas on how to solve this issue.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Wednesday, 10 December 2008 19:41:37 UTC