Re: Speculative tokenization and foreign content

On Wed, 10 Dec 2008 20:40:48 +0100, Henri Sivonen <hsivonen@iki.fi> wrote:

> When the HTML parser blocks on a script, new and upcoming browsers  
> speculatively scan the stream forward for other script (or potentially  
> image and style) src URLs to load.
>
> Going forward, we'd like to do the speculation using the real HTML5  
> tokenizer in Gecko and keep the tokens. That is, the speculatively  
> created tokens would only be thrown away if a script does a  
> document.write() that a) leaves the tokenizer in a state that indicates  
> that document.write didn't finish at a token boundary or b)  
> document.write() returns before its argument has been fully tokenized.  
> (The current speculative parsing thread on Gecko trunk wastes the tokens  
> it creates.)
>
> We've identified a problem with speculating past <math> or <svg>: When  
> the tree builder is in foreign content, the tokenization phase works  
> differently--specifically, the content after <![CDATA[, <script>,  
> <style>, <title>, <textarea>, <xmp>, <iframe>, <noembed>, <noframes>,  
> <noscript> and <plaintext> is tokenized differently. However, the  
> tokenizer can't know when foreign content ends without letting the tree  
> builder run synchronously. Making <xmp>, <iframe>, <noembed>,  
> <noframes>, <noscript> and <plaintext> break out of foreign content  
> wouldn't make the issue go away for the particularly probable cases of  
> <![CDATA[, <script>, <style>, <title> and <textarea>.
>
> Letting the tree builder run speculatively isn't a good solution,  
> because it would raise the stakes to throwing away speculative tree  
> builder output on *any* document.write(). Making the tree builder exit  
> foreign content based on simpler rules doesn't appear to be a good idea  
> due to the very reason why the exiting is now specified the way it is  
> (for resilience against cargo cult junk).
>
> We haven't come up with a good way to speculate past <![CDATA[,  
> <script>, <style>, <title>, <textarea>, <xmp>, <iframe>, <noembed>,  
> <noframes>, <noscript> and <plaintext> when <math> or <svg> has been  
> seen.
>
> This email doesn't contain a spec change suggestion. Instead, the  
> purpose of this email is to ask for ideas on how to solve this issue.  
> Does anyone have ideas on how to deal with this?

In http://lists.w3.org/Archives/Public/public-html/2007Oct/0161.html I  
speculated (no pun intended) about having the above-mentioned elements  
keep their (R)CDATAness even in foreign content, and not supporting  
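<![CDATA[ at all. However, I guess the self-closing flag would still be a
problem (<svg> ... <style/> ...), since the tokenizer would need to know
whether it is in foreign content to decide if <style/> is already closed.
Perhaps the self-closing flag shouldn't work on (R)CDATA elements in
foreign content?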
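
To make the idea concrete, here is a rough sketch (Python, purely
illustrative) of a tokenizer choosing its content model from the tag name
alone, without asking the tree builder whether it is in foreign content.
The function name, the set names and the string labels are made up for
this example; they are not Gecko's or the spec's. The element grouping is
my assumption based on the list quoted above.

```python
# Hypothetical sketch: names and labels are illustrative, not from Gecko or the spec.
# Assumption: <noscript> is treated as CDATA, i.e. scripting is enabled.
RCDATA_ELEMENTS = {"title", "textarea"}
CDATA_ELEMENTS = {"script", "style", "xmp", "iframe",
                  "noembed", "noframes", "noscript"}


def next_content_model(tag_name: str, self_closing: bool) -> str:
    """Pick the content model that follows a start tag.

    The decision depends only on the tag name, so a speculative tokenizer
    never has to know whether the tree builder is in foreign content.
    The self_closing flag is deliberately ignored for these elements.
    """
    name = tag_name.lower()  # <textArea> and <textarea> are treated alike
    if name in RCDATA_ELEMENTS:
        return "RCDATA"
    if name in CDATA_ELEMENTS:
        return "CDATA"
    if name == "plaintext":
        return "PLAINTEXT"
    return "PCDATA"
```

With this rule, <svg> ... <style/> ... leaves the tokenizer in CDATA after
<style/>, exactly as it would outside <svg>.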

Going with this makes SVG less copy-pastable into HTML, which is not good.
OTOH, where it does break, authors have to fix up the markup (e.g. replace
<script .../> with <script ...></script>), and the fixed markup would then
also break less in legacy UAs.
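
For what it's worth, the fix-up could be done mechanically. Below is a
naive sketch in Python; the function name and the element list are my own
choices for illustration, and the regular expression is an assumption that
only holds for simple markup (it does not cope with ">" or "/>" inside
attribute values).

```python
import re

# Hypothetical helper for illustration only. Matches a self-closed start tag
# of one of the listed elements and captures its name and attribute text.
_SELF_CLOSED = re.compile(
    r"<(script|style|title|textarea)\b([^<>]*?)\s*/>",
    re.IGNORECASE,
)


def expand_self_closing(markup: str) -> str:
    """Rewrite <script .../> as <script ...></script>, keeping attributes
    and the original case of the element name."""
    return _SELF_CLOSED.sub(
        lambda m: f"<{m.group(1)}{m.group(2)}></{m.group(1)}>",
        markup,
    )
```

Other elements, such as <path .../>, are left untouched.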

Going with this also means that <textArea> in SVG can't contain elements  
declaratively, which is not good.

-- 
Simon Pieters
Opera Software

Received on Thursday, 11 December 2008 09:19:12 UTC