Re: Speculative tokenization and foreign content from Henri Sivonen on 2008-12-10 (public-html@w3.org from December 2008)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Wed, 10 Dec 2008 15:45:13 -0800
To: Sam Ruby <rubys@us.ibm.com>
Cc: HTML WG <public-html@w3.org>
Message-Id: <6A41B8FB-2B4C-4B97-9646-949C0F918DF1@iki.fi>

On Dec 10, 2008, at 15:13, Sam Ruby wrote:
> Just so I'm not misunderstanding, the purpose of speculative  
> evaluation is to optimize for the normal case; as long as it doesn't  
> impose a severe performance penalty on the non-normal case, on  
> balance, all is good.  Right?
>
Right. The first goal is to let linked resource loads start early even
when, without speculation, parsing would block on a script. This aspect
is implemented on Gecko trunk (and in IE8 and WebKit, I gather). The
new secondary goal is to avoid throwing away whatever else could be
learned while working towards the first goal of speculation.
> You already are willing to take the risk that document.write  
> completes on a token boundary.
>
The risk is that document.write *doesn't* complete at a token
boundary, but yes, document.write is a risk that we must be willing to
take. (The assumption here is that most document.writes are sane in
the sense that they aren't nested and do complete at a token boundary.)
> Initially, couldn't you make the conservative assumption that the  
> document.write doesn't introduce any unmatched math or svg tags?  In  
> other words, if the document.write causes new math context to be  
> opened, or a pre-existing svg context to be closed, you would treat  
> this condition the same way you treat document.write which finishes  
> on a non-token boundary.
>

document.write is not the problem here. There's a problem with  
speculating past <svg> ... <style> even when no document.writes occur.

If we arrive at <style> without seeing <svg> or <math> before it, we
know for sure that the tokenizer goes into the CDATA variant of the
data state next. However, if we see a <style> start tag after having
seen <svg> or <math>, we don't (trivially) know whether actually
performing the tree building would have bailed out of foreign content
before reaching <style>. Therefore, we don't know if the tokenizer
should go into the CDATA or PCDATA variant of the data state for
continued speculation.
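
To make the case split concrete, here is a minimal sketch in Python.
It is not how any browser implements this; the function name, the
state labels and the idea of passing in a flat list of start tag
names are all assumptions made for illustration.

```python
# Hypothetical sketch: names and state labels are illustrative only.
FOREIGN_ROOTS = {"svg", "math"}


def state_after_style(start_tags_seen):
    """Guess the tokenizer state that follows a <style> start tag.

    start_tags_seen: lowercase names of the start tags the speculative
    tokenizer has seen before reaching <style>.
    """
    if FOREIGN_ROOTS.isdisjoint(start_tags_seen):
        # No <svg> or <math> so far: <style> is certainly an HTML element.
        return "CDATA"
    # Foreign content may or may not still be open; only the tree
    # builder could tell, and speculation runs without it.
    return "UNKNOWN"
```

The "UNKNOWN" result is the interesting one: it marks the point where
the tokenizer alone cannot choose between the CDATA and PCDATA
variants.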

The obvious course of action is to stop saving the tokens from that
point onwards, while still scanning (with less accuracy) for more src
values to GET, but it would be nice to be able to do better.
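
A rough sketch of that fallback, again in Python and again purely
illustrative: the (name, attrs) token shape, the function name and
the choice to keep the <style> start tag itself as the last saved
token are my assumptions, not anything taken from Gecko.

```python
# Hypothetical sketch of "stop saving tokens, keep collecting src values".
def speculate(start_tags):
    """start_tags: iterable of (name, attrs) pairs, in document order."""
    saved = []          # tokens we would trust and reuse later
    urls = []           # src values worth fetching early
    seen_foreign = False
    saving = True
    for name, attrs in start_tags:
        if name in ("svg", "math"):
            seen_foreign = True
        if "src" in attrs:
            # Prefetching continues even after we stop trusting tokens.
            urls.append(attrs["src"])
        if saving:
            saved.append((name, attrs))
            if name == "style" and seen_foreign:
                # The tokenizer state after this tag is uncertain, so
                # nothing past this point is saved.
                saving = False
    return saved, urls
```

For example, given an img with a src, then svg, style and a script
with a src, both URLs are collected but only the first three tokens
are kept.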

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Wednesday, 10 December 2008 23:45:58 UTC