Re: Speculative tokenization and foreign content from Jonas Sicking on 2008-12-11 (public-html@w3.org from December 2008)

From: Jonas Sicking <jonas@sicking.cc>
Date: Thu, 11 Dec 2008 09:59:04 -0800
To: Sam Ruby <rubys@us.ibm.com>
CC: Henri Sivonen <hsivonen@iki.fi>, HTML WG <public-html@w3.org>
Message-ID: <494154E8.8090705@sicking.cc>

Sam Ruby wrote:
> Henri Sivonen <hsivonen@iki.fi> wrote on 12/10/2008 06:45:13 PM:
>  >
>  > document.write is not the problem here. There's a problem with  
>  > speculating past <svg> ... <style> even when no document.writes occur.
>  >
>  > If we arrive at <style> without seeing <svg> or <math> before it, we  
>  > know for sure that the tokenizer goes into CDATA variant of the data  
>  > state next. However, if we see a <style> start tag after having seen  
>  > <svg> or <math>, we don't (trivially) know if actually performing the  
>  > tree building would have bailed out of foreign content before reaching  
>  > <style>. Therefore, we don't know if the tokenizer should go into the  
>  > CDATA or PCDATA variant of the data state for continued speculation.
>  >
>  > The obvious course of action is to stop saving the tokens from that  
>  > point onwards even if still looking for more src values to GET with  
>  > less accuracy, but it would be nice to be able to do better.
> 
> Speculative evaluation of instruction streams on a modern CPU given the 
> presence of conditional branch instructions doesn't mean determining 
> with certainty the correct path every time, it simply means getting it 
> right enough of the time to make a difference.
> 
> Even if you can't reliably determine if you "would have bailed out", you 
> might be able to do better than the rather pessimistic approach 
> mentioned above.  Considerably better.  The design of HTML 5 is focused 
> on robustness, even in the face of errors, and even if those errors are 
> relatively infrequent.  A simple approximation: <svg> or <math> starts 
> foreign content, </svg> and </math> stops foreign content may be right 
> enough of the time to make a difference.  You still would have to decide 
> what to do with nesting, and how to detect whether the prediction was 
> incorrect (i.e., any time after the tree builder bails even once, it 
> must stop trusting the token stream at the point it encounters a <style> 
> tag).

Indeed. This problem does not make speculative parsing impossible, nor 
does it make it impossible to guess right, most of the time, about which 
mode the tokenizer should be put in.

However, the harder we try to speculate, the more complex and slow our 
code will be. With the strategy you describe above, we'll also have to 
figure out when our speculation was wrong and throw away an appropriate 
number of tokens. For example, we'd need to know that any time there is 
a <foreignObject>, we potentially have to throw away all tokens after it.
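
To make the cost concrete, here is a rough sketch of the approximation 
being discussed: count <svg>/<math> start and end tags, guess the mode 
that follows each <style> start tag, and stop trusting the guesses once 
a <foreignObject> shows up. All names here are made up for illustration, 
the tag regex is far cruder than a real tokenizer (no comments, no ">" 
inside attribute values, no self-closing handling), and it is not taken 
from any actual implementation.

```python
import re

# Hypothetical, deliberately crude tag matcher; a real tokenizer is a
# state machine and handles many cases this regex does not.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")
FOREIGN_ROOTS = {"svg", "math"}


def speculate_style_modes(html):
    """Guess the tokenizer mode after each <style> start tag.

    Returns a list of (offset, mode, trusted) tuples, where mode is
    "CDATA" (outside foreign content) or "PCDATA" (inside it), and
    trusted is False once a <foreignObject> has been seen, because the
    tree builder may have left foreign content without us knowing.
    """
    depth = 0        # approximate nesting depth of <svg>/<math>
    trusted = True   # becomes False after the first <foreignObject>
    guesses = []
    pos = 0
    lowered = html.lower()
    while True:
        m = TAG_RE.search(html, pos)
        if m is None:
            break
        closing = m.group(1) == "/"
        name = m.group(2).lower()
        pos = m.end()
        if name in FOREIGN_ROOTS:
            depth = max(depth - 1, 0) if closing else depth + 1
        elif name == "foreignobject" and depth > 0 and not closing:
            trusted = False
        elif name == "style" and not closing:
            mode = "PCDATA" if depth > 0 else "CDATA"
            guesses.append((m.start(), mode, trusted))
            if mode == "CDATA":
                # In CDATA mode nothing is a tag until </style, so skip
                # ahead instead of matching "<" inside the style sheet.
                end = lowered.find("</style", pos)
                if end == -1:
                    break
                pos = end
    return guesses
```

Every guess that comes back with trusted set to False is one where the 
saved tokens might have to be thrown away, which is the bookkeeping I'm 
worried about.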

Ideally, the tokenizer can be made independent of the parser, in which 
case we can be very sure we are tokenizing correctly (the only exception 
is if document.write writes out half a token, in which case all tokens 
would need to be thrown away).
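
For the document.write exception, the handling could be about as simple 
as the sketch below. The class and function names are invented for this 
example, and the "half a token" test is only a stand-in (it looks for an 
unclosed "<" and ignores comments, character references and ">" inside 
attribute values).

```python
def ends_mid_token(text):
    """Crude check: does the written text end inside an unclosed tag?"""
    return text.rfind("<") > text.rfind(">")


class SpeculativeBuffer:
    """Holds tokens produced ahead of the parser (hypothetical helper)."""

    def __init__(self):
        self.tokens = []
        self.valid = True

    def add(self, token):
        # Only keep tokens while the speculation is still believed good.
        if self.valid:
            self.tokens.append(token)

    def on_document_write(self, written):
        # If the script wrote half a token, everything tokenized past
        # this point was tokenized from the wrong state: drop it all.
        if ends_mid_token(written):
            self.tokens.clear()
            self.valid = False
```

A write such as "<p>hi</p>" leaves the buffer alone, while a write such 
as "<scr" empties it.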

However, if that can't be done, the more we can reduce the cases where 
the tokenizer depends on the parser, the less we'll pay in implementation 
complexity and performance.

/ Jonas

Received on Thursday, 11 December 2008 17:57:43 UTC