Reparsing of comments, escaped text spans, and --!>

We were fixing our bugs regarding reparsing, but were a bit scared to fix  
reparsing of comments and escaped text spans, so I asked in #whatwg if  
someone could be kind enough to provide some data on the matter...

Philip` found 128 pages with open "<!--" out of ~130K pages, listed in  
http://philip.html5.org/data/pages-with-unclosed-comments.txt . I looked  
through the first 82 pages. 40 of those would work better if we reparse, 1
would work slightly worse, and the rest would be unaffected. Extrapolating
that ratio to all 128 pages gives roughly 62 pages, so about 0.05% of the
~130K pages would break if we didn't reparse.
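
For reference, the arithmetic behind that percentage can be checked with a few lines of Python. This is only a sketch: it assumes the 82 pages inspected by hand are representative of all 128, and the variable names are made up for illustration.

```python
# Figures quoted in the message above.
unclosed_pages = 128      # pages with an open "<!--"
total_pages = 130_000     # approximate size of the corpus
sampled = 82              # pages inspected by hand
better_with_reparse = 40  # sampled pages that work better with reparsing

# Assumption: the sampled pages are representative of all 128.
estimated_broken = unclosed_pages * better_with_reparse / sampled
share = estimated_broken / total_pages

print(f"{estimated_broken:.0f} pages, {share:.3%} of the corpus")
# prints "62 pages, 0.048% of the corpus"
```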

Opera currently doesn't reparse comments in limited/no quirks mode, and a
few of the pages listed below break in Opera because of that. (We still
reparse open escaped text spans even in no quirks mode.)

This research also showed that a lot of pages use --!> and expect it to
close the comment. --!> closes comments in WebKit and Gecko. We'll probably
make --!> close comments given this data.
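
To make the difference concrete, here is a rough Python sketch of a comment-end search that optionally accepts --!> as a terminator. The helper name find_comment_end and its allow_bang parameter are hypothetical, and a real tokenizer handles more edge cases than this regexp does.

```python
import re

def find_comment_end(text, start, allow_bang=True):
    """Return the index just past the comment terminator, or -1 if none.

    `start` is the index right after the opening "<!--".
    Hypothetical helper; a simplification of real tokenizer behaviour.
    """
    pattern = re.compile(r"--!?>" if allow_bang else r"-->")
    match = pattern.search(text, start)
    return match.end() if match else -1

page = "<!-- nav --!><p>visible</p>"
end = find_comment_end(page, 4)
print(page[end:] if end != -1 else "(comment runs to EOF)")
print(find_comment_end(page, 4, allow_bang=False))
```

With allow_bang=True the paragraph after the comment is found; with allow_bang=False the search fails and the comment runs to EOF, which is the case where reparsing would otherwise have to kick in.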

We will probably not stop reparsing comments (in quirks mode) or escaped  
text spans (at least for script and style), at least not until other  
browsers do so. Maybe we can limit reparsing of escaped text spans to  
quirks mode, but we don't particularly like parsing differences between  
modes.

From a security perspective: most servers that filter out unsafe stuff do
so with regexps (which have other flaws, of course) and so don't care
whether "<script" was found inside or outside a comment. Filters based on
html5lib or similar can just assume that comments are unsafe. Escaped text
spans in <style> might be a bit trickier though -- perhaps filters that
want to allow <style> should strip any "<!--"s in <style>s.
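
A minimal sketch of that last idea in Python, assuming the filter has already extracted the text content of a <style> element with some parser. The function name strip_comment_openers is hypothetical and this is not an html5lib API.

```python
def strip_comment_openers(css_text):
    """Remove every "<!--" from the text content of a <style> element.

    Hypothetical helper. The loop repeats until the text is stable,
    because removing one "<!--" can splice a new one together:
    "<!<!----" turns into "<!--" after a single pass.
    """
    while "<!--" in css_text:
        css_text = css_text.replace("<!--", "")
    return css_text

print(strip_comment_openers("<!-- p { color: red } -->"))
print(repr(strip_comment_openers("<!<!----")))
```

A single str.replace pass would not be enough here, which is the kind of detail that makes this case trickier than it first looks.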



What reparsing means in the different cases:

    EOF in comment:
      Rewind to the character after the first ">" in the comment's data,
      emit the comment with the characters up to that ">", and then reparse
      what comes after in the data state.
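
The comment case can be sketched roughly as follows in Python. The function name is hypothetical, and two details are my assumptions rather than anything stated above: the ">" itself is left out of the emitted comment, and a comment with no ">" at all is kept whole.

```python
def recover_unclosed_comment(data):
    """Split the data of a comment that hit EOF.

    `data` is everything after the unterminated "<!--" up to EOF.
    Returns (comment_text, remainder_to_reparse_in_data_state).
    Hypothetical helper, not taken from any browser.
    """
    gt = data.find(">")
    if gt == -1:
        # Assumption: with no ">" there is nothing to rewind to,
        # so the whole tail stays a comment.
        return data, ""
    # Assumption: the ">" is excluded from the emitted comment.
    return data[:gt], data[gt + 1:]

comment, rest = recover_unclosed_comment(" menu ><p>Hello</p>")
print(repr(comment), repr(rest))
```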

    EOF in escaped text span:
      Rewind to the last "-" character that caused the escaped text span
      flag to switch to true, switch the flag to false, and reparse from
      there. For <script>, the already executed flag is *not* set to false.
      (In Firefox the reparsing only happens once -- if a new escaped text
      span is opened in the reparse then EOF will not cause another reparse.
      IE, however, seems to reparse several times if needed. Test case:
      <textarea><!--&amp;--><!--&amp;</textarea>EOF vs.
      <textarea><!--&amp;<!--&amp;</textarea>EOF)
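
The escaped text span case, including the once-versus-repeated difference, can be modelled with a small Python scanner. This is a sketch under several assumptions: the function name and the max_reparses parameter are hypothetical, the end tag is matched as a literal string, overlapping sequences such as "<!-->" are ignored, and the script "already executed" flag is not modelled at all. It reflects my reading of the test case above, not any browser's source.

```python
def find_end_tag(text, end_tag="</textarea>", max_reparses=None):
    """Scan element content; return the index of the closing end tag, or -1.

    `text` is the input after the start tag, up to EOF.
    max_reparses=1 models reparsing once; None models reparsing as needed.
    """
    pos, escaped, rewind_to, reparses = 0, False, 0, 0
    while True:
        if pos >= len(text):  # EOF
            if escaped and (max_reparses is None or reparses < max_reparses):
                # Rewind to just after the "<!--" that set the flag.
                pos, escaped = rewind_to, False
                reparses += 1
                continue
            return -1
        if not escaped and text.startswith("<!--", pos):
            escaped = True
            pos += 4
            rewind_to = pos
        elif escaped and text.startswith("-->", pos):
            escaped = False
            pos += 3
        elif not escaped and text.startswith(end_tag, pos):
            return pos
        else:
            pos += 1

first = "<!--&amp;--><!--&amp;</textarea>"
second = "<!--&amp;<!--&amp;</textarea>"
print(find_end_tag(first, max_reparses=1))
print(find_end_tag(second, max_reparses=1))
print(find_end_tag(second))
```

In this model the first input finds its end tag after a single reparse, while the second needs two, so it only closes when repeated reparsing is allowed.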



Supporting data:

Pages that have open escaped text spans in <script> and would work better  
if we reparse:

http://brianyeedds.com/
http://cyrilvictor.photo.free.fr/
http://home.houston.rr.com/augandjc/jc.html
http://homepage.eircom.net/~doylesbandbrosslare/
http://jlc31.free.fr/AstroScope/
http://www.arkansasbaptist.edu/ (limited quirks)
http://www.assosequestrati.it/
http://www.breckskishop.net/
http://www.bunbegholidayhomes.com/
http://www.cairorugby.com/
http://www.christiesphotographic.com/
http://www.columbustubi.com/ (limited quirks)
http://www.expatries-suisses.com/
https://www.football1x2.com/
http://www.graficamente-online.it/artproject/index.htm
http://www.hotelpanamericano.cl/
http://www.insieme-gr.ch/


Pages that have open escaped text spans in <style> and would work better  
if we reparse:

http://www.3dwebstudios.com/
http://www3.sympatico.ca/sharmink/
http://www.afriendofthebride.com/
http://www.cp-modelagency.nl/
http://www.gartencenter-domatems.ch/ (has <style><!-- --><!--</style>)
http://www.gut-hohenkamp.de/


Pages that would work better if --!> would end comments (if we don't  
reparse):

http://duckriver.ponyclub.org/
http://www1.admissions.uga.edu/index.html (limited quirks)
http://www.afterstep.org/afterimage/
http://www.alcubilladeavellaneda.com/
http://www.altico.net/ (limited quirks)
http://www.andreas-taxis.de/
http://www.daanorthwest.com/
http://www.dateapplication.com/
http://www.growing.com/nonviolent/index.htm


Pages that would work better if --!> would end comments (even if we  
reparse):

http://eagle3.american.edu/~cs0218a/
http://www.7sinslounge.org/


Pages that have open comments and would work better if we reparse:

http://clubleonberg.free.fr/
http://www.angelfire.com/art/fmedieval/
http://www.astratool.com/
http://www.atholestill.com/scripts/asi.pl?id=&subid=Tenor&pers=0157
http://www.catholic.net/RCC/News/Time_Mag/popetime.html
http://www.cs.umd.edu/projects/impact/


Pages that have open comments and would work better if we *don't* reparse:

http://www.bryanadams.nu/ (not really broken but it looks like it was  
intended to comment out the form at the end)

-- 
Simon Pieters
Opera Software

Received on Wednesday, 26 March 2008 12:04:44 UTC