- From: Simon Pieters <simonp@opera.com>
- Date: Wed, 26 Mar 2008 13:03:51 +0100
- To: "public-html@w3.org" <public-html@w3.org>

We were fixing our bugs regarding reparsing, but were a bit scared to fix reparsing of comments and escaped text spans, so I asked in #whatwg if someone could be kind enough to provide some data on the matter...

Philip` found 128 pages with open "<!--" out of ~130K pages, listed in http://philip.html5.org/data/pages-with-unclosed-comments.txt . I looked through the first 82 pages. 40 of those would work better if we reparse, 1 would work slightly worse, and the rest would be unaffected. This means that about 0.05% of pages would break if we didn't reparse (128 of ~130K pages is roughly 0.1%, and about half of the sampled pages, 40 of 82, work better with reparsing).

Opera currently doesn't reparse comments in limited/no quirks mode, and a few of the pages below break in Opera because of that. (We still reparse open escaped text spans even in no quirks mode.)

This research also showed that a lot of pages use --!> and expect it to close the comment. --!> closes comments in WebKit and Gecko.

We'll probably make --!> close comments given this data. We will probably not stop reparsing comments (in quirks mode) or escaped text spans (at least for script and style), at least not until other browsers do so. Maybe we can limit reparsing of escaped text spans to quirks mode, but we don't particularly like parsing differences between modes.

From a security perspective: most servers that filter out unsafe stuff do so with regexps (which have other flaws, of course) and so don't care whether "<script" was found inside or outside a comment. Filters based on html5lib or similar can just assume that comments are unsafe. Escaped text spans in <style> might be a bit trickier though -- perhaps filters that want to allow <style> should strip any "<!--"s in <style>s.

What reparsing means in the different cases:

EOF in comment: Rewind to the character after the first ">" in the comment's data, emit the comment with the characters up to that ">", and then reparse what comes after in the data state.
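
To make the EOF-in-comment rule concrete, here is a minimal Python sketch. It is not how Opera or any other browser implements this; the function name and the token tuples are made up for illustration, it only knows about text and comments, it does not treat --!> as a comment end, and it assumes that "up to that '>'" means the ">" itself is dropped from the comment. If the unclosed comment contains no ">" at all, the sketch simply emits the rest of the input as the comment, which is also an assumption.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (kind, data); kind is "text" or "comment"


def tokenize_with_comment_reparse(source: str) -> List[Token]:
    """Split source into text and comment tokens, reparsing an unclosed comment."""
    tokens: List[Token] = []
    pos = 0
    while pos < len(source):
        open_at = source.find("<!--", pos)
        if open_at == -1:
            # No more comments: the rest is plain text (data state).
            tokens.append(("text", source[pos:]))
            break
        if open_at > pos:
            tokens.append(("text", source[pos:open_at]))
        data_start = open_at + len("<!--")
        close_at = source.find("-->", data_start)
        if close_at != -1:
            # Properly closed comment.
            tokens.append(("comment", source[data_start:close_at]))
            pos = close_at + len("-->")
            continue
        # EOF inside the comment: look for the first ">" in the comment's data.
        gt = source.find(">", data_start)
        if gt == -1:
            # Assumption: with no ">" to rewind to, everything stays a comment.
            tokens.append(("comment", source[data_start:]))
            break
        # Emit the comment up to that ">", then resume after it in the data state.
        tokens.append(("comment", source[data_start:gt]))
        pos = gt + 1
    return tokens
```

For example, tokenize_with_comment_reparse("a<!-- x <b>bold</b> tail") yields a text token "a", a comment token " x <b", and a text token "bold</b> tail", so the content after the first ">" becomes visible again instead of being swallowed by the comment.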

EOF in escaped text span: Rewind to the last "-" character that caused the escaped text span flag to switch to true, switch the flag to false, and reparse from there. For <script>, the already executed flag is *not* set to false. (In Firefox the reparsing only happens once -- if a new escaped text span is opened in the reparse then EOF will not cause another reparse. IE however seems to reparse several times if needed. Test case:

<textarea><!--&--><!--&</textarea>EOF

vs.

<textarea><!--&<!--&</textarea>EOF

)

Supporting data:

Pages that have open escaped text spans in <script> and would work better if we reparse:

http://brianyeedds.com/
http://cyrilvictor.photo.free.fr/
http://home.houston.rr.com/augandjc/jc.html
http://homepage.eircom.net/~doylesbandbrosslare/
http://jlc31.free.fr/AstroScope/
http://www.arkansasbaptist.edu/ (limited quirks)
http://www.assosequestrati.it/
http://www.breckskishop.net/
http://www.bunbegholidayhomes.com/
http://www.cairorugby.com/
http://www.christiesphotographic.com/
http://www.columbustubi.com/ (limited quirks)
http://www.expatries-suisses.com/
https://www.football1x2.com/
http://www.graficamente-online.it/artproject/index.htm
http://www.hotelpanamericano.cl/
http://www.insieme-gr.ch/

Pages that have open escaped text spans in <style> and would work better if we reparse:

http://www.3dwebstudios.com/
http://www3.sympatico.ca/sharmink/
http://www.afriendofthebride.com/
http://www.cp-modelagency.nl/
http://www.gartencenter-domatems.ch/ (has <style><!-- --><!--</style>)
http://www.gut-hohenkamp.de/

Pages that would work better if --!> would end comments (if we don't reparse):

http://duckriver.ponyclub.org/
http://www1.admissions.uga.edu/index.html (limited quirks)
http://www.afterstep.org/afterimage/
http://www.alcubilladeavellaneda.com/
http://www.altico.net/ (limited quirks)
http://www.andreas-taxis.de/
http://www.daanorthwest.com/
http://www.dateapplication.com/
http://www.growing.com/nonviolent/index.htm

Pages that would work better if --!> would end comments (even if we reparse):

http://eagle3.american.edu/~cs0218a/
http://www.7sinslounge.org/

Pages that have open comments and would work better if we reparse:

http://clubleonberg.free.fr/
http://www.angelfire.com/art/fmedieval/
http://www.astratool.com/
http://www.atholestill.com/scripts/asi.pl?id=&subid=Tenor&pers=0157
http://www.catholic.net/RCC/News/Time_Mag/popetime.html
http://www.cs.umd.edu/projects/impact/

Pages that have open comments and would work better if we *don't* reparse:

http://www.bryanadams.nu/ (not really broken but it looks like it was intended to comment out the form at the end)

--
Simon Pieters
Opera Software

Received on Wednesday, 26 March 2008 12:04:44 UTC