- From: Simon Pieters <simonp@opera.com>
- Date: Wed, 26 Mar 2008 13:03:51 +0100
- To: "public-html@w3.org" <public-html@w3.org>
We were fixing our bugs regarding reparsing, but were a bit scared to fix
reparsing of comments and escaped text spans, so I asked in #whatwg if
someone could be kind enough to provide some data on the matter...
Philip` found 128 pages with open "<!--" out of ~130K pages, listed in
http://philip.html5.org/data/pages-with-unclosed-comments.txt . I looked
through the first 82 pages. 40 of those would work better if we reparse, 1
would work slightly worse, and the rest would be unaffected. Extrapolating
that ratio to all 128 pages (about 62 of ~130K), roughly 0.05% of pages
would break if we didn't reparse.
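As a rough check of the 0.05% figure, the arithmetic can be sketched in Python. This assumes the 82 pages reviewed by hand are representative of all 128; the variable names are only illustrative.

```python
# Figures taken from the message above.
pages_sampled = 130_000        # approximate size of the page sample (~130K)
pages_with_open_comment = 128  # pages containing an unclosed "<!--"
pages_reviewed = 82            # pages inspected by hand
reviewed_better = 40           # of those, pages that work better with reparsing

# Assumption: the reviewed pages are representative of all 128.
share_better = reviewed_better / pages_reviewed
estimated_broken = share_better * pages_with_open_comment
fraction_of_sample = estimated_broken / pages_sampled

print(f"{estimated_broken:.0f} pages, {fraction_of_sample:.3%} of the sample")
# prints "62 pages, 0.048% of the sample"
```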
Opera currently doesn't reparse comments in limited/no quirks mode, and a
few of the pages listed below break in Opera because of that. (We still
reparse open escaped text spans even in no quirks mode.)
This research also showed that a lot of pages use --!> and expect it to
close the comment. --!> closes comments in WebKit and Gecko.
We'll probably make --!> close comments too, given this data.
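To illustrate what "--!> closes the comment" means, here is a minimal Python sketch. The helper name split_comment is hypothetical, and it ignores edge cases such as "<!-->", so it is not a description of how any browser actually tokenizes comments.

```python
import re
from typing import Optional, Tuple

# Accept both "-->" and "--!>" as a comment terminator.
COMMENT_END = re.compile(r"--!?>")

def split_comment(text: str, start: int) -> Optional[Tuple[str, int]]:
    """`start` is the index just after "<!--".

    Returns (comment data, index just after the terminator),
    or None if the comment is never closed.
    """
    match = COMMENT_END.search(text, start)
    if match is None:
        return None
    return text[start:match.start()], match.end()
```

For example, split_comment("<!-- a --!>b", 4) returns (" a ", 11), so parsing would resume at "b".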
We will probably not stop reparsing comments (in quirks mode) or escaped
text spans (at least for script and style), at least not until other
browsers do so. Maybe we can limit reparsing of escaped text spans to
quirks mode, but we don't particularly like parsing differences between
modes.
From a security perspective: most servers that filter out unsafe stuff do
so with regexps (which have other flaws, of course) and so don't care
whether "<script" was found inside or outside a comment. Filters based on
html5lib or similar can just assume that comments are unsafe. Escaped text
spans in <style> might be a bit trickier though -- perhaps filters that
want to allow <style> should strip any "<!--"s in <style>s.
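A filter that strips "<!--" from <style> contents could look roughly like the Python sketch below. The function name is hypothetical, and it assumes the filter has already isolated the text inside the <style> element.

```python
def strip_comment_openers(style_text: str) -> str:
    """Remove every "<!--" from the text content of a <style> element.

    A single pass is not enough: removing one opener can splice the
    surrounding characters into a new "<!--", so repeat until none remain.
    """
    while "<!--" in style_text:
        style_text = style_text.replace("<!--", "")
    return style_text
```

The loop matters for input such as "<!<!----p{}", where one replace pass would leave a fresh "<!--" behind.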
What reparsing means in the different cases:
EOF in comment:
Rewind to the character after the first ">" in the comment's data,
emit the comment with the characters up to that ">", and then reparse what
comes after in the data state.
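The comment case can be sketched in Python as below. Two things are assumptions here, not something stated above: the ">" itself is left out of the emitted comment, and an unclosed comment with no ">" at all is emitted whole with nothing left to reparse. The function name is hypothetical.

```python
from typing import Tuple

def recover_unclosed_comment(data: str) -> Tuple[str, str]:
    """`data` is everything between "<!--" and EOF.

    Returns (comment data to emit, text to reparse in the data state).
    """
    first_gt = data.find(">")
    if first_gt == -1:
        # Assumption: with no ">" there is nothing to rewind to.
        return data, ""
    # Assumption: the ">" is consumed and is not part of the comment data.
    return data[:first_gt], data[first_gt + 1:]
```

For example, recover_unclosed_comment(" note><p>hi") returns (" note", "<p>hi"), so the <p> ends up in the document again.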
EOF in escaped text span:
Rewind to the last "-" character that caused the escaped text span
flag switch to true, switch the flag to false, and reparse from there. For
<script>, the already executed flag is *not* set to false. (In Firefox the
reparsing only happens once -- if a new escaped text span is opened in the
reparse then EOF will not cause another reparse. IE, however, seems to
reparse several times if needed. Test case:
<textarea><!--&--><!--&</textarea>EOF vs.
<textarea><!--&<!--&</textarea>EOF)
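The difference between reparsing once and reparsing repeatedly can be sketched in Python. This is a simplified model, not any browser's tokenizer: it resumes just after the "<!--" that set the flag rather than on its last "-", matches the end tag case-sensitively, and does not model the already executed flag for <script>. The names scan_rcdata and max_reparses are hypothetical; max_reparses=1 stands in for the Firefox behaviour described above and None for the IE behaviour.

```python
from typing import Optional, Tuple

def scan_rcdata(content: str, end_tag: str = "</textarea",
                max_reparses: Optional[int] = None) -> Tuple[Optional[int], int]:
    """Scan the text that follows the start tag, up to EOF.

    Returns (index of the closing end tag, or None if EOF wins,
             number of reparses performed).
    """
    start = 0      # where the current (re)parse begins
    reparses = 0
    while True:
        escaped = False
        opener_end = None   # index just past the "<!--" that set the flag
        i = start
        while i < len(content):
            if not escaped and content.startswith("<!--", i):
                escaped, opener_end = True, i + 4
                i += 4
            elif escaped and content.startswith("-->", i):
                escaped = False
                i += 3
            elif not escaped and content.startswith(end_tag, i):
                return i, reparses
            else:
                i += 1
        # EOF: give up unless a span is open and a reparse is still allowed.
        if not escaped or (max_reparses is not None and reparses >= max_reparses):
            return None, reparses
        start = opener_end   # rewind, with the flag cleared
        reparses += 1
```

With the two test cases above (minus the start tag), the first input finds </textarea> after one reparse either way, while the second only finds it when a second reparse is allowed:

```python
scan_rcdata("<!--&--><!--&</textarea>", max_reparses=1)  # (13, 1)
scan_rcdata("<!--&<!--&</textarea>", max_reparses=1)     # (None, 1)
scan_rcdata("<!--&<!--&</textarea>")                     # (10, 2)
```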
Supporting data:
Pages that have open escaped text spans in <script> and would work better
if we reparse:
http://brianyeedds.com/
http://cyrilvictor.photo.free.fr/
http://home.houston.rr.com/augandjc/jc.html
http://homepage.eircom.net/~doylesbandbrosslare/
http://jlc31.free.fr/AstroScope/
http://www.arkansasbaptist.edu/ (limited quirks)
http://www.assosequestrati.it/
http://www.breckskishop.net/
http://www.bunbegholidayhomes.com/
http://www.cairorugby.com/
http://www.christiesphotographic.com/
http://www.columbustubi.com/ (limited quirks)
http://www.expatries-suisses.com/
https://www.football1x2.com/
http://www.graficamente-online.it/artproject/index.htm
http://www.hotelpanamericano.cl/
http://www.insieme-gr.ch/
Pages that have open escaped text spans in <style> and would work better
if we reparse:
http://www.3dwebstudios.com/
http://www3.sympatico.ca/sharmink/
http://www.afriendofthebride.com/
http://www.cp-modelagency.nl/
http://www.gartencenter-domatems.ch/ (has <style><!-- --><!--</style>)
http://www.gut-hohenkamp.de/
Pages that would work better if --!> ended comments (if we don't
reparse):
http://duckriver.ponyclub.org/
http://www1.admissions.uga.edu/index.html (limited quirks)
http://www.afterstep.org/afterimage/
http://www.alcubilladeavellaneda.com/
http://www.altico.net/ (limited quirks)
http://www.andreas-taxis.de/
http://www.daanorthwest.com/
http://www.dateapplication.com/
http://www.growing.com/nonviolent/index.htm
Pages that would work better if --!> ended comments (even if we
reparse):
http://eagle3.american.edu/~cs0218a/
http://www.7sinslounge.org/
Pages that have open comments and would work better if we reparse:
http://clubleonberg.free.fr/
http://www.angelfire.com/art/fmedieval/
http://www.astratool.com/
http://www.atholestill.com/scripts/asi.pl?id=&subid=Tenor&pers=0157
http://www.catholic.net/RCC/News/Time_Mag/popetime.html
http://www.cs.umd.edu/projects/impact/
Pages that have open comments and would work better if we *don't* reparse:
http://www.bryanadams.nu/ (not really broken but it looks like it was
intended to comment out the form at the end)
--
Simon Pieters
Opera Software
Received on Wednesday, 26 March 2008 12:04:44 UTC