- From: Simon Pieters <simonp@opera.com>
- Date: Sun, 25 Oct 2009 12:03:19 +0200
- To: "Ian Hickson" <ian@hixie.ch>
- Cc: "Henri Sivonen" <hsivonen@iki.fi>, "HTMLWG WG" <public-html@w3.org>
On Sat, 17 Oct 2009 11:40:48 +0300, Simon Pieters <simonp@opera.com> wrote:

>> We still need the <!--/--> magic nonsense in other elements, right? Like
>> <textarea><!--</textarea>--></textarea>?
>
> No, I think only script needs the magic.

Short story: confirmed.

Long story:

http://philip.html5.org/data/cdata-containing-self-close.txt

This contains the textContent of textarea, xmp etc elements from Philip's dotbot sites, using the V.nu HTML parser, for elements that contain the string "</xmp" for xmp and so forth. For RCDATA elements, there are some false positives because the site could have used &lt;/textarea instead of </textarea, but it's easy to exclude these by looking at whether they're inside an escaped text span or not.

http://simon.html5.org/dump/cdata-containing-self-close.xml is the same data but with a set of alternative style sheets (one for each element) so I could analyze it easily.

<title>
26 occurrences, some of which are &lt; false positives. The rest look like an encoding problem eating the "<" character in "</title", which puts the rest of the file inside the title. Solvable by getting the encoding right.

<textarea>
Just &lt; false positives.

<noscript>
It looks like V.nu was run with "scripting disabled" so <noscript> was parsed as PCDATA. The occurrences here are cases like <noscript><iframe></noscript>. These are useless for the purposes of this research.

<noframes>
One occurrence, which has <noframes><body><script><!--...</script></body></noframes></html>, so it would be fine either way, but it would be theoretically nicer not to support escapedness.

<style>
52 occurrences. Most are of the pattern <style><!--...</style>...more markup..., which means style shouldn't have escapedness. Having escapedness can eat up the rest of the page for some of these pages, which is really bad.
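To make the difference concrete, here is a minimal Python sketch of where a <style> element would end with and without escapedness. It is not the V.nu tokenizer or the spec algorithm: the function name style_end and the simplified rule (an end tag is ignored when the nearest preceding "<!--" has no "-->" before the end tag) are assumptions made for illustration only.

```python
import re

# Matches the start of a </style end tag (hypothetical, simplified matcher).
END_TAG = re.compile(r"</style[\s/>]", re.IGNORECASE)


def style_end(text: str, escapedness: bool) -> int:
    """Return the index of the "</style" that closes the element, or -1.

    `text` is everything after the opening <style> tag. With
    escapedness=False the first end tag wins. With escapedness=True an
    end tag is skipped while an unclosed "<!--" precedes it
    (simplified rule, assumed for this sketch).
    """
    pos = 0
    while True:
        m = END_TAG.search(text, pos)
        if m is None:
            return -1  # no usable end tag: the rest of the page is eaten
        if not escapedness:
            return m.start()
        before = text[:m.start()]
        open_at = before.rfind("<!--")
        # Not escaped if there is no "<!--", or it was closed by "-->".
        if open_at == -1 or before.find("-->", open_at + 4) != -1:
            return m.start()
        pos = m.end()  # end tag is inside an escaped span, keep looking


# The common pattern: the trailing --> was forgotten.
page = "<!-- p { color: red } </style><p>rest of page</p>"
print(style_end(page, escapedness=False))
print(style_end(page, escapedness=True))
```

For the common pattern, the call without escapedness finds the first </style, while the call with escapedness finds no end tag at all, so everything after the style sheet would end up inside <style>.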
Example pages:

  sonidos.lopeor.com/sonidos/?pnum=5&id_cat=6&ord=1
  www.pornotubeitalia.com/200709/anteprima-video-di-anzia.php
  www.geomatics.ncku.edu.tw/download/.soft/6/Buying-xanax.html
  www.itarunsearch.com/willowxx400/nikki.html
  etc (those were just from the first 9)

www.kaneva.com/channel/Whistler.people includes bogus markup in the middle of an escaped style block.

www.thecomeupboard.com/forum/showthread.php?t=7436&page=6 includes a "nested" style block, like so: <style><!--...<style><!--...--!></style>--></style>.

www.arabshome.com/vb/showthread.php?t=2517 has <style><!--...</style>...--></style> where ... is CSS that seems to be intended to be applied; this seems incompatible with the common pattern of just forgetting the trailing -->.

Some pages, like www.cuentayrazon.org/modules.php?op=modload&name=Publications&file=index&p_op=showtopic&secid=17&topicid=37, have some Word HTML inside <style>.

  www.mudherclub.com/forum/showthread.php?t=1922
  tl4demo.com/vb/calendar.php?do=add&c=1&day=2008-9-7
  www.w15w.com/vb/showthread.php?goto=lastpost&t=48162
  bbs.panhwa.info/forum/userinfo.asp?name=trueallen

These have <style><!--...</style>--></style>, which would show the characters --> without escapedness; incompatible with the common pattern.

kraftylibrarian.com/2005_01_01_archive.html is interesting: it has <style><!--...</style><!-- --><style>@import ...</style>.

holdon.blog.hr/2007/05/1622690040/hoces-li-se-sjetit.html has <style>...<style><!--...</style><!-- --></style>. Works equally well with or without escapedness.

70030.netministry.com/apps/articles/default.asp?articleid=34628&columnid=3803 has <style>...<!--[if IE]><style>...</style><![endif]--></style>. I'm pretty sure conditional comments don't work inside <style> in IE. Works equally well with or without escapedness.

<xmp>
No occurrences.

<iframe>
33 occurrences. Looks like all of these have an encoding problem eating up the first dash in "-->" in <iframe>...<!--iframeGIBBERISH->...<!--/iframeGIBBERISH->...</iframe>.
Fixing the encoding problem would make these work equally well with or without escapedness; as long as the encoding problem is there, they work better without escapedness.

<noembed>
No occurrences.

Conclusion

None of these should have escapedness magic. <style> is problematic but would not really benefit from the same treatment we have for <script> because the patterns are different. I can't come up with something that would make the problematic pages work better. If <style> supports escapedness, then there are some pages that go blank because they forget to include the -->. Not supporting escapedness results in minor breakage for some pages, such as displaying "-->", or revealing some style rules or content that was really intended to be hidden (hard to say what was intended).

Another conclusion is that it's important to get the encoding right, and V.nu doesn't in some cases.

Further research

We could rerun the same data collection, but this time with "scripting enabled" so that we can get data for <noscript>, and also include <script> so we could get more accurate results than the regexp searches, in order to find out whether the double escape algorithm can be tweaked somehow for better compat or less complexity.

-- 
Simon Pieters
Opera Software
Received on Sunday, 25 October 2009 11:04:08 UTC