Re: Issues arising from not reparsing

On Sat, 17 Oct 2009 11:40:48 +0300, Simon Pieters <simonp@opera.com> wrote:

>> We still need the <!--/--> magic nonsense in other elements, right? Like
>> <textarea><!--</textarea>--></textarea>?
>
> No, I think only script needs the magic.

Short story: confirmed.


Long story:

http://philip.html5.org/data/cdata-containing-self-close.txt

This contains the textContent of textarea, xmp, etc. elements from Philip's
dotbot sites, as parsed by the V.nu HTML parser, limited to elements whose
content contains their own end tag string ("</xmp" for xmp, and so forth).
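
As a rough illustration of that selection criterion (not the actual
collection script, which isn't shown here), a minimal Python sketch could
look like this. The function name and the sample records are made up for
the example.

```python
def contains_own_close_tag(element_name: str, text_content: str) -> bool:
    """Return True if text_content contains "</" + element_name, ignoring case."""
    needle = "</" + element_name.lower()
    return needle in text_content.lower()


# Hypothetical sample records: (element name, textContent).
samples = [
    ("xmp", "plain text"),
    ("style", "<!-- p { color: red } </STYLE> -->"),
    ("textarea", "type </textarea> to close"),
    ("title", "Home page"),
]

# Keep only the records whose content mentions the element's own end tag.
hits = [(name, text) for name, text in samples if contains_own_close_tag(name, text)]
```

The match is a plain case-insensitive substring test, so it also picks up
the &lt; false positives described below.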

For RCDATA elements, there are some false positives because the site could  
have used &lt;/textarea instead of </textarea, but it's easy to exclude  
these by looking at whether they're inside an escaped text span or not.
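
A sketch of that check, assuming "escaped text span" means the text between
"<!--" and the next "-->", and assuming we only look at the first
occurrence of the end tag string in the textContent. The function name is
hypothetical.

```python
def in_escaped_span(text_content: str, element_name: str) -> bool:
    """True if the first "</name" sits after a "<!--" with no "-->" in between."""
    lowered = text_content.lower()
    pos = lowered.find("</" + element_name.lower())
    if pos == -1:
        return False  # no end tag string at all
    open_pos = lowered.rfind("<!--", 0, pos)
    if open_pos == -1:
        return False  # not preceded by "<!--": must have come from &lt;
    # Look for a "-->" that closes the span before the end tag string.
    close_pos = lowered.find("-->", open_pos + 4, pos)
    return close_pos == -1
```

An occurrence for which this returns False can only have come from
&lt;/textarea in the source, so it is a false positive.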

http://simon.html5.org/dump/cdata-containing-self-close.xml is the same  
data but with a set of alternative style sheets (one for each element) so  
I could analyze it easily.


<title>
26 occurrences, some of which are &lt; false positives. Looks like an
encoding problem eating the "<" character in "</title", which puts the rest
of the file inside the title. Solvable by getting the encoding right.


<textarea>
Just &lt; false positives.


<noscript>
It looks like V.nu was run with "scripting disabled" so <noscript> was  
parsed as PCDATA. The occurrences here are cases like  
<noscript><iframe></noscript>. These are useless for the purposes of this  
research.


<noframes>
One occurrence, which has
<noframes><body><script><!--...</script></body></noframes></html>, so it
would be fine either way, but it would be theoretically nicer not to
support escapedness.


<style>
52 occurrences.

Most are of the pattern <style><!--...</style>...more markup..., which  
means style shouldn't have escapedness. Having escapedness can eat up the  
rest of the page for some of these pages, which is really bad. Examples:
sonidos.lopeor.com/sonidos/?pnum=5&id_cat=6&ord=1
www.pornotubeitalia.com/200709/anteprima-video-di-anzia.php
www.geomatics.ncku.edu.tw/download/.soft/6/Buying-xanax.html
www.itarunsearch.com/willowxx400/nikki.html
etc. (those were just from the first 9)
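
To make the failure mode concrete, here is a deliberately simplified Python
model of where a <style> element's content ends with and without
escapedness. It is not the spec's tokenizer: it only recognises a literal
"<style>" start tag without attributes, treats any "</style" as an end tag,
and the function name is made up.

```python
def style_end(html: str, escapedness: bool) -> int:
    """Index of the "</style" that ends the first <style>, or -1 if none does."""
    lowered = html.lower()
    start = lowered.find("<style>")
    if start == -1:
        return -1
    pos = start + len("<style>")
    while True:
        close = lowered.find("</style", pos)
        if close == -1:
            return -1  # content runs to the end of the page
        if not escapedness:
            return close  # first end tag always wins
        comment = lowered.find("<!--", pos, close)
        if comment == -1:
            return close  # end tag is outside any escaped span
        comment_end = lowered.find("-->", comment + 4)
        if comment_end == -1:
            return -1  # unterminated span swallows the rest of the page
        pos = comment_end + 3  # resume scanning after the escaped span


# The common pattern: the author forgot the trailing "-->".
page = "<style><!-- p{color:red} </style><p>Hello</p>"
without = style_end(page, escapedness=False)
with_escaping = style_end(page, escapedness=True)
```

For this page, `without` points at the "</style" the author intended, while
`with_escaping` is -1, i.e. the paragraph after the style block ends up
inside the style element.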

www.kaneva.com/channel/Whistler.people includes bogus markup in the middle  
of an escaped style block.

www.thecomeupboard.com/forum/showthread.php?t=7436&page=6 includes a  
"nested" style block, like so  
<style><!--...<style><!--...--!></style>--></style>.

www.arabshome.com/vb/showthread.php?t=2517 has  
<style><!--...</style>...--></style> where ... is CSS that seems to be  
intended to be applied; this seems incompatible with the common pattern of  
just forgetting the trailing -->.

Some pages like  
www.cuentayrazon.org/modules.php?op=modload&name=Publications&file=index&p_op=showtopic&secid=17&topicid=37  
have some Word HTML inside <style>.

www.mudherclub.com/forum/showthread.php?t=1922
tl4demo.com/vb/calendar.php?do=add&c=1&day=2008-9-7
www.w15w.com/vb/showthread.php?goto=lastpost&t=48162
bbs.panhwa.info/forum/userinfo.asp?name=trueallen
have <style><!--...</style>--></style>, which would show the characters  
--> without escapedness; incompatible with the common pattern.

kraftylibrarian.com/2005_01_01_archive.html is interesting: it has  
<style><!--...</style><!-- --><style>@import ...</style>.

holdon.blog.hr/2007/05/1622690040/hoces-li-se-sjetit.html has
<style>...<style><!--...</style><!-- --></style>. Works equally well with
or without escapedness.

70030.netministry.com/apps/articles/default.asp?articleid=34628&columnid=3803
has <style>...<!--[if IE]><style>...</style><![endif]--></style>. I'm
pretty sure conditional comments don't work inside <style> in IE. Works
equally well with or without escapedness.


<xmp>
No occurrences.


<iframe>
33 occurrences. Looks like all of these have an encoding problem eating up
the first dash in "-->" in
<iframe>...<!--iframeGIBBERISH->...<!--/iframeGIBBERISH->...</iframe>.
Fixing the encoding problem would make these work equally well with or
without escapedness; with the encoding problem, they work better without
escapedness.


<noembed>
No occurrences.


Conclusion

None of these should have escapedness magic. <style> is problematic but  
would not really benefit from the same treatment we have for <script>  
because the patterns are different. I can't come up with something that  
would make the problematic pages work better. If <style> supports
escapedness, then some pages go blank because they forget to include the
-->. Not supporting escapedness results in minor breakage for some pages,
such as displaying "-->" or revealing some style rules or content that was
really intended to be hidden (hard to say what was intended).

Another conclusion is that it's important to get the encoding right and  
V.nu doesn't in some cases.


Further research

We could rerun the same data collection but this time with "scripting  
enabled" so that we can get data for <noscript>, and also include <script>  
so we could get more accurate results than the regexp searches, in order  
to find out whether the double escape algorithm can be tweaked somehow for  
better compat or less complexity.

-- 
Simon Pieters
Opera Software

Received on Sunday, 25 October 2009 11:04:08 UTC