- From: Simon Pieters <simonp@opera.com>
- Date: Mon, 05 Oct 2009 18:34:42 +0200
- To: "Henri Sivonen" <hsivonen@iki.fi>, "Ian Hickson" <ian@hixie.ch>
- Cc: "HTMLWG WG" <public-html@w3.org>
On Thu, 13 Aug 2009 09:26:39 +0200, Henri Sivonen <hsivonen@iki.fi> wrote:

> On Aug 12, 2009, at 22:55, Ian Hickson wrote:
>
>> On Wed, 12 Aug 2009, Henri Sivonen wrote:
>>> On Aug 12, 2009, at 12:10, Henri Sivonen wrote:
>>>
>>>> I think I'll create a wiki page with requirements and a proposed delta
>>>> spec first, though, because others on #whatwg were interested in
>>>> pondering alternative solutions given a set of requirements.
>>>
>>> Wiki page created: http://wiki.whatwg.org/wiki/CDATA_Escapes
>>
>> Wow. Please can we stick to just the current magic escapes and not add
>> even more magic?
>
> The current magic without all the magic that current browsers implement
> leads to some incompatibilities with existing content. I don't know how
> often a user would hit these issues, but when the problems do occur,
> they wreck the whole page. Therefore, I think we should seriously try to
> improve the magic so that it substitutes the current browser magic
> better in practice while still not doing reparsing.

http://philip.html5.org/data/script-open-in-escape.txt has 622 pages.

http://philip.html5.org/data/script-close-in-escape-without-script-open-2.txt
has 708 pages.

Most of these look like they would break with what's currently specced.
The two sets might overlap. Some of the pages are not relevant, because
the extract might appear inside an HTML comment. The breakage can be up to
around 1300 pages out of 425000.

The common pattern is:

A. <script><!-- ... //--></script>

However, there are several patterns that break with what is currently
specced:

B. <script><!-- ... </script>
C. <script><!-- ... //--> <!--</script>
D. <script><!-- ... //-- ></script>
E. <script><!-- ... //- -></script>
F. <script><!-- ... //- - ></script>
G. <script><!-- ... //-></script>
etc.

where ... can be

1. document.write('<script></script>');
2. document.write('<script></script><script></script>');
3. document.write('<script></script>'); document.write('<script></script>');
4. document.write('<script>'); document.write('</script>');
5. document.write('<scr'+'ipt></scr'+'ipt>');
6. document.write('<scr'+'ipt></script>');
7. document.write('<script></scr'+'ipt>');

Proposal #3 in http://wiki.whatwg.org/wiki/CDATA_Escapes reads:

   For script, when in an escaped text span, set a flag after having seen
   "<script" followed by whitespace or slash or greater-than. "</script"
   followed by whitespace or slash or greater-than only closes the element
   if the flag is not set, and otherwise emits the text and resets the
   flag. Exiting an escaped text span also resets the flag.

It breaks with (6) combined with any of A-G. I found 3 sites doing this.

www.grandparents.com/gp/content/expert-advice/family-matters/article/thatevildaughterinlaw.html
www.celebrity-link.com/c106/showcelebrity_categoryid-10687.html
me.yaplog.jp/viewBoard.blog?boardId=975

It also breaks for (7) combined with B or D-G (note that what's currently
specced also breaks here). I found 1 site doing this.

www.jeuxactu.com/images-fiche-soul-calibur-legends-8219-4-6.html

The sites appear to have one or two (or three) pages with the relevant
script. This makes proposal #3 break for something on the order of 10
pages out of 425000. This is surprisingly close to the current behavior of
doing reparsing. (Not reparsing leads to better performance since you
don't need to wait for the whole page to have loaded before deciding where
the script should end, and it doesn't have the security issue.)

I can't come up with a different proposal that breaks fewer pages.

> Here are points that need research, in my opinion:
>
> 1) Would removing the escape flag from xmp, title and textarea improve
> or degrade Web compat given no reparsing? To research this, I suggest
> parsing a substantial body of Web content with the current parsing
> algorithm and then grepping the text content of every xmp element for
> |<!--.*</xmp| (ignoring case and letting . match over line breaks).
> (Likewise for textarea and title, except rejecting hits where any part
> of "<!--" or "</title" has been entity-escaped.) Basically, if there are
> almost no hits, it would be safer to zap the escape flag from these
> elements, because accidentally having <!-- eat up the rest of the page
> is worse than terminating one of these elements prematurely very rarely.

Not researched yet. I haven't really thought about what treatment the
other (R)CDATA elements should have.

> 2) Would making comments and escape runs close on --\s+!> improve or
> degrade Web compat given no reparsing? To research this, I suggest
> grepping |--\s+!>| a substantial body of Web content and analyzing the
> hits.
>
> 3) Would making --!> and --\s+> close escapes improve or degrade Web
> compat given no reparsing? To research this, I suggest parsing a
> substantial body of Web content with the current parsing algorithm and
> then grepping the text content of every script and style element for
> |--!>| and |--\s+>| and analyzing the hits.

http://philip.html5.org/data/script-close-in-escape-without-script-open-2.txt
has a few pages with --!> in script, but they would work anyway with
proposal #3.

> 4) Would making <!-- not open an escape when there's non-whitespace on
> the line before it improve or degrade Web compat given no reparsing? To
> research this, I suggest parsing a substantial body of Web content with
> the current parsing algorithm and then grepping the text content of
> every script and style element for |^.*\S.*<!--| and analyzing the hits.

Most pages from the data have <!-- with just whitespace on the line before
it, but then either have no close comment, have written it with the wrong
syntax, or have the close comment but then open up another escape just
before the end tag.

> Hixie, have you already run these analyses? If not, it would be awesome
> if someone who already maintains the capability to run these searches
> could run them. (I volunteer to perform the "analyze the hits" parts,
> but I don't currently have the readiness to run the searches.)

Thanks to Philip` for running searches.

-- 
Simon Pieters
Opera Software
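
For readers who want to try the combinations above by hand, here is a minimal Python sketch of the flag logic in proposal #3 as quoted in this message. It is an illustration under stated assumptions, not the spec text and not any browser's implementation: the function name find_script_end is hypothetical, the input is assumed to be the text that follows the <script> start tag, and an escaped text span is simplified to "opens at <!--, closes at -->".

```python
import re

# Tokens of interest: escape open, escape close, and "<script" / "</script"
# followed by whitespace, slash or greater-than (checked with a lookahead).
_TOKEN = re.compile(r"<!--|-->|</?script(?=[\s/>])", re.IGNORECASE)


def find_script_end(content: str) -> int:
    """Return the index of the "</script" that closes the element, or -1.

    Hypothetical helper. Assumes `content` starts right after the
    <script> start tag and that escapes are delimited by <!-- and -->.
    """
    in_escape = False    # inside an escaped text span
    seen_open = False    # proposal #3 flag: "<script" seen inside the escape
    for match in _TOKEN.finditer(content):
        token = match.group(0).lower()
        if token == "<!--":
            in_escape = True
        elif token == "-->":
            # Exiting an escaped text span also resets the flag.
            in_escape = False
            seen_open = False
        elif token == "<script":
            if in_escape:
                seen_open = True
        else:  # "</script"
            if in_escape and seen_open:
                # Emitted as text; the flag is reset.
                seen_open = False
            else:
                return match.start()
    return -1


# Pattern A with content 1: the last end tag closes the element.
pattern_a1 = "<!-- document.write('<script></script>'); //--></script>"
# Pattern A with content 6: the element closes early, inside the string.
pattern_a6 = "<!-- document.write('<scr'+'ipt></script>'); //--></script>"
# Pattern B with content 7: the flag stays set, so nothing closes the element.
pattern_b7 = "<!-- document.write('<script></scr'+'ipt>'); </script>"

for sample in (pattern_a1, pattern_a6, pattern_b7):
    print(find_script_end(sample), repr(sample))
```

A return value of -1 corresponds to the case where the rest of the page would be swallowed by the script element; a return value that points into the document.write() string corresponds to the script being cut off early.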
Received on Monday, 5 October 2009 16:35:29 UTC