Re: Issues arising from not reparsing from Simon Pieters on 2009-10-05 (public-html@w3.org from October 2009)

From: Simon Pieters <simonp@opera.com>
Date: Mon, 05 Oct 2009 18:34:42 +0200
To: "Henri Sivonen" <hsivonen@iki.fi>, "Ian Hickson" <ian@hixie.ch>
Cc: "HTMLWG WG" <public-html@w3.org>
Message-ID: <op.u1b5n4l2idj3kv@zcorpandell.linkoping.osa>
On Thu, 13 Aug 2009 09:26:39 +0200, Henri Sivonen <hsivonen@iki.fi> wrote:

> On Aug 12, 2009, at 22:55, Ian Hickson wrote:
>
>> On Wed, 12 Aug 2009, Henri Sivonen wrote:
>>> On Aug 12, 2009, at 12:10, Henri Sivonen wrote:
>>>
>>>> I think I'll create a wiki page with requirements and a proposed delta
>>>> spec first, though, because others on #whatwg were interested in
>>>> pondering alternative solutions given a set of requirements.
>>>
>>> Wiki page created: http://wiki.whatwg.org/wiki/CDATA_Escapes
>>
>> Wow. Please can we stick to just the current magic escapes and not add
>> even more magic?
>
> The current magic without all the magic that current browsers implement  
> leads to some incompatibilities with existing content. I don't know how  
> often a user would hit these issues, but when the problems do occur,  
> they wreck the whole page. Therefore, I think we should seriously try to  
> improve the magic so that it substitutes the current browser magic  
> better in practice while still not doing reparsing.

http://philip.html5.org/data/script-open-in-escape.txt has 622 pages.

http://philip.html5.org/data/script-close-in-escape-without-script-open-2.txt  
has 708 pages.

Most of these look like they would break with what's currently specced.

The two sets might overlap. Some of the pages are not relevant, because  
the extract might appear inside an HTML comment. The breakage can be up to  
around 1300 pages out of 425000.
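
For reference, the upper bound is just the sum of the two list sizes. A small sketch of the arithmetic, using only the numbers quoted above (the variable names are mine, and the overlap between the lists is not known, so the real figure can only be lower):

```python
# Figures from the two lists above; the lists may overlap,
# so the sum is an upper bound, not an exact count.
pages_script_open = 622    # script-open-in-escape.txt
pages_script_close = 708   # script-close-in-escape-without-script-open-2.txt
corpus_size = 425_000      # total pages in the data set

upper_bound = pages_script_open + pages_script_close
share = upper_bound / corpus_size

print(f"at most {upper_bound} pages, about {share:.2%} of the corpus")
```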


The common pattern is:

A.
<script><!--
...
//--></script>


However, there are several patterns that break with what is currently  
specced:

B.
<script><!--
...
</script>

C.
<script><!--
...
//-->
<!--</script>

D.
<script><!--
...
//-- ></script>

E.
<script><!--
...
//- -></script>

F.
<script><!--
...
//- - ></script>

G.
<script><!--
...
//-></script>

etc.


where ... can be

   1. document.write('<script></script>');
   2. document.write('<script></script><script></script>');
   3. document.write('<script></script>'); document.write('<script></script>');
   4. document.write('<script>'); document.write('</script>');
   5. document.write('<scr'+'ipt></scr'+'ipt>');
   6. document.write('<scr'+'ipt></script>');
   7. document.write('<script></scr'+'ipt>');


Proposal #3 in http://wiki.whatwg.org/wiki/CDATA_Escapes reads:

    For script, when in an escaped text span, set a flag after having seen
    "<script" followed by whitespace or slash or greater-than. "</script"
    followed by whitespace or slash or greater-than only closes the element
    if the flag is not set, and otherwise emits the text and resets the
    flag. Exiting an escaped text span also resets the flag.
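
To check the patterns above against this, here is a rough sketch of how I read the proposal. It is not spec text: the function name script_end and the simplified escape handling (an escape starts at "<!--" and ends at the next "-->") are my assumptions, and it only looks at the text that follows the <script> start tag.

```python
import re

# Hypothetical helper, simplified reading of proposal #3 (not spec text).
# "<script" / "</script" count only when followed by whitespace, "/" or ">".
OPEN_RE = re.compile(r"<script[\s/>]", re.IGNORECASE)
CLOSE_RE = re.compile(r"</script[\s/>]", re.IGNORECASE)


def script_end(text):
    """Return the index of the "</script" that closes the element, or None.

    `text` is everything after the "<script>" start tag.
    """
    escaped = False  # inside "<!--" ... "-->"
    flag = False     # saw "<script" inside the current escaped span
    i = 0
    while i < len(text):
        if not escaped and text.startswith("<!--", i):
            escaped = True
            i += 4
        elif escaped and text.startswith("-->", i):
            escaped = False
            flag = False      # exiting the escape resets the flag
            i += 3
        elif escaped and OPEN_RE.match(text, i):
            flag = True
            i += 7
        elif CLOSE_RE.match(text, i):
            if not flag:
                return i      # this end tag closes the script element
            flag = False      # emitted as text; the flag is reset
            i += 8
        else:
            i += 1
    return None
```

Feeding it pattern A with (1) returns the position of the final end tag, as intended. With (6) it returns the position of the "</script>" inside the string literal, and with (7) combined with B it returns None, i.e. the script never closes.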


It breaks with (6) combined with any of A-G. I found 3 sites doing this.

www.grandparents.com/gp/content/expert-advice/family-matters/article/thatevildaughterinlaw.html
www.celebrity-link.com/c106/showcelebrity_categoryid-10687.html
me.yaplog.jp/viewBoard.blog?boardId=975

It also breaks for (7) combined with B or D-G (note that what's currently  
specced also breaks here). I found 1 site doing this.

www.jeuxactu.com/images-fiche-soul-calibur-legends-8219-4-6.html


The sites appear to have one or two (or three) pages with the relevant  
script. This makes proposal #3 break for something on the order of 10  
pages out of 425000. This is surprisingly close to the current behavior of  
doing reparsing. (Not reparsing leads to better performance since you  
don't need to wait for the whole page to have loaded before deciding where  
the script should end, and it doesn't have the security issue.)

I can't come up with a different proposal that breaks fewer pages.


> Here are points that need research, in my opinion:
>
>   1) Would removing the escape flag from xmp, title and textarea improve  
> or degrade Web compat given no reparsing? To research this, I suggest  
> parsing a substantial body of Web content with the current parsing  
> algorithm and then grepping the text content of every xmp element for  
> |<!--.*</xmp| (ignoring case and letting . match over line breaks).  
> (Likewise for textarea and title, except rejecting hits where any part  
> of "<!--" or "</title" has been entity-escaped.) Basically, if there are  
> almost no hits, it would be safer to zap the escape flag from these  
> elements, because accidentally having <!-- eat up the rest of the page  
> is worse than terminating one of these elements prematurely very rarely.

Not researched yet. I haven't really thought about what treatment the  
other (R)CDATA elements should have.
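
If someone wants to run it, the xmp search could look roughly like this. It is only a sketch: the function name is made up, and it assumes the text content of each xmp element has already been extracted by a parser implementing the current algorithm.

```python
import re

# |<!--.*</xmp|, ignoring case and letting "." match over line breaks.
XMP_ESCAPE_RE = re.compile(r"<!--.*</xmp", re.IGNORECASE | re.DOTALL)


def escape_swallowed_end_tag(xmp_text):
    """True if the text content contains "<!--" followed later by "</xmp".

    `xmp_text` is assumed to be the text content of one xmp element
    (hypothetical input; extraction is not shown here).
    """
    return XMP_ESCAPE_RE.search(xmp_text) is not None
```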


>   2) Would making comments and escape runs close on --\s+!> improve or  
> degrade Web compat given no reparsing? To research this, I suggest  
> grepping a substantial body of Web content for |--\s+!>| and analyzing  
> the hits.
>
>   3) Would making --!> and --\s+> close escapes improve or degrade Web  
> compat given no reparsing? To research this, I suggest parsing a  
> substantial body of Web content with the current parsing algorithm and  
> then grepping the text content of every script and style element for  
> |--!>| and |--\s+>| and analyzing the hits.

http://philip.html5.org/data/script-close-in-escape-without-script-open-2.txt  
has a few pages with --!> in script, but they would work anyway with  
proposal #3.
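
For anyone repeating the search, the two expressions from (3) can be combined as below. This is a sketch with a made-up function name, and it assumes the script or style text content has already been extracted.

```python
import re

# |--!>| and |--\s+>| from point (3), combined into one expression.
CLOSE_VARIANT_RE = re.compile(r"--!>|--\s+>")


def close_variants(script_text):
    """Return every "--!>" or "--<whitespace>>" sequence in the text."""
    return [m.group(0) for m in CLOSE_VARIANT_RE.finditer(script_text)]
```

Note that this catches pattern D (//-- >) but not E, F or G, which break the "--" itself rather than what follows it.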


>   4) Would making <!-- not open an escape when there's non-whitespace on  
> the line before it improve or degrade Web compat given no reparsing? To  
> research this, I suggest parsing a substantial body of Web content with  
> the current parsing algorithm and then grepping the text content of  
> every script and style element for |^.*\S.*<!--| and analyzing the hits.

Most pages from the data have <!-- with just whitespace on the line before  
it, but then either have no close comment, have written it with the wrong  
syntax, or have the close comment but then open up another escape just  
before the end tag.
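
The check from (4) can be written like this. Again a sketch with a made-up function name; the input is assumed to be the text content of one script or style element.

```python
import re

# |^.*\S.*<!--| applied per line: "<!--" preceded on the same line
# by at least one non-whitespace character.
NONBLANK_BEFORE_ESCAPE_RE = re.compile(r"^.*\S.*<!--", re.MULTILINE)


def has_text_before_escape(script_text):
    """True if some line has non-whitespace before a "<!--"."""
    return NONBLANK_BEFORE_ESCAPE_RE.search(script_text) is not None
```

A line consisting of only whitespace and <!-- does not match, which is the case most pages in the data fall into.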


> Hixie, have you already run these analyses? If not, it would be awesome  
> if someone who already maintains the capability to run these searches  
> could run them. (I volunteer to perform the "analyze the hits" parts,  
> but I don't currently have the readiness to run the searches.)

Thanks to Philip` for running searches.

-- 
Simon Pieters
Opera Software
Received on Monday, 5 October 2009 16:35:29 UTC