Issues arising from not reparsing from Henri Sivonen on 2009-08-10 (public-html@w3.org from August 2009)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Mon, 10 Aug 2009 14:14:48 +0300
To: HTMLWG WG <public-html@w3.org>
Message-Id: <E70F006A-1595-4AA3-AC8F-A4CB0F42B01A@iki.fi>

Firefox nightlies have had an HTML5 parser implementation behind a  
pref for a month now. The Web compat issues that have been uncovered  
have been surprisingly few, which is great.

However, there are three Web compat issues that don't have trivial  
fixes. They all are related to the HTML5 parsing algorithm not  
recovering from errors by rewinding the stream and reparsing with  
different rules. As such, if these are treated as bugs, they are spec  
bugs.

  1) When the string "<!--" occurs inside a string literal in
JavaScript, it starts an escape that hides </script>, and the rest of
the page is eaten into the script.
https://bugzilla.mozilla.org/show_bug.cgi?id=503632

  2) When a script starts with <script><!-- but doesn't end with
--></script> (ends with only </script>), the rest of the page is
eaten into the script.
https://bugzilla.mozilla.org/show_bug.cgi?id=504941

  3) When there's no </title> end tag, the page gets eaten into the  
title.
https://bugzilla.mozilla.org/show_bug.cgi?id=508075
see also
https://bugs.webkit.org/show_bug.cgi?id=3905
https://bugzilla.mozilla.org/show_bug.cgi?id=42945
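
To make cases 1 and 2 concrete, here is a minimal sketch of the
no-reparse behavior. It is a deliberately simplified model, not the
spec's tokenizer: the function name script_end and the single
"escaped" flag are assumptions made for illustration. The input is the
text that follows the <script> start tag.

```python
def script_end(text: str) -> int:
    """Return the index of the "</script" that ends the script, or -1.

    Simplified model: once "<!--" has been seen, "</script" is ignored
    until a "-->" is seen. There is no rewinding on reaching EOF.
    """
    lower = text.lower()
    escaped = False
    for i in range(len(text)):
        if escaped:
            # Inside an escape only "-->" is significant.
            if lower.startswith("-->", i):
                escaped = False
        elif lower.startswith("<!--", i):
            escaped = True
        elif lower.startswith("</script", i):
            return i
    # EOF reached while still looking: the rest of the page was eaten.
    return -1


# Case 1: "<!--" inside a JavaScript string literal.
case1 = 'var s = "<!--";</script><p>rest of page</p>'
# Case 2: script opens with <!-- but ends with only </script>.
case2 = '<!--\nalert(1);\n</script><p>rest of page</p>'
# Well-formed escape for comparison.
ok = '<!--\nalert(1);\n--></script><p>rest of page</p>'

print(script_end(case1))  # -1: no end tag is recognized
print(script_end(case2))  # -1: no end tag is recognized
print(script_end(ok) == ok.index("</script"))  # True
```

In this model the first two inputs never leave the escape, so the
paragraph after the script becomes script text.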

Personally, I'd like to avoid reparsing if at all possible, because  
it's a security risk and because it complicates the parser.

In case #1, I think the right fix is to introduce more statefulness  
into the escapes so that <!-- and --> that occur inside string  
literals are heuristically ignored. (Anyone care to suggest a  
heuristic that doesn't involve rolling a JS parser into the HTML  
parser?)
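
One possible shape for such a heuristic, sketched under loud
assumptions: track only whether the scanner is inside a '...' or "..."
literal, treat a backslash as escaping the next character, and close
any open literal at a newline. The function name
script_end_quote_aware is hypothetical, and this is an illustration of
the idea rather than a proposed spec change. It knows nothing about
JavaScript comments or regular expression literals, so it can be
fooled by them.

```python
def script_end_quote_aware(text: str) -> int:
    """Like a plain escape scan, but "<!--" and "-->" are ignored while
    inside a quoted string literal. Returns the index of "</script"
    that ends the script, or -1 if none is recognized before EOF.
    """
    lower = text.lower()
    escaped = False
    quote = ""  # the open quote character, or "" when outside a literal
    i = 0
    while i < len(text):
        ch = text[i]
        # An end tag outside an escape always ends the script,
        # even if it appears inside a string literal.
        if not escaped and lower.startswith("</script", i):
            return i
        if quote:
            if ch == "\\":
                i += 2  # skip the backslash and the character it escapes
                continue
            if ch == quote or ch == "\n":
                quote = ""  # literal closed (or abandoned at end of line)
        elif ch in "\"'":
            quote = ch
        elif not escaped and lower.startswith("<!--", i):
            escaped = True
        elif escaped and lower.startswith("-->", i):
            escaped = False
        i += 1
    return -1


# Case 1 now finds the real end tag, because "<!--" sits inside a literal.
case1 = 'var s = "<!--";</script><p>rest of page</p>'
print(script_end_quote_aware(case1) == case1.index("</script"))  # True

# Case 2 is unchanged: the escape is real and is never closed.
case2 = '<!--\nalert(1);\n</script><p>rest of page</p>'
print(script_end_quote_aware(case2))  # -1
```

Resetting the quote state at a newline is a guard against a stray
apostrophe (for example in a // comment) swallowing the rest of the
script.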

For case #2, I can't think of a fix that doesn't involve reparsing.  
Personally, I'd just leave it as WONTFIX and position the change from  
previous browser behavior as a security improvement. (To my great  
surprise, there haven't been reports of this issue with actual  
comments--only with escapes inside inline scripts.)

For case #3, I'd personally like to treat it as WONTFIX, because IE6  
and IE8 both seem to do less recovery here than Gecko and WebKit.  
Therefore, pages that lack </title> are probably already broken in IE,  
so it's unlikely that such pages are common enough to be a big deal on  
the Web scale. (IE seems to recover sometimes but only rarely. I can't  
figure out what the recovery rule is.)

Any thoughts on what the right way to deal with these is?

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Monday, 10 August 2009 11:15:32 UTC