[whatwg] Comment parsing

Included below are some e-mails regarding how to parse comments. They 
point out inconsistencies between browsers and the spec. These 
inconsistencies were known when the spec was written. Browsers aren't 
consistent with each other either. I'd rather leave the parser spec stable 
here for a while to see if we can converge on what it describes (as far as 
I can tell, it represents a good compromise along the axes of compatibility, 
security, implementation ease, and maintainability).

If browsers, when they implement HTML5, find that they cannot get good 
enough compatibility with the current spec text, then we should change the 
spec at that point.

On Thu, 26 Jun 2008, Adam Barth wrote:
>
> Recently, I've been testing how browser parsers handle unterminated <!-- 
> comments -->.  Internet Explorer 7, Firefox 3, Safari 3.1, and Opera 9.5 
> agree on the following cases:
> 
> http://crypto.stanford.edu/~abarth/research/html5/comments/open-textarea.html 
> http://crypto.stanford.edu/~abarth/research/html5/comments/open-script.html 
> http://crypto.stanford.edu/~abarth/research/html5/comments/open-style.html
> 
> Essentially, they treat the <!-- as if it did not start a comment. Ian 
> pointed out on IRC that this might be a security vulnerability because 
> the result of parsing the stream depends on whether the parser hung or 
> terminated at the end of the stream.  (If the parser had hung, it would 
> be awaiting more characters for the comment.)
> 
> The above browsers almost agree on the behavior for <title>:
> 
> http://crypto.stanford.edu/~abarth/research/html5/comments/open-title.html
> 
> Internet Explorer 7, Firefox 3, and Opera 9.5 treat <!-- as if it 
> did not start a comment.  Safari 3.1 differs slightly and only uses the 
> portion before the <!-- as the title, but otherwise parses the remainder 
> of the document as if <!-- did not start a comment.
> 
> The above browsers differ in their handling of unterminated comments for 
> the <iframe> element:
> 
> http://crypto.stanford.edu/~abarth/research/html5/comments/open-iframe.html
> 
> Internet Explorer 7 and Safari 3.1 follow the spec and consume the 
> remainder of the document in the comment.  Firefox 3 and Opera 9.5 treat 
> <!-- as if it did not start a comment.
> 
> As I understand it, browser behavior for <textarea>, <script>, <style>, 
> and <title> differs from the spec.  It is unclear whether browsers will 
> change to match the spec, especially because the <script> element might 
> contain <!-- sequences in string literals or regular expressions (e.g., 
> <http://crypto.stanford.edu/~abarth/research/html5/comments/open-script-in-string.html>).
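
To make the two end-of-stream outcomes described above concrete, here is a 
rough Python sketch. It is a toy model, not the spec's tokenizer and not any 
browser's code: the function name scan_comment and the eof_policy parameter 
are made up for illustration, and it ignores element context (<textarea>, 
<script>, and so on) as well as short forms such as <!-->. The "spec" policy 
consumes the rest of the input as the comment; the "text" policy treats 
<!-- as if it did not start a comment.

```python
def scan_comment(text: str, start: int, eof_policy: str = "spec"):
    """Scan a '<!--' found at index `start`.

    Returns (kind, content, next_index), where kind is "comment" or "text".
    Hypothetical helper for illustration only.
    """
    assert text.startswith("<!--", start)
    body_start = start + 4
    end = text.find("-->", body_start)
    if end != -1:
        # Terminated comment: both policies agree.
        return ("comment", text[body_start:end], end + 3)
    if eof_policy == "spec":
        # Unterminated: everything up to end of input is the comment.
        return ("comment", text[body_start:], len(text))
    # Unterminated, "text" policy: emit '<!--' as literal text and
    # resume ordinary parsing right after it.
    return ("text", "<!--", body_start)


print(scan_comment("x<!-- open", 1, "spec"))
print(scan_comment("x<!-- open", 1, "text"))
```

Note that the "text" policy can only be chosen once the end of the stream is 
known, which is the source of the concern about the result depending on 
whether the parser hung or terminated.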

On Thu, 26 Jun 2008, Adam Barth wrote:
>
> Internet Explorer 7, Firefox 3, Safari 3.1, and Opera 9.5 accept --!> as 
> an alternate comment terminator to the usual -->
> 
> http://crypto.stanford.edu/~abarth/research/html5/comments/strange-ending.html
> 
> In Internet Explorer 7 and Opera 9.5, if the document later contains the 
> usual comment terminator, then that character sequence terminates the 
> comment instead:
> 
> http://crypto.stanford.edu/~abarth/research/html5/comments/strange-ending-with-real-ending.html 
> http://crypto.stanford.edu/~abarth/research/html5/comments/strange-ending-with-later-comment.html
> 
> Firefox 3 and Safari 3.1 do not appear to exhibit this behavior.
> 
> (Interestingly, the syntax highlighter in vim suggests the document will 
> be parsed as in Firefox and Safari, no doubt contributing to author 
> confusion.)

On Fri, 27 Jun 2008, Adam Barth wrote:
>
> Ian explained to me on IRC that IE and Opera are consuming the entire 
> document as a comment and reparsing for > (i.e., --!> is not treated 
> specially).  That is supported by the following test case:
> 
> http://crypto.stanford.edu/~abarth/research/html5/comments/bang-gt.html
> 
> Safari and Firefox contain explicit code for detecting --!> (as 
> demonstrated by the above test case).  In Safari, the code was 
> introduced in
> 
> http://trac.webkit.org/changeset/4103
> 
> In Firefox, the code was introduced in
> 
> https://bugzilla.mozilla.org/show_bug.cgi?id=110544
> 
> As far as I can tell, neither checkin explains why this behavior was 
> added.
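
The two families of behavior described in this part of the thread can be 
modeled roughly as follows. This is only a sketch under assumptions: the 
function names are invented, the "reparse" model assumes the reparse picks 
the first ">" after the comment opener, and neither function is taken from 
any browser's source. Each function returns the index at which parsing would 
resume after the comment, or None if nothing closes it.

```python
def resume_explicit_bang(text: str, body_start: int):
    """Model of explicit '--!>' detection (the Firefox 3 / Safari 3.1
    behavior as described): whichever of '-->' or '--!>' comes first
    closes the comment."""
    hits = []
    for term in ("-->", "--!>"):
        i = text.find(term, body_start)
        if i != -1:
            hits.append((i, len(term)))
    if not hits:
        return None
    i, n = min(hits)  # earliest terminator wins
    return i + n


def resume_reparse_gt(text: str, body_start: int):
    """Model of the IE 7 / Opera 9.5 behavior as described: a '-->'
    anywhere later in the document closes the comment; only if there
    is none is the input reparsed for a plain '>'."""
    i = text.find("-->", body_start)
    if i != -1:
        return i + 3
    j = text.find(">", body_start)  # assumption: first '>' is used
    return None if j == -1 else j + 1


doc = "<!-- a --!> b <!-- c --> d"
print(resume_explicit_bang(doc, 4))  # resumes after '--!>'
print(resume_reparse_gt(doc, 4))     # resumes after the later '-->'
```

With a later "-->" in the document, the second model swallows everything 
between the two comments, which matches the difference the test cases above 
are meant to show.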

On Fri, 27 Jun 2008, Maciej Stachowiak wrote:
> 
> Hyatt's comment on the WebKit checkin says it was to match other 
> browsers (presumably Mozilla).

On Fri, 27 Jun 2008, Adam Barth wrote:
>
> It looks like Mozilla is planning to change their behavior to match the 
> HTML5 spec in this regard.  See the patch in 
> <https://bugzilla.mozilla.org/show_bug.cgi?id=214476>.

On Tue, 15 Jul 2008, Jim Jewett wrote:
> 
> That's too bad; I would rather that the spec supported "--!>" while 
> parsing (though not for authoring).
> 
> *I* see it mostly on fairly old pages -- generally in archives, or other 
> places where the original author cannot make a change.
> 
> I notice these pages because I remember a time (err, not this decade) 
> when I wrote most of my own comments that way, because it was 
> recommended by about half the tutorials, it worked on the browsers I 
> could check with (lynx, and I think Mosaic and early netscape) -- and it 
> seemed more consistent because of the symmetry.  (It also allowed the 
> use of "-->" for arrow, but I don't see a good way to compatibly support 
> that.)
> 
> Having a later "-->" turn "--!>" recognition off seems to silently break 
> a fair portion of these older pages, because that is often from a later 
> comment, so that a middle portion of the document is lost.
> 
> Letting any ">" end the comment may or may not be better still.  I do 
> remember that Opera found that strictly enforcing the SGML requirements 
> was a loss, though I don't remember the details. (Something like 
> counting parity on double-hyphens.)

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Monday, 1 September 2008 21:04:07 UTC