Re: </html> followed by spaces (detailed review of Writing HTML documents) from Henri Sivonen on 2007-09-06 (public-html@w3.org from September 2007)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Thu, 6 Sep 2007 15:05:56 +0300
To: Simon Pieters <simonp@opera.com>
Cc: public-html <public-html@w3.org>
Message-Id: <A39090BD-661A-4970-A6CB-5825A714F118@iki.fi>

On Sep 1, 2007, at 18:47, Simon Pieters wrote:

> (This is part of my detailed review of the Writing HTML documents  
> section.)
>
> The spec says about optional tags:
>
>    An html element's end tag may be omitted if the html element is not
>    immediately followed by a space character or a comment.
>
> However, since spaces after the </html> tag will be inserted into  
> the html element by the parser anyway, the </html> tag might well  
> be allowed to be omitted when it is followed by space characters  
> (but not when those in turn are followed by a comment).

Is this something that must happen for legacy compat, or should we,  
for now, treat the handling of space characters after </body> and  
</html> as bugs in the parsing spec? After all, the spec is rather  
broken when it comes to handling the end of the document. Both  
html5lib and the Validator.nu HTML parser take liberties to do what  
seems to be the right thing with EOF handling where the spec fails.

Echoing my comments on foster parenting text[1], I suggest making  
character token coalescing an explicit part of the tree construction  
algorithm so that the coalescing buffer could be flushed as a single  
text node insertion as a side effect of any other tree mutation.  
This would enable decisions based on the contents of a coalesced run  
of text as opposed to individual characters (or UTF-16 code units,  
rather).
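
To illustrate the coalescing idea, here is a minimal Python sketch. It
is not taken from the spec, html5lib, or the Validator.nu parser; the
class name, method names, and the mutation log are all hypothetical.
Character tokens are only buffered, and any other tree mutation first
flushes the buffer as one text node.

```python
class TreeBuilder:
    """Toy tree builder that coalesces character tokens (hypothetical API)."""

    def __init__(self):
        self.mutations = []  # ordered log standing in for real tree mutations
        self._pending = []   # coalescing buffer of character tokens

    def character(self, ch):
        # Character tokens are buffered, never inserted one at a time.
        self._pending.append(ch)

    def _flush_text(self):
        # Insert the whole buffered run as a single text node, if any.
        if self._pending:
            self.mutations.append(("text", "".join(self._pending)))
            self._pending = []

    def start_tag(self, name):
        self._flush_text()
        self.mutations.append(("start", name))

    def end_tag(self, name):
        self._flush_text()
        self.mutations.append(("end", name))

    def comment(self, data):
        self._flush_text()
        self.mutations.append(("comment", data))

    def eof(self):
        # End of file also counts as a flush point.
        self._flush_text()
```

Because the flush sees the complete run, a decision about where to put
the text can look at the whole string rather than at one character.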

Furthermore, when such a run consists entirely of space characters  
and appears after </body> but before </html>, I suggest appending the  
text node to the root element. And when the entire run consists of  
space characters after </html>, I suggest discarding the run (for XML  
infoset compat). If there are non-space characters in the run, the  
entire run would be appended to the body.
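
The placement rules above can be sketched as a single decision on a
whole coalesced run. This is only an illustration under assumptions:
the function name, the boolean flag, and the string return values are
hypothetical, and the set of space characters below is an assumption
that should be checked against the spec's own definition.

```python
# Assumed set of space characters; verify against the spec's definition.
SPACE_CHARACTERS = frozenset(" \t\n\f\r")


def place_trailing_run(run, after_html_end_tag):
    """Decide where a coalesced text run seen after </body> should go.

    Returns "body", "root", or "discard" (hypothetical labels).
    """
    if any(ch not in SPACE_CHARACTERS for ch in run):
        # Any non-space character sends the entire run to the body.
        return "body"
    if after_html_end_tag:
        # Space-only run after </html>: dropped for XML infoset compat.
        return "discard"
    # Space-only run after </body> but before </html>: goes on the root.
    return "root"
```

For example, place_trailing_run("\n  ", False) gives "root", while the
same run with after_html_end_tag set to True gives "discard".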

I'm not sure if, for compat with actual Web content, runs of text  
between comments can be considered independently or whether it is  
necessary to consider each interleaved run of text and comments as a  
unit. (In some cases, WebKit seems to get away with doing the former  
but Gecko does the latter, IIRC.)

[1] http://www.w3.org/mid/EFDC5142-B2E6-4D0F-B4B4-ED9FD03CC136@iki.fi
-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Thursday, 6 September 2007 12:06:17 UTC