- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Thu, 6 Sep 2007 15:05:56 +0300
- To: Simon Pieters <simonp@opera.com>
- Cc: public-html <public-html@w3.org>
On Sep 1, 2007, at 18:47, Simon Pieters wrote:

> (This is part of my detailed review of the Writing HTML documents
> section.)
>
> The spec says about optional tags:
>
> An html element's end tag may be omitted if the html element is not
> immediately followed by a space character or a comment.
>
> However, since spaces after the </html> tag will be inserted into
> the html element by the parser anyway, the </html> tag might well
> be allowed to be omitted when it is followed by space characters
> (but not when those in turn are followed by a comment).

Is this something that must happen for legacy compat, or should we, for now, treat the handling of space characters after </body> and </html> as bugs in the parsing spec?

After all, the spec is rather broken when it comes to handling the end of the document. Both html5lib and the Validator.nu HTML parser take liberties to do what seems to be the right thing with EOF handling where the spec fails.

Echoing my comments on foster parenting text[1], I suggest making character token coalescing an explicit part of the tree construction algorithm so that the coalescing buffer could be flushed as a single text node insertion as a side effect of any other tree mutation. This would enable decisions based on the contents of a coalesced run of text as opposed to individual characters (or UTF-16 code units, rather).

Furthermore, when such a run consists entirely of space characters and appears after </body> but before </html>, I suggest appending the text node to the root element. And when the entire run consists of space characters after </html>, I suggest discarding the run (for XML infoset compat). If there are non-space characters in the run, the entire run would be appended to the body.

I'm not sure if, for compat with actual Web content, runs of text between comments can be considered independently or whether it is necessary to consider each interleaved run of text and comments as a unit. (In some cases, WebKit seems to get away with doing the former but Gecko does the latter, IIRC.)

[1] http://www.w3.org/mid/EFDC5142-B2E6-4D0F-B4B4-ED9FD03CC136@iki.fi

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
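
The coalescing proposal above could be sketched roughly as follows. This is only an illustration in Python, not code taken from html5lib or the Validator.nu HTML parser. The class name, method names, and phase labels are hypothetical, the set of space characters is an assumption, and comments and nested elements are left out entirely.

```python
from dataclasses import dataclass, field

# Assumption: the "space characters" are tab, LF, FF, CR and space.
SPACE_CHARACTERS = frozenset(" \t\n\f\r")


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)  # Node objects or str text nodes


class CoalescingTreeBuilder:
    """Hypothetical tree builder that buffers character tokens into runs."""

    def __init__(self):
        self.body = Node("body")
        self.html = Node("html", [Node("head"), self.body])
        self.phase = "in body"  # then "after body", then "after html"
        self._buffer = []

    def character(self, ch):
        # Character tokens are only buffered; nothing is inserted yet.
        self._buffer.append(ch)

    def flush(self):
        # Insert the buffered run as one text node, deciding on the whole run.
        if not self._buffer:
            return
        run = "".join(self._buffer)
        self._buffer.clear()
        all_space = all(c in SPACE_CHARACTERS for c in run)
        if self.phase == "in body" or not all_space:
            self.body.children.append(run)   # any non-space run goes to the body
        elif self.phase == "after body":
            self.html.children.append(run)   # space-only run after </body>
        # else: space-only run after </html> is discarded

    def end_tag(self, name):
        # Flush first, so the run is classified by the phase it was seen in.
        self.flush()
        if name == "body" and self.phase == "in body":
            self.phase = "after body"
        elif name == "html":
            self.phase = "after html"

    def eof(self):
        self.flush()


builder = CoalescingTreeBuilder()
for ch in "hi":
    builder.character(ch)
builder.end_tag("body")
for ch in "\n ":
    builder.character(ch)
builder.end_tag("html")
builder.character("\n")
builder.eof()
```

In this run the body ends up with the single text node "hi", the root element gets the space-only run seen between </body> and </html>, and the trailing newline after </html> is dropped. A real parser would also have to flush on every other tree mutation and settle the open question about text interleaved with comments.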
Received on Thursday, 6 September 2007 12:06:17 UTC