- From: Ian Hickson <ian@hixie.ch>
- Date: Thu, 7 Jun 2007 00:46:00 +0000 (UTC)
On Sun, 19 Mar 2006, Henri Sivonen wrote:
>
> Since U+0000 has no legitimate reason to be there just to get dropped,
> is any encounter of U+0000 a parse error?

Yes. Fixed.

> The way the spec is written, U+000D does not occur in the character
> stream immediately before tokenization, but (as in XML!) it *can* appear
> in the tree construction stage, because an NCR can expand into U+000D.
> (I'm not suggesting any changes here--just noting how it is.)

Indeed.

> Since U+000D can occur in the tree construction stage, I think the point
> under "8.2.2.3.7. How to handle tokens in the main phase" that says "A
> character token that is one of one of U+0009 CHARACTER TABULATION,
> U+000A LINE FEED (LF), U+000B LINE TABULATION, U+000C FORM FEED (FF), or
> U+0020 SPACE" should include U+000D as well.

Good point. Fixed.

> On the other hand, I am wondering why the list of characters that
> implements the concept of whitespace in the tokenization and tree
> construction stages includes U+000B LINE TABULATION and U+000C FORM FEED
> (FF). Are they required for backwards-compatibility? I would guess they
> do not actually show up on the Web that often. According to the W3C
> Validator, those characters do not need to be allowed for formal
> backwards compatibility with HTML4--on the contrary.
> http://validator.w3.org/check?uri=http%3A%2F%2Fhsivonen.iki.fi%2Ftest%2Fform-feed-in-tag.html
> http://validator.w3.org/check?uri=http%3A%2F%2Fhsivonen.iki.fi%2Ftest%2Fline-tabulation-in-tag.html

I don't have an opinion about U+000B. What would you want changed?

U+000C is allowed because converting text files to HTML can easily end up
inserting FF characters. (E.g. RFCs have FF characters, and conversion to
HTML often leaves them.) I see no harm in allowing them, really.

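To make the whitespace list above concrete, here is a minimal sketch of the
check with U+000D added, as agreed. The names TREE_CONSTRUCTION_WHITESPACE
and is_whitespace_token are hypothetical and chosen for illustration; they
are not spec text or any parser's actual API.

```python
# Illustrative only: the set of characters treated as whitespace in the
# tree construction stage, per the list quoted above plus U+000D.
TREE_CONSTRUCTION_WHITESPACE = frozenset({
    "\u0009",  # CHARACTER TABULATION
    "\u000A",  # LINE FEED (LF)
    "\u000B",  # LINE TABULATION
    "\u000C",  # FORM FEED (FF)
    "\u000D",  # CARRIAGE RETURN (CR); can appear when an NCR expands to it
    "\u0020",  # SPACE
})


def is_whitespace_token(ch: str) -> bool:
    """Return True if a single-character token counts as whitespace."""
    return len(ch) == 1 and ch in TREE_CONSTRUCTION_WHITESPACE
```

U+0000 is deliberately absent from the set: it is handled earlier as a
parse error, not as whitespace.
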
> In order to make all conforming HTML5 documents serializable as XHTML5,
> it would be necessary to have a catch-all restriction stating that a
> document is non-conforming if parsing it causes a non-XML character (
> http://www.w3.org/TR/REC-xml/#NT-Char ) to appear in the DOM. For
> clarity, it would be nice to have the same restriction on the pre-parse
> character stream, but such a restriction is not strictly necessary for
> XHTML-serializability.

I don't really think we can guarantee that all conforming HTML5 documents
be serializable as XHTML5 anyway. I'm reluctant to add catch-all clauses,
because they tend to have unexpected consequences.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
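
For reference, the "non-XML character" restriction Henri describes could be
tested roughly as follows. This is a sketch assuming the XML 1.0 Char
production (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF]); the function names are hypothetical and not taken from
any spec or library.

```python
def is_xml_char(ch: str) -> bool:
    """Return True if ch matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def first_non_xml_char(text: str):
    """Return the index of the first non-XML character, or None if all pass."""
    for index, ch in enumerate(text):
        if not is_xml_char(ch):
            return index
    return None
```

Note that U+000B and U+000C both fail this check, so text containing a form
feed could be accepted as HTML whitespace and still not be representable in
XML.
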
Received on Wednesday, 6 June 2007 17:46:00 UTC