- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Sun, 19 Mar 2006 18:29:25 +0200
On Mar 11, 2006, at 03:21, Ian Hickson wrote:

> On Sat, 25 Feb 2006, Henri Sivonen wrote:
>>
>> On Feb 25, 2006, at 02:02, Ian Hickson wrote:
>>
>>> On Sat, 23 Jul 2005, Henri Sivonen wrote:
>>>>
>>>> Which characters should a text/html HTML5 conformance checker
>>>> consider forbidden? The same characters that are forbidden in
>>>> XML 1.0 (\0, FF, etc.)? Or some other set?
>>>
>>> In what context?
>>
>> In the pre-parse Unicode character stream on one hand and in the
>> post-parse (that is NCRs expanded) character data and attribute
>> values on the other. IIRC, in XML 1.0 (but not 1.1) the restrictions
>> are the same in both cases.
>
> Well, the spec says to drop U+0000, and do something with U+000D such
> that U+000D never appears in the parse stream; the post-parse is just
> the DOM.
>
> Does that answer your question?

Sorry, still going on about this:

Since U+0000 has no legitimate reason to be there just to get dropped, is any encounter of U+0000 a parse error?

The way the spec is written, U+000D does not occur in the character stream immediately before tokenization, but (as in XML!) it *can* appear in the tree construction stage, because an NCR can expand into U+000D. (I'm not suggesting any changes here--just noting how it is.)

Since U+000D can occur in the tree construction stage, I think the point under "8.2.2.3.7. How to handle tokens in the main phase" that says "A character token that is one of one of U+0009 CHARACTER TABULATION, U+000A LINE FEED (LF), U+000B LINE TABULATION, U+000C FORM FEED (FF), or U+0020 SPACE" should include U+000D as well.

On the other hand, I am wondering why the list of characters that implements the concept of whitespace in the tokenization and tree construction stages includes U+000B LINE TABULATION and U+000C FORM FEED (FF). Are they required for backwards-compatibility? I would guess they do not actually show up on the Web that often.
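
To make the U+000D point concrete, here is a minimal Python sketch. It is not taken from the spec: the function names are hypothetical, and the newline handling (CRLF and lone CR both become LF before tokenization) is an assumption about what "do something with U+000D" amounts to. It only illustrates that a numeric character reference expanded after that step can still put U+000D into the character data, where the whitespace list quoted above would not match it.

```python
import re

# Whitespace set as quoted above from "8.2.2.3.7" (no U+000D in it).
SPEC_WHITESPACE = {"\u0009", "\u000A", "\u000B", "\u000C", "\u0020"}

# The set with the addition suggested in this message.
PROPOSED_WHITESPACE = SPEC_WHITESPACE | {"\u000D"}

# Matches decimal (&#13;) and hexadecimal (&#x0D;) character references.
_NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def normalize_newlines(stream: str) -> str:
    """Assumed pre-tokenization step: CRLF and lone CR become LF."""
    return stream.replace("\r\n", "\n").replace("\r", "\n")


def expand_ncrs(text: str) -> str:
    """Expand numeric character references (runs after normalization)."""
    def repl(match: re.Match) -> str:
        hex_digits, dec_digits = match.group(1), match.group(2)
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if code_point > 0x10FFFF:
            # Out of Unicode range: leave the reference untouched.
            return match.group(0)
        return chr(code_point)
    return _NCR.sub(repl, text)


def is_whitespace_token(ch: str, whitespace=SPEC_WHITESPACE) -> bool:
    """True if a one-character token counts as whitespace in the given set."""
    return ch in whitespace
```

With these definitions, `expand_ncrs(normalize_newlines("a\r\nb&#13;c"))` contains a U+000D that came from the reference, not from the literal line break; `is_whitespace_token` rejects it under the quoted list and accepts it under the proposed one.
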
According to the W3C Validator, those characters do not need to be allowed for formal backwards compatibility with HTML4--on the contrary.

http://validator.w3.org/check?uri=http%3A%2F%2Fhsivonen.iki.fi%2Ftest%2Fform-feed-in-tag.html
http://validator.w3.org/check?uri=http%3A%2F%2Fhsivonen.iki.fi%2Ftest%2Fline-tabulation-in-tag.html

In order to make all conforming HTML5 documents serializable as XHTML5, it would be necessary to have a catch-all restriction stating that a document is non-conforming if parsing it causes a non-XML character ( http://www.w3.org/TR/REC-xml/#NT-Char ) to appear in the DOM. For clarity, it would be nice to have the same restriction on the pre-parse character stream, but such a restriction is not strictly necessary for XHTML-serializability.

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/
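
P.S. A minimal sketch of what the catch-all check could look like in a conformance checker, assuming the ranges of the XML 1.0 Char production (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]). The function names are hypothetical, and applying the check to each text node and attribute value in the DOM is one possible design, not something the spec currently says.

```python
from typing import Optional, Tuple


def is_xml_char(ch: str) -> bool:
    """True if ch is allowed by the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def first_non_xml_char(text: str) -> Optional[Tuple[int, str]]:
    """Return (index, character) of the first non-XML character, or None.

    Intended to run on post-parse character data and attribute values.
    """
    for index, ch in enumerate(text):
        if not is_xml_char(ch):
            return index, ch
    return None
```

Under this check U+000D passes, while U+000B and U+000C (the two characters questioned above) are reported, so a document whose DOM contains them would not be serializable as XHTML5.
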
Received on Sunday, 19 March 2006 08:29:25 UTC