[whatwg] Forbidden characters in text/html from Henri Sivonen on 2006-03-19 (public-whatwg-archive@w3.org from March 2006)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Sun, 19 Mar 2006 18:29:25 +0200
Message-ID: <9F9F83F5-3B50-40F8-BB2A-CE9A7C3C6400@iki.fi>
On Mar 11, 2006, at 03:21, Ian Hickson wrote:

> On Sat, 25 Feb 2006, Henri Sivonen wrote:
>>
>> On Feb 25, 2006, at 02:02, Ian Hickson wrote:
>>
>>> On Sat, 23 Jul 2005, Henri Sivonen wrote:
>>>>
>>>> Which characters should a text/html HTML5 conformance checker consider
>>>> forbidden? The same characters that are forbidden in XML 1.0 (\0, FF,
>>>> etc.)? Or some other set?
>>>
>>> In what context?
>>
>> In the pre-parse Unicode character stream on one hand and in the
>> post-parse (that is NCRs expanded) character data and attribute values
>> on the other. IIRC, in XML 1.0 (but not 1.1) the restrictions are the
>> same in both cases.
>
> Well, the spec says to drop U+0000, and do something with U+000D such that
> U+000D never appears in the parse stream; the post-parse is just the DOM.
>
> Does that answer your question?

Sorry, still going on about this:

Since U+0000 has no legitimate reason to be there just to get  
dropped, is any encounter of U+0000 a parse error?

The way the spec is written, U+000D does not occur in the character  
stream immediately before tokenization, but (as in XML!) it *can*  
appear in the tree construction stage, because an NCR can expand into  
U+000D. (I'm not suggesting any changes here--just noting how it is.)
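
A minimal sketch of that point, not the spec's algorithm. It assumes the
input-stream stage drops U+0000 and folds CR and CRLF into LF (the exact
handling of U+000D is an assumption here), and it uses a deliberately
simplified NCR expander. The names preprocess and expand_ncrs are
hypothetical.

```python
import re

def preprocess(stream: str) -> str:
    """Assumed input-stream stage: drop U+0000, fold CRLF and CR into LF."""
    stream = stream.replace("\x00", "")
    return stream.replace("\r\n", "\n").replace("\r", "\n")

# Simplified: only well-formed decimal or hex NCRs terminated by ';'.
_NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")

def expand_ncrs(text: str) -> str:
    """Expand numeric character references, as the tokenizer would later."""
    def repl(m: re.Match) -> str:
        cp = int(m.group(1), 16) if m.group(1) else int(m.group(2))
        # Leave out-of-range references untouched in this sketch.
        return chr(cp) if cp <= 0x10FFFF else m.group(0)
    return _NCR.sub(repl, text)

source = "a\r\nb\rc&#13;d&#xD;e"
before_tokenization = preprocess(source)     # contains no U+000D
after_expansion = expand_ncrs(before_tokenization)  # U+000D is back
```

Under these assumptions the literal CRs are gone before tokenization, yet
both &#13; and &#xD; put U+000D into the data seen by tree construction.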

Since U+000D can occur in the tree construction stage, I think the  
point under "8.2.2.3.7. How to handle tokens in the main phase" that  
says "A character token that is one of U+0009 CHARACTER  
TABULATION, U+000A LINE FEED (LF), U+000B LINE TABULATION, U+000C  
FORM FEED (FF), or U+0020 SPACE" should include U+000D as well.
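
As an illustration only, the difference between the quoted list and the
suggested one can be sketched as two lookup sets. The names below are
hypothetical and nothing here is taken from the spec beyond the five
characters quoted above.

```python
# The five characters in the quoted spec text.
SPEC_WHITESPACE = frozenset("\t\n\x0b\x0c ")

# The same set with U+000D added, as suggested in this message.
PROPOSED_WHITESPACE = SPEC_WHITESPACE | {"\r"}

def is_whitespace_token(ch: str, table: frozenset = PROPOSED_WHITESPACE) -> bool:
    """Return True if a single character token counts as whitespace."""
    return ch in table
```

With the quoted list, a U+000D produced by an NCR would be treated as a
non-whitespace character token; with the extended list it would not.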

On the other hand, I am wondering why the list of characters that  
implements the concept of whitespace in the tokenization and tree  
construction stages includes U+000B LINE TABULATION and U+000C FORM  
FEED (FF). Are they required for backwards-compatibility? I would  
guess they do not actually show up on the Web that often. According  
to the W3C Validator, those characters do not need to be allowed for  
formal backwards compatibility with HTML4--on the contrary.
http://validator.w3.org/check?uri=http%3A%2F%2Fhsivonen.iki.fi%2Ftest%2Fform-feed-in-tag.html
http://validator.w3.org/check?uri=http%3A%2F%2Fhsivonen.iki.fi%2Ftest%2Fline-tabulation-in-tag.html

In order to make all conforming HTML5 documents serializable as  
XHTML5, it would be necessary to have a catch-all restriction stating  
that a document is non-conforming if parsing it causes a non-XML  
character ( http://www.w3.org/TR/REC-xml/#NT-Char ) to appear in the  
DOM. For clarity, it would be nice to have the same restriction on  
the pre-parse character stream, but such a restriction is not  
strictly necessary for XHTML-serializability.
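
A rough sketch of what such a catch-all check could look like in a
conformance checker. The ranges follow the Char production of XML 1.0
linked above; the function names and the idea of running the check over
each DOM text or attribute value string are assumptions for illustration.

```python
def is_xml_char(cp: int) -> bool:
    """True if the code point matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

def non_xml_chars(text: str) -> list[tuple[int, str]]:
    """List (index, 'U+XXXX') for every character XML 1.0 would reject."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if not is_xml_char(ord(ch))
    ]

# Hypothetical post-parse character data containing FF, VT and CR.
problems = non_xml_chars("a\x0cb\x0bc\r")
```

Note that U+000D passes this check while U+000B and U+000C do not, which
is why the two whitespace characters questioned above matter for
XHTML-serializability.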

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/
Received on Sunday, 19 March 2006 08:29:25 UTC