[whatwg] Control and Undefined Characters from Cameron Zemek on 2012-10-10 (public-whatwg-archive@w3.org from October 2012)

From: Cameron Zemek <grom@zeminvaders.net>
Date: Thu, 11 Oct 2012 08:43:58 +1000
To: whatwg@whatwg.org
Message-ID: <CAJnenoWun=k10MaD=s1At6=q5FhE3xf8NynCgs2MXJKH8Hwr3w@mail.gmail.com>

The spec states:
"Any occurrences of any characters in the ranges U+0001 to U+0008,
U+000E to U+001F, U+007F to U+009F, U+FDD0 to U+FDEF, and characters
U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE,
U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF,
U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE,
U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF,
U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and U+10FFFF are parse
errors. These are all control characters or permanently undefined
Unicode characters (noncharacters)."
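
To make the quoted set concrete, here is a minimal Python sketch of a
membership test for it. The function names are my own and do not come
from the spec, and the sketch covers only the code points listed in the
passage above (U+0000 is not in that list, so it is not flagged here).

```python
def is_parse_error_code_point(cp: int) -> bool:
    """Return True if cp is in the set the quoted passage lists as parse errors.

    Hypothetical helper for illustration; not an API defined by the spec.
    """
    # Control-character ranges named in the quoted passage.
    if 0x0001 <= cp <= 0x0008 or 0x000E <= cp <= 0x001F:
        return True
    if 0x007F <= cp <= 0x009F:
        return True
    # Noncharacter block U+FDD0 to U+FDEF.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The single listed code point U+000B (vertical tab).
    if cp == 0x000B:
        return True
    # U+FFFE, U+FFFF, U+1FFFE, ..., U+10FFFF: the last two code points
    # of each of the 17 planes.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFF) in (0xFFFE, 0xFFFF)


def find_parse_error_chars(text: str) -> list[tuple[int, str]]:
    """List (index, 'U+XXXX') for every flagged character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if is_parse_error_code_point(ord(ch))
    ]
```

For example, find_parse_error_chars("a\x0bb\ufdd0") reports the
characters at index 1 and index 3, while tab, line feed, form feed and
carriage return are left alone because the passage does not list them.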

Additionally, character references for these code points also return
these Unicode characters. As far as I can tell, these characters are
therefore passed on to the tree construction stage, and I see no
handling of them in tree construction.

Elsewhere in the specification it says:
"Text nodes and attribute values must consist of Unicode characters,
must not contain U+0000 characters, must not contain permanently
undefined Unicode characters (noncharacters), and must not contain
control characters other than space characters."

Testing in Firefox and Chrome, it appears these characters are
ignored, but I cannot find anything in the specification that says to
ignore them or otherwise describes how to handle them.

Is this a bug in the specification?

Received on Wednesday, 10 October 2012 22:44:27 UTC