[whatwg] Parse errors for invalid characters from Geoffrey Sneddon on 2013-09-05 (public-whatwg-archive@w3.org from September 2013)

From: Geoffrey Sneddon <foolistbar@googlemail.com>
Date: Thu, 05 Sep 2013 15:08:04 -0700
To: WHAT Working Group <whatwg@whatwg.org>
Message-ID: <522900C4.2050409@googlemail.com>

The phrasing content section states:

> Text nodes and attribute values must consist of Unicode characters, must not contain U+0000 characters, must not contain permanently undefined Unicode characters (noncharacters), and must not contain control characters other than space characters. This specification includes extra constraints on the exact value of Text nodes and attribute values depending on their precise context.

And the section on preprocessing the input stream states:

> Any occurrences of any characters in the ranges U+0001 to U+0008, U+000E to U+001F, U+007F to U+009F, U+FDD0 to U+FDEF, and characters U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and U+10FFFF are parse errors. These are all control characters or permanently undefined Unicode characters (noncharacters).
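As an illustration only, the list quoted above can be sketched as a predicate over code points. The function name is made up for this example and is not taken from the specification. Because the check mirrors the quoted text exactly, it does not flag surrogates (and U+0000 is not in this list either).

```python
def is_preprocessing_parse_error(cp: int) -> bool:
    """Return True if code point `cp` is in the quoted parse-error list.

    Hypothetical helper: a direct transcription of the quoted ranges,
    not an implementation taken from any parser.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Control characters: U+0001..U+0008, U+000B, U+000E..U+001F, U+007F..U+009F
    if 0x0001 <= cp <= 0x0008 or cp == 0x000B:
        return True
    if 0x000E <= cp <= 0x001F or 0x007F <= cp <= 0x009F:
        return True
    # Noncharacters U+FDD0..U+FDEF
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # Noncharacters U+FFFE, U+FFFF, and the last two code points of
    # every supplementary plane (U+1FFFE, U+1FFFF, ..., U+10FFFF)
    return (cp & 0xFFFE) == 0xFFFE


# Tab, LF, FF, and CR are not in the list; U+000B is.
print(is_preprocessing_parse_error(0x0B), is_preprocessing_parse_error(0x0A))
# A surrogate such as U+D800 is not covered by the quoted list.
print(is_preprocessing_parse_error(0xD800))
```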

Note that the first says "Unicode characters" and the second just
"characters". The former term means Unicode scalar values, so it
excludes surrogates as a conformance requirement.

Note that every disallowed non-surrogate character is a parse error.

Therefore, it would make sense to make surrogates parse errors.

Note that surrogates can only occur in the input stream if they come
from script: they cannot be decoded from the input byte stream, because
the decoders will never emit a surrogate.
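A rough sketch of that distinction, using Python's built-in UTF-8 decoder as a stand-in for the decoders mentioned above (an assumption made for illustration; a browser's decoders are a separate implementation). The helper name `lone_surrogates` is hypothetical.

```python
def lone_surrogates(text: str) -> list:
    """Return the indices of surrogate code points (U+D800..U+DFFF) in `text`."""
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]


# A string built by a program (the analogue of "from script") can carry
# a surrogate code point directly.
from_script = "a\ud800b"

# The bytes ED A0 80 are what a naive UTF-8 encoding of U+D800 would look
# like. The decoder treats them as invalid and substitutes U+FFFD rather
# than emitting a surrogate.
from_bytes = b"a\xed\xa0\x80b".decode("utf-8", errors="replace")

print(lone_surrogates(from_script))  # surrogate found at index 1
print(lone_surrogates(from_bytes))   # no surrogates after decoding
```

A check along these lines is the kind of thing the input stream preprocessing step would need if surrogates were made parse errors.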

/Geoffrey.

Received on Thursday, 5 September 2013 22:08:31 UTC