Re: [whatwg] Control and Undefined Characters

From: Ian Hickson <ian@hixie.ch>
Date: Wed, 10 Oct 2012 23:07:57 +0000 (UTC)
To: Cameron Zemek <grom@zeminvaders.net>
Message-ID: <Pine.LNX.4.64.1210102300000.1904@ps20323.dreamhostps.com>
Cc: whatwg@whatwg.org

On Thu, 11 Oct 2012, Cameron Zemek wrote:
>
> The spec states:
> "Any occurrences of any characters in the ranges U+0001 to U+0008,
> U+000E to U+001F, U+007F to U+009F, U+FDD0 to U+FDEF, and characters
> U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE,
> U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF,
> U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE,
> U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF,
> U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and U+10FFFF are parse
> errors. These are all control characters or permanently undefined
> Unicode characters (noncharacters)."
> 
> Additionally, character references for these code points also
> return these Unicode characters. As far as I can tell, these characters
> are therefore passed to the tree construction stage, and I see no
> handling of them in the tree construction.
> 
> Elsewhere in the specification it says:
> "Text nodes and attribute values must consist of Unicode characters,
> must not contain U+0000 characters, must not contain permanently
> undefined Unicode characters (noncharacters), and must not contain
> control characters other than space characters."

All these requirements relate to authoring conformance criteria and 
validators.

User agents are required to treat U+0001 the same as, say, "A".
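
To make the distinction concrete, here is a minimal sketch of the kind of 
check a validator (not a user agent) might run. The ranges are copied from 
the spec text quoted above; the function names are hypothetical and not 
taken from any real validator. A user agent would not drop or alter these 
characters; at most it notes the parse error and carries on.

```python
# Ranges copied from the quoted spec text. Helper names are illustrative only.
_FLAGGED_RANGES = (
    (0x0001, 0x0008),
    (0x000E, 0x001F),
    (0x007F, 0x009F),
    (0xFDD0, 0xFDEF),
)


def is_flagged_code_point(cp):
    """Return True if cp is in the quoted list of parse-error code points."""
    if cp == 0x000B:
        return True
    if any(lo <= cp <= hi for lo, hi in _FLAGGED_RANGES):
        return True
    # U+FFFE, U+FFFF, U+1FFFE, ..., U+10FFFF: the last two code points
    # of each of the 17 planes.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFF) in (0xFFFE, 0xFFFF)


def find_flagged(text):
    """List (index, 'U+XXXX') for every flagged character in text."""
    return [
        (i, "U+%04X" % ord(ch))
        for i, ch in enumerate(text)
        if is_flagged_code_point(ord(ch))
    ]
```

Note that this only reports positions. It leaves the string untouched, 
which matches the requirement that the characters reach the DOM as-is.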


> And testing in Firefox and Chrome, it appears these characters are 
> ignored. But I see no mention anywhere of ignoring them or of how to 
> handle them.

Do you have a test case demonstrating this? When I tested it, the 
characters did not seem to be ignored:

   http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=1824

(This test checks whether a U+0001 is lost in the JS parser, 
document.write(), the HTML tokeniser, the HTML parser, the DOM API, or the 
JS string API; it seems to get through all of those fine.)

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Wednesday, 10 October 2012 23:08:23 GMT
