- From: Cameron Zemek <grom@zeminvaders.net>
- Date: Tue, 9 Oct 2012 12:47:25 +1000
- To: whatwg@whatwg.org
I noticed that the specification usually handles null characters (U+0000) by replacing them with the replacement character U+FFFD. In the other cases the null character is ignored by the tree construction stage, when the mode is 'in body', 'in table text', or 'in select'.

Would it not be simpler and more consistent to have the Input Stream Preprocessor replace all null characters with the replacement character? I don't see the point of filling the specification with special handling of null characters just so that they can sometimes be ignored instead of being included as a replacement character. A character reference to null is already replaced with U+FFFD (i.e. &#0; becomes the replacement character). If the Input Stream Preprocessor converted them, the output would change very little, as I believe most HTML documents in the wild do not include null characters.

On a similar note, why are the other invalid Unicode characters, U+0001 to U+0008, U+000E to U+001F, U+007F to U+009F, U+FDD0 to U+FDEF, and characters U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and U+10FFFF, passed on as part of the input stream to the tokenizer and tree construction?
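
To make the proposal concrete, here is a minimal Python sketch of the idea. It is not taken from the specification: the names `preprocess` and `is_flagged` are hypothetical, and the input is assumed to be an already-decoded string. The first function does the single substitution suggested above, and the second tests whether a code point falls in the list of characters asked about.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def preprocess(stream: str) -> str:
    """Hypothetical preprocessor step: replace every U+0000 with U+FFFD
    before the tokenizer sees the stream."""
    return stream.replace("\x00", REPLACEMENT)


def is_flagged(cp: int) -> bool:
    """Return True for the code points listed in the question above."""
    return (
        0x0001 <= cp <= 0x0008
        or cp == 0x000B
        or 0x000E <= cp <= 0x001F
        or 0x007F <= cp <= 0x009F
        or 0xFDD0 <= cp <= 0xFDEF
        # Last two code points of every plane: U+FFFE, U+FFFF, U+1FFFE, ...
        or (cp & 0xFFFE) == 0xFFFE
    )
```

With this in place the tokenizer and tree construction stages would never see U+0000, so they would need no null-specific rules of their own. Whether that is acceptable for the modes that currently ignore nulls is exactly the open question.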
Received on Tuesday, 9 October 2012 02:47:55 UTC