- From: Simon Sapin <simon.sapin@kozea.fr>
- Date: Thu, 24 Jan 2013 21:34:24 +0100
- To: www-style list <www-style@w3.org>
Hi,

css3-syntax defines "non-ASCII character" as "A character with a codepoint equal to or greater than U+00A0 NO-BREAK SPACE." Could we change that to "… equal to or greater than U+0080"? In other words, the whole block of C1 control characters would be considered non-ASCII. I don’t think that more than a tiny amount of existing content would be affected, but I don’t have any data to support that. Maybe I’m a decade too late, I don’t know.

Why? First, it’s weird. In every definition that I can find, ASCII stops at 0x7F at most. I understand the feeling that control characters should not be part of identifiers, but then why exclude them and not Unicode’s many other non-characters, whitespace, or punctuation? Let’s not go there; handling non-ASCII uniformly is much simpler. This peculiar definition of non-ASCII does not seem to have a reason, other than being a remnant of CSS1, where the only non-ASCII characters are [¡-ÿ], the "printable" part of Latin-1.

Perhaps more importantly, this change would make some implementation strategies easier. If all non-ASCII characters are treated the same, a tokenizer could use UTF-8 bytes as its internal representation of text and work by looking only at individual bytes: sequences of non-ASCII codepoints map 1:1 to sequences of non-ASCII bytes in UTF-8. With "non-ASCII" starting at U+00A0, however, this is not so easy, because U+0080–U+009F have to be discriminated from other multi-byte UTF-8 sequences.

Thoughts? Is this too small a concern to make the change?

-- Simon Sapin
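[A minimal sketch of the byte-level argument above, in Python; the helper names are made up for illustration. It contrasts the proposed U+0080 cutoff, where one byte suffices, with the current U+00A0 cutoff, where a 0xC2 lead byte forces a look at the continuation byte.]

```python
def is_non_ascii_byte(b: int) -> bool:
    # Proposed definition (>= U+0080): in UTF-8, every byte of a
    # multi-byte sequence is >= 0x80, and every codepoint >= U+0080
    # encodes only such bytes, so one byte is enough to classify.
    return b >= 0x80

def starts_non_ascii_current(buf: bytes, i: int) -> bool:
    # Current css3-syntax definition (>= U+00A0): C1 controls
    # (U+0080-U+009F) are excluded, and they encode as 0xC2 0x80
    # through 0xC2 0x9F, so a 0xC2 lead byte needs a second look.
    b = buf[i]
    if b < 0x80:
        return False                 # ASCII
    if b == 0xC2:                    # leads U+0080..U+00BF
        return buf[i + 1] >= 0xA0    # only U+00A0..U+00BF qualify
    return True                      # all other sequences are >= U+00C0

# 'a', then NEL (a C1 control, U+0085), then 'é' (U+00E9)
utf8 = "a\u0085é".encode("utf-8")    # b'a\xc2\x85\xc3\xa9'

# Proposed: a uniform per-byte check.
print([is_non_ascii_byte(b) for b in utf8])
# Current: NEL is not "non-ASCII", so its sequence must be singled out.
print(starts_non_ascii_current(utf8, 1))  # NEL at byte offset 1
print(starts_non_ascii_current(utf8, 3))  # 'é' at byte offset 3
```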
Received on Thursday, 24 January 2013 20:35:16 UTC