- From: Tab Atkins Jr. <jackalmage@gmail.com>
- Date: Thu, 24 Jan 2013 13:00:47 -0800
- To: Simon Sapin <simon.sapin@kozea.fr>
- Cc: www-style list <www-style@w3.org>
On Thu, Jan 24, 2013 at 12:34 PM, Simon Sapin <simon.sapin@kozea.fr> wrote:
> css3-syntax defines "non-ASCII character" as "A character with a codepoint
> equal to or greater than U+00A0 NO-BREAK SPACE."
>
> Could we change that to "… equal to or greater than U+0080"? In other words,
> having the whole block of C1 control characters be considered non-ASCII. I
> don’t think that more than a tiny amount of existing content would be
> affected, but I don’t have any data to support that. Maybe I’m still a
> decade too late, I don’t know.
>
>
> Why? First, it’s weird. In any definition that I can find, ASCII stops at
> most at 0x7F. I understand the feeling that control characters should not be
> part of identifiers, but then why exclude them but not Unicode’s many other
> non-characters, whitespace or punctuation? Let’s not go there, handling
> non-ASCII uniformly is much simpler.
>
> This peculiar definition of non-ASCII does not seem to have a reason, other
> than being a remnant of CSS1, where the only non-ASCII characters are [¡-ÿ],
> the "printable" part of Latin-1.
>
>
> Perhaps more importantly, this change would make some implementation
> strategies easier. If all non-ASCII characters are treated the same, a
> tokenizer could use UTF-8 bytes as its internal representation of text and
> work by only looking at individual bytes. Sequences of non-ASCII codepoints
> map 1:1 to sequences of non-ASCII bytes in UTF-8.
>
> With "non-ASCII" starting at U+00A0 however this is not so easy, because
> U+0080~U+009F has to be discriminated from other multi-byte UTF-8 sequences.

As I stated on IRC, I suspect (but have no explicit evidence) that the
reason we start it at A0 is that 80-9F are non-printable characters, and
there's no reason to use them in a CSS value anyway. It's easy to just shift
the starting point, so we did so; further isolation of other groups of
Unicode non-printing chars is not worth the effort.
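
A minimal sketch of the byte-level point quoted above, assuming the input is valid UTF-8. The function names are made up for illustration and do not come from css3-syntax or any real tokenizer. With the U+0080 boundary, a run of non-ASCII characters is just a run of bytes >= 0x80; with the U+00A0 boundary, the scanner also has to recognize the two-byte encodings of U+0080 to U+009F (0xC2 followed by 0x80 to 0x9F) and stop there.

```python
def scan_non_ascii_from_0080(data: bytes, start: int) -> int:
    """Return the index just past the run of non-ASCII bytes at `start`.

    Boundary at U+0080: every byte >= 0x80 belongs to a non-ASCII
    character, so a single comparison per byte is enough.
    """
    i = start
    while i < len(data) and data[i] >= 0x80:
        i += 1
    return i


def scan_non_ascii_from_00a0(data: bytes, start: int) -> int:
    """Same scan, but with the boundary at U+00A0.

    U+0080..U+009F encode in UTF-8 as 0xC2 followed by 0x80..0x9F, so
    the scanner needs one byte of lookahead to exclude them.
    Assumes `data` is valid UTF-8 (0xC2 only ever appears as a lead byte).
    """
    i = start
    while i < len(data) and data[i] >= 0x80:
        if (
            data[i] == 0xC2
            and i + 1 < len(data)
            and 0x80 <= data[i + 1] <= 0x9F
        ):
            break  # C1 control character: not "non-ASCII" under this rule
        i += 1
    return i
```

For the bytes of "é", U+0085, "x" (C3 A9 C2 85 78), the first function returns 4 and the second returns 2, since only the second has to stop in front of the C1 control character.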

That said, I agree that it's almost certain that basically no content
would be affected by changing the definition to start at 80, so I don't
have a problem with doing so.

~TJ
Received on Thursday, 24 January 2013 21:01:36 UTC