[css3-syntax] Making U+0080 to U+009F "non-ASCII"?

From: Simon Sapin <simon.sapin@kozea.fr>
Date: Thu, 24 Jan 2013 21:34:24 +0100
Message-ID: <51019AD0.7040403@kozea.fr>
To: www-style list <www-style@w3.org>
Hi,

css3-syntax defines "non-ASCII character" as "A character with a 
codepoint equal to or greater than U+00A0 NO-BREAK SPACE."

Could we change that to "… equal to or greater than U+0080"? In other 
words, having the whole block of C1 control characters be considered 
non-ASCII. I don’t think that more than a tiny amount of existing 
content would be affected, but I don’t have any data to support that. 
Maybe I’m still a decade too late, I don’t know.


Why? First, it’s weird. In any definition that I can find, ASCII stops 
at most at 0x7F. I understand the feeling that control characters should 
not be part of identifiers, but then why exclude them but not Unicode’s 
many other non-characters, whitespace or punctuation? Let’s not go 
there, handling non-ASCII uniformly is much simpler.

This peculiar definition of non-ASCII does not seem to have a reason, 
other than being a remnant of CSS1, where the only non-ASCII characters 
are [¡-ÿ], the "printable" part of Latin-1.


Perhaps more importantly, this change would make some implementation 
strategies easier. If all non-ASCII characters are treated the same, a 
tokenizer could use UTF-8 bytes as its internal representation of text 
and work by only looking at individual bytes. Sequences of non-ASCII 
codepoints map 1:1 to sequences of non-ASCII bytes in UTF-8.
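
To illustrate the byte-oriented strategy, here is a minimal Python sketch. It assumes the input is already valid UTF-8 and that "non-ASCII" means U+0080 and above, as proposed; the name non_ascii_runs is hypothetical and not taken from any spec or implementation. Every byte of a multi-byte UTF-8 sequence is 0x80 or greater, so a single comparison per byte is enough to find runs of non-ASCII codepoints.

```python
def non_ascii_runs(data: bytes) -> list:
    """Split UTF-8 bytes into maximal runs of bytes >= 0x80.

    Assumes `data` is valid UTF-8. Lead and continuation bytes of
    multi-byte sequences are all >= 0x80, so no decoding is needed.
    """
    runs = []
    start = None
    for i, b in enumerate(data):
        if b >= 0x80:
            if start is None:
                start = i  # a run of non-ASCII bytes begins here
        elif start is not None:
            runs.append(data[start:i])  # an ASCII byte ends the run
            start = None
    if start is not None:
        runs.append(data[start:])
    return runs


# U+0085 is a C1 control; under the proposed definition it is
# treated like any other non-ASCII codepoint.
text = "a\u0085\u00e9 b\u4e2d"
decoded = [run.decode("utf-8") for run in non_ascii_runs(text.encode("utf-8"))]
```

Each byte run decodes to exactly the corresponding run of non-ASCII codepoints, which is the 1:1 mapping described above.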

With "non-ASCII" starting at U+00A0 however this is not so easy, because 
U+0080 to U+009F have to be distinguished from other multi-byte UTF-8 
sequences.
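
As a rough sketch of the extra work this implies, assuming valid UTF-8 input and a hypothetical helper name is_c1_control_at: in UTF-8 the codepoints U+0080 to U+009F are exactly the two-byte sequences 0xC2 0x80 through 0xC2 0x9F, so a byte-oriented tokenizer has to inspect a second byte whenever it sees 0xC2.

```python
def is_c1_control_at(data: bytes, i: int) -> bool:
    """True if a C1 control (U+0080 to U+009F) starts at data[i].

    Assumes `data` is valid UTF-8 and `i` is a valid index. These
    codepoints encode as the lead byte 0xC2 followed by a
    continuation byte in the range 0x80 to 0x9F.
    """
    return (
        data[i] == 0xC2
        and i + 1 < len(data)
        and 0x80 <= data[i + 1] <= 0x9F
    )


# U+009F is a C1 control; U+00A0 shares the 0xC2 lead byte but is not.
c1 = is_c1_control_at("\u009f".encode("utf-8"), 0)
nbsp = is_c1_control_at("\u00a0".encode("utf-8"), 0)
```

With "non-ASCII" starting at U+0080 instead, this two-byte lookahead would not be needed.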


Thoughts? Is this too small of a concern to make the change?
-- 
Simon Sapin
Received on Thursday, 24 January 2013 20:35:16 GMT
