[css3-syntax] Making U+0080 to U+009F "non-ASCII"? from Simon Sapin on 2013-01-24 (www-style@w3.org from January 2013)

From: Simon Sapin <simon.sapin@kozea.fr>
Date: Thu, 24 Jan 2013 21:34:24 +0100
To: www-style list <www-style@w3.org>
Message-ID: <51019AD0.7040403@kozea.fr>

Hi,

css3-syntax defines "non-ASCII character" as "A character with a 
codepoint equal to or greater than U+00A0 NO-BREAK SPACE."

Could we change that to "… equal to or greater than U+0080"? In other 
words, having the whole block of C1 control characters be considered 
non-ASCII. I don’t think that more than a tiny amount of existing 
content would be affected, but I don’t have any data to support that. 
Maybe I’m still a decade too late, I don’t know.


Why? First, it’s weird. In any definition that I can find, ASCII stops 
at 0x7F at the latest. I understand the feeling that control characters should 
not be part of identifiers, but then why exclude them but not Unicode’s 
many other non-characters, whitespace or punctuation? Let’s not go 
there, handling non-ASCII uniformly is much simpler.

This peculiar definition of non-ASCII does not seem to have a reason, 
other than being a remnant of CSS1, where the only non-ASCII characters 
are [¡-ÿ], the "printable" part of Latin-1.


Perhaps more importantly, this change would make some implementation 
strategies easier. If all non-ASCII characters are treated the same, a 
tokenizer could use UTF-8 bytes as its internal representation of text 
and work by only looking at individual bytes. Sequences of non-ASCII 
codepoints map 1:1 to sequences of non-ASCII bytes in UTF-8.
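
To illustrate, here is a minimal Python sketch of that byte-level scan, 
assuming the proposed U+0080 threshold and valid UTF-8 input. The 
function name is made up for this example and does not come from any 
spec or implementation.

```python
def non_ascii_run_length(data: bytes, start: int) -> int:
    """Return the length in bytes of the non-ASCII run beginning at `start`.

    Assumes `data` is valid UTF-8 and that "non-ASCII" means any codepoint
    >= U+0080 (the proposed definition, not the current one).
    """
    end = start
    # In UTF-8, every byte of a multi-byte sequence has its high bit set,
    # and ASCII bytes never do, so one comparison per byte is enough.
    while end < len(data) and data[end] >= 0x80:
        end += 1
    return end - start


# 'a', then three two-byte codepoints (including the C1 control U+0085), then 'b'.
sample = "a\u00e9\u0085\u00ffb".encode("utf-8")
run = non_ascii_run_length(sample, 1)
```

No decoding is needed here: the run ends at the first byte below 0x80, 
which is necessarily an ASCII character boundary.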

With "non-ASCII" starting at U+00A0, however, this is not so easy, because 
U+0080 through U+009F have to be distinguished from other multi-byte UTF-8 
sequences.
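
For comparison, a rough Python sketch of the extra check needed under 
the current U+00A0 threshold, again assuming valid UTF-8 input. In 
UTF-8, U+0080 to U+009F encode as the byte 0xC2 followed by a byte in 
0x80..0x9F, so the scanner has to look at byte pairs. Both function 
names are hypothetical.

```python
def starts_c1_control(data: bytes, pos: int) -> bool:
    """True if a UTF-8 encoded C1 control (U+0080..U+009F) starts at `pos`."""
    # 0xC2 is never a continuation byte, so this test is safe at any
    # position inside valid UTF-8.
    return (
        pos + 1 < len(data)
        and data[pos] == 0xC2
        and 0x80 <= data[pos + 1] <= 0x9F
    )


def non_ascii_run_length_a0(data: bytes, start: int) -> int:
    """Length in bytes of the run of codepoints >= U+00A0 beginning at `start`."""
    end = start
    while (
        end < len(data)
        and data[end] >= 0x80
        and not starts_c1_control(data, end)
    ):
        end += 1
    return end - start


with_c1 = "\u00e9\u0085\u00ff".encode("utf-8")    # run stops before U+0085
with_nbsp = "\u00e9\u00a0\u00ff".encode("utf-8")  # U+00A0 is part of the run
```

The single comparison per byte becomes a two-byte lookahead on every 
step, which is the discrimination described above.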


Thoughts? Is this too small of a concern to make the change?
-- 
Simon Sapin

Received on Thursday, 24 January 2013 20:35:16 UTC