Re: [css3-syntax] Making U+0080 to U+009F "non-ASCII"? from Tab Atkins Jr. on 2013-01-24 (www-style@w3.org from January 2013)

From: Tab Atkins Jr. <jackalmage@gmail.com>
Date: Thu, 24 Jan 2013 13:00:47 -0800
To: Simon Sapin <simon.sapin@kozea.fr>
Cc: www-style list <www-style@w3.org>
Message-ID: <CAAWBYDDbBwnndPmgHeJ6WJ03JqM-8Kk_z3h-g4R7aY_ozTgNkg@mail.gmail.com>

On Thu, Jan 24, 2013 at 12:34 PM, Simon Sapin <simon.sapin@kozea.fr> wrote:
> css3-syntax defines "non-ASCII character" as "A character with a codepoint
> equal to or greater than U+00A0 NO-BREAK SPACE."
>
> Could we change that to "… equal to or greater than U+0080"? In other words,
> having the whole block of C1 control characters be considered non-ASCII. I
> don’t think that more than a tiny amount of existing content would be
> affected, but I don’t have any data to support that. Maybe I’m still a
> decade too late, I don’t know.
>
>
> Why? First, it’s weird. In any definition that I can find, ASCII stops at
> 0x7F at most. I understand the feeling that control characters should not be
> part of identifiers, but then why exclude them but not Unicode’s many other
> non-characters, whitespace or punctuation? Let’s not go there; handling
> non-ASCII uniformly is much simpler.
>
> This peculiar definition of non-ASCII does not seem to have a reason, other
> than being a remnant of CSS1, where the only non-ASCII characters are [¡-ÿ],
> the "printable" part of Latin-1.
>
>
> Perhaps more importantly, this change would make some implementation
> strategies easier. If all non-ASCII characters are treated the same, a
> tokenizer could use UTF-8 bytes as its internal representation of text and
> work by only looking at individual bytes. Sequences of non-ASCII codepoints
> map 1:1 to sequences of non-ASCII bytes in UTF-8.
>
> With "non-ASCII" starting at U+00A0, however, this is not so easy, because
> U+0080 to U+009F have to be distinguished from other multi-byte UTF-8
> sequences.

As I said on IRC, I suspect (but have no explicit evidence) that the
reason we start it at U+00A0 is that U+0080 to U+009F are non-printable
characters, and there's no reason to use them in a CSS value anyway.
It was easy to just shift the starting point, so we did; isolating
other groups of Unicode non-printing characters is not worth the
effort.

That said, I agree that it's almost certain that basically no content
would be affected by changing the definition to start at U+0080, so I
don't have a problem with doing so.
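
For what it's worth, here is a rough sketch of the byte-level point
Simon makes above. It is only an illustration, not code from any real
tokenizer: the function names are made up, the input is assumed to be
valid UTF-8, and it only scans a run of "non-ASCII" bytes rather than
tokenizing anything. The one fact it relies on is that U+0080 to
U+009F encode in UTF-8 as the byte 0xC2 followed by 0x80 to 0x9F.

```python
def non_ascii_run_proposed(data: bytes, pos: int) -> int:
    """End index of the non-ASCII run starting at pos, if non-ASCII
    starts at U+0080: every byte with the high bit set qualifies."""
    end = pos
    while end < len(data) and data[end] >= 0x80:
        end += 1
    return end


def non_ascii_run_current(data: bytes, pos: int) -> int:
    """Same scan, but with non-ASCII starting at U+00A0.

    The lead byte 0xC2 now needs a look-ahead, because 0xC2 0x80..0x9F
    is a C1 control character and must end the run.
    """
    end = pos
    while end < len(data) and data[end] >= 0x80:
        if (
            data[end] == 0xC2
            and end + 1 < len(data)
            and 0x80 <= data[end + 1] <= 0x9F
        ):
            break  # U+0080..U+009F: not "non-ASCII" under this rule
        end += 1
    return end


# "é", then U+0085 (a C1 control), then "ü": bytes C3 A9 C2 85 C3 BC
sample = "\u00e9\u0085\u00fc".encode("utf-8")
proposed_end = non_ascii_run_proposed(sample, 0)
current_end = non_ascii_run_current(sample, 0)
```

With the U+0080 boundary the whole sample is one run; with the U+00A0
boundary the run has to stop before the 0xC2 0x85 pair, which is the
extra discrimination Simon is describing.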

~TJ

Received on Thursday, 24 January 2013 21:01:36 UTC