Re: [css3-syntax] Making U+0080 to U+009F "non-ASCII"?

On 25/01/2013 07:06, Glenn Adams wrote:
>
> On Thu, Jan 24, 2013 at 2:59 PM, Simon Sapin <simon.sapin@kozea.fr> wrote:
>
>     This would address the current definition being "wrong" but not what
>     I really want, which is being able to implement a conforming
>     tokenizer that, for efficiency, pretends that UTF-8 bytes are code
>     points.
>
>
> What do you mean by this exactly? UTF-8 bytes match Unicode code points
> only in the ASCII range (0x00 - 0x7F).

Yes, exactly! If all non-ASCII code points (including U+0080 to U+009F)
are treated the same in the tokenizer, an implementation can represent
the input as a list of UTF-8 bytes rather than a list of code points
(tracking the "current input byte" rather than the "current input
character") and obtain the same tokens. Every byte of a multi-byte UTF-8
sequence is 0x80 or above, so such sequences always end up together in
the value of, e.g., an ident token.
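
To make this concrete, here is a rough sketch in Python (not taken from
any real tokenizer). It assumes a much simplified notion of a name: a
run of ASCII letters, digits, "_", "-", or anything at 0x80 or above,
with no escapes and no start-of-ident rules. The function names are
made up for illustration. The same scanning loop runs over code points
and over UTF-8 bytes and finds the same ident.

```python
def is_name_unit(unit: int) -> bool:
    """True for ASCII letters, digits, '_', '-', and any value >= 0x80.

    'unit' is either a code point or a UTF-8 byte; the test is the same
    because everything non-ASCII is treated alike (an assumption here).
    """
    return (
        unit >= 0x80
        or unit in (0x5F, 0x2D)      # '_' and '-'
        or 0x30 <= unit <= 0x39      # 0-9
        or 0x41 <= unit <= 0x5A      # A-Z
        or 0x61 <= unit <= 0x7A      # a-z
    )


def scan_name(units, start: int = 0) -> int:
    """Return the index just past the run of name units beginning at start."""
    end = start
    while end < len(units) and is_name_unit(units[end]):
        end += 1
    return end


def ident_from_code_points(text: str) -> str:
    """Scan over a list of code points."""
    code_points = [ord(c) for c in text]
    return text[:scan_name(code_points)]


def ident_from_utf8_bytes(text: str) -> str:
    """Scan over raw UTF-8 bytes without decoding during the scan."""
    data = text.encode("utf-8")
    end = scan_name(data)
    # The run can only stop at an ASCII byte or at the end of input, so
    # the slice never cuts a multi-byte sequence in half.
    return data[:end].decode("utf-8")
```

The byte-based version only decodes once, after the token boundary is
known, which is where the efficiency gain would come from.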

But with that silly definition of "non-ASCII" this doesn’t work. When
encountering a byte in 0x80~0xFF (ok, actually only 0xC2, the lead byte
for U+0080~U+009F), this implementation would have to decode the
multi-byte UTF-8 sequence and check for U+0080~U+009F.
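
For illustration, the extra check could look roughly like the sketch
below. It relies on the fact that U+0080 to U+009F are encoded in UTF-8
as 0xC2 followed by a byte in 0x80 to 0x9F. The function name is
hypothetical, and the input is assumed to be well-formed UTF-8.

```python
def starts_c1_control(data: bytes, i: int) -> bool:
    """True if the UTF-8 sequence starting at data[i] encodes U+0080..U+009F.

    Those code points are exactly the two-byte sequences 0xC2 0x80 through
    0xC2 0x9F, so a byte-based tokenizer has to look at the following byte
    whenever it sees 0xC2.
    """
    return (
        data[i] == 0xC2
        and i + 1 < len(data)
        and 0x80 <= data[i + 1] <= 0x9F
    )
```

This is the special case that disappears if U+0080 to U+009F count as
non-ASCII like every other code point above U+007F.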

-- 
Simon Sapin

Received on Friday, 25 January 2013 08:48:58 UTC