- From: Simon Sapin <simon.sapin@kozea.fr>
- Date: Fri, 25 Jan 2013 09:48:15 +0100
- To: Glenn Adams <glenn@skynav.com>
- CC: Bjoern Hoehrmann <derhoermi@gmx.net>, www-style list <www-style@w3.org>
On 25/01/2013 07:06, Glenn Adams wrote:
> On Thu, Jan 24, 2013 at 2:59 PM, Simon Sapin <simon.sapin@kozea.fr
> <mailto:simon.sapin@kozea.fr>> wrote:
>
>> This would address the current definition being "wrong" but not what
>> I really want. Which is being able to implement a conforming
>> tokenizer that, for efficiency, pretends that UTF-8 bytes are code
>> points.
>
> What do you mean by this exactly? UTF-8 bytes match Unicode code points
> only in the ASCII range (0x00 - 0x7F).

Yes, exactly! If all non-ASCII code points (including U+0080 to U+009F)
are treated the same by the tokenizer, an implementation could represent
the input as a list of UTF-8 bytes rather than a list of code points
(think "current input byte" rather than "current input character") and
still obtain the same tokens: the bytes of a multi-byte UTF-8 sequence
always end up together in the value of, e.g., an ident token.

But with that silly definition of "non-ASCII", this no longer works.
When encountering a byte in the 0x80-0xFF range (in fact only 0xC2
matters, since U+0080-U+009F are encoded as 0xC2 followed by a byte in
0x80-0x9F), such an implementation would have to decode the multi-byte
UTF-8 sequence and check for U+0080-U+009F.

--
Simon Sapin
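P.S. To make the byte-level idea concrete, here is a rough Python sketch
of ident consumption over raw UTF-8 bytes. The names (_is_name_byte,
consume_ident) are purely illustrative, and the set of accepted ASCII
bytes is simplified compared to what the spec actually requires.

    def _is_name_byte(b):
        """True if this input byte may appear in an ident token.

        Any byte >= 0x80 belongs to a multi-byte UTF-8 sequence, so the
        whole sequence is accepted without decoding it to a code point.
        """
        return (
            b in (0x2D, 0x5F)        # '-' and '_'
            or 0x30 <= b <= 0x39     # '0'-'9'
            or 0x41 <= b <= 0x5A     # 'A'-'Z'
            or 0x61 <= b <= 0x7A     # 'a'-'z'
            or b >= 0x80             # any non-ASCII byte
        )

    def consume_ident(data, pos):
        """Consume an ident token from raw UTF-8 bytes, starting at pos."""
        start = pos
        while pos < len(data) and _is_name_byte(data[pos]):
            pos += 1
        return pos, data[start:pos]

    # The two bytes 0xC3 0xAF of 'ï' stay together in the ident value,
    # exactly as if the tokenizer had decoded them into one code point.
    pos, ident = consume_ident('naïve{color:red}'.encode('utf-8'), 0)
    assert ident == 'naïve'.encode('utf-8')

With the U+0080-U+009F carve-out, the final "b >= 0x80" test would no
longer be enough: on a 0xC2 lead byte the tokenizer would have to look
at the following byte and reject 0x80-0x9F, i.e. decode the sequence
after all.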
Received on Friday, 25 January 2013 08:48:58 UTC