
Re: [css3-syntax] Making U+0080 to U+009F "non-ASCII"?

From: Simon Sapin <simon.sapin@kozea.fr>
Date: Fri, 25 Jan 2013 09:48:15 +0100
Message-ID: <510246CF.5010503@kozea.fr>
To: Glenn Adams <glenn@skynav.com>
CC: Bjoern Hoehrmann <derhoermi@gmx.net>, www-style list <www-style@w3.org>
On 25/01/2013 07:06, Glenn Adams wrote:
> On Thu, Jan 24, 2013 at 2:59 PM, Simon Sapin <simon.sapin@kozea.fr> wrote:
>     This would address the current definition being "wrong", but not
>     what I really want, which is being able to implement a conforming
>     tokenizer that, for efficiency, pretends that UTF-8 bytes are code
>     points.
> What do you mean by this exactly? UTF-8 bytes match Unicode code points
> only in the ASCII range (0x00 - 0x7F).

Yes, exactly! If all non-ASCII code points (including U+0080 to U+009F) 
are treated the same in the tokenizer, an implementation can represent 
the input as a list of UTF-8 bytes rather than a list of code points 
(think "current input byte" rather than "current input character") and 
obtain the same tokens. The bytes of a multi-byte UTF-8 sequence always 
end up together in the value of, e.g., an ident token.
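
A minimal sketch of what I mean, in Python, ignoring escapes and the 
rules for the first character of an ident (the helper names are 
illustrative, not from any real tokenizer):

    def is_name_byte(b):
        # '-', '_', 0-9, A-Z, a-z, or any byte of a multi-byte
        # UTF-8 sequence (all such bytes are >= 0x80).
        return (b in (0x2D, 0x5F)
                or 0x30 <= b <= 0x39
                or 0x41 <= b <= 0x5A
                or 0x61 <= b <= 0x7A
                or b >= 0x80)

    def consume_ident(data, pos):
        # Operate directly on UTF-8 bytes; never decode.
        start = pos
        while pos < len(data) and is_name_byte(data[pos]):
            pos += 1
        return data[start:pos], pos

    # 'é' is 0xC3 0xA9; both bytes stay together in the token value:
    value, end = consume_ident('café-au-lait'.encode('utf-8'), 0)
    assert value.decode('utf-8') == 'café-au-lait'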

But with that silly definition of "non-ASCII", this doesn't work. When 
encountering a byte in the range 0x80–0xFF (actually only 0xC2, since 
U+0080 to U+009F are encoded as 0xC2 0x80 through 0xC2 0x9F), this 
implementation would have to decode the multi-byte UTF-8 sequence and 
check for U+0080 to U+009F.
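
Concretely, the extra check that definition forces on a byte-level 
tokenizer (same assumptions as the sketch above):

    def is_excluded_c1(data, pos):
        # True if the two bytes at `pos` decode to U+0080..U+009F,
        # which are exactly 0xC2 0x80 through 0xC2 0x9F in UTF-8.
        return (data[pos] == 0xC2
                and pos + 1 < len(data)
                and 0x80 <= data[pos + 1] <= 0x9F)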

Simon Sapin