- From: Simon Sapin <simon.sapin@kozea.fr>
- Date: Fri, 25 Jan 2013 09:48:15 +0100
- To: Glenn Adams <glenn@skynav.com>
- CC: Bjoern Hoehrmann <derhoermi@gmx.net>, www-style list <www-style@w3.org>
On 25/01/2013 07:06, Glenn Adams wrote:
> On Thu, Jan 24, 2013 at 2:59 PM, Simon Sapin <simon.sapin@kozea.fr
> <mailto:simon.sapin@kozea.fr>> wrote:
>
>> This would address the current definition being "wrong" but not what
>> I really want. Which is being able to implement a conforming
>> tokenizer that, for efficiency, pretends that UTF-8 bytes are code
>> points.
>
> What do you mean by this exactly? UTF-8 bytes match Unicode code points
> only in the ASCII range (0x00 - 0x7F).

Yes, exactly! If all non-ASCII code points (including U+0080 to U+009F)
are treated the same by the tokenizer, an implementation could represent
the input as a list of UTF-8 bytes rather than a list of code points
(think "current input byte" rather than "current input character") and
still obtain the same tokens: the bytes of a multi-byte UTF-8 sequence
always end up together in the value of, e.g., an ident token.

But with that silly definition of "non-ASCII", this no longer works.
When encountering a byte in the 0x80-0xFF range (in fact only 0xC2
matters, since U+0080-U+009F are encoded as 0xC2 followed by a byte in
0x80-0x9F), such an implementation would have to decode the multi-byte
UTF-8 sequence and check for U+0080-U+009F.

--
Simon Sapin
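P.S. To make the byte-level idea concrete, here is a rough Python sketch
of ident consumption over raw UTF-8 bytes. The names (_is_name_byte,
consume_ident) are purely illustrative, and the set of accepted ASCII
bytes is simplified compared to what the spec actually requires.

    def _is_name_byte(b):
        """True if this input byte may appear in an ident token.

        Any byte >= 0x80 belongs to a multi-byte UTF-8 sequence, so the
        whole sequence is accepted without decoding it to a code point.
        """
        return (
            b in (0x2D, 0x5F)        # '-' and '_'
            or 0x30 <= b <= 0x39     # '0'-'9'
            or 0x41 <= b <= 0x5A     # 'A'-'Z'
            or 0x61 <= b <= 0x7A     # 'a'-'z'
            or b >= 0x80             # any non-ASCII byte
        )

    def consume_ident(data, pos):
        """Consume an ident token from raw UTF-8 bytes, starting at pos."""
        start = pos
        while pos < len(data) and _is_name_byte(data[pos]):
            pos += 1
        return pos, data[start:pos]

    # The two bytes 0xC3 0xAF of 'ï' stay together in the ident value,
    # exactly as if the tokenizer had decoded them into one code point.
    pos, ident = consume_ident('naïve{color:red}'.encode('utf-8'), 0)
    assert ident == 'naïve'.encode('utf-8')

With the U+0080-U+009F carve-out, the final "b >= 0x80" test would no
longer be enough: on a 0xC2 lead byte the tokenizer would have to look
at the following byte and reject 0x80-0x9F, i.e. decode the sequence
after all.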
Received on Friday, 25 January 2013 08:48:58 UTC