
Re: [CSS21] questions about Lex regexes used to define tokens

From: Simon Montagu <smontagu@smontagu.org>
Date: Tue, 14 Jun 2011 16:20:06 +0300
Message-ID: <4DF76006.7030901@smontagu.org>
To: "www-style@w3.org" <www-style@w3.org>
On 06/14/2011 12:38 PM, Mikko Rantalainen wrote:
> 2011-06-10 19:52 EEST: Joshua Cranmer:
>> On 6/10/2011 9:37 AM, Jack Smiley wrote:
>>> 3) Regarding the macro definition for nonascii, why does it go up to
>>> octal 237? (what's special about 237?) Why not octal 177 (decimal 127
>>> -- standard ASCII) or octal 377 (decimal 255 -- extended ASCII)?
>> Presumably, 238 and above is where you have individually invalid octets
>> for UTF-8.
>
> Isn't anything that has 8th bit set possibly invalid in UTF-8? Octal 177
> / decimal 127 makes more sense if UTF-8 compatibility is the reason for
> this limit.

Firstly, these are Unicode code points, not octets in UTF-8 or any other 
encoding. See above: "Octal codes refer to ISO 10646".

Secondly, the macro *excludes* \0-\237. In other words, it includes \240 
onwards, i.e. U+00A0 to U+FFFF (presumably no higher, since the reference 
is to ISO/IEC 10646-1:2003, which covers the BMP only).
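To make the code-point reading concrete, here is a small sketch (my own illustration, not part of the thread) using Python, whose regex engine also treats numeric escapes inside a character class as characters rather than group references, so the CSS 2.1 `nonascii` class can be written verbatim:

```python
import re

# CSS 2.1 Grammar defines:  nonascii  [^\0-\237]
# Interpreted over Unicode code points (per "Octal codes refer to
# ISO 10646"), this excludes U+0000-U+009F and matches U+00A0 onwards.
nonascii = re.compile(r'[^\0-\237]')

assert not nonascii.match('\x7f')  # U+007F (octal 177): excluded
assert not nonascii.match('\x9f')  # U+009F (octal 237): still excluded
assert nonascii.match('\xa0')      # U+00A0 (octal 240): first match
assert nonascii.match('\u00e9')    # 'é' matches, as expected
```

Note that octal 177 (DEL) through octal 237 are also excluded, which is why the upper bound is \237 rather than \177: the C1 control range U+0080-U+009F falls outside `nonascii` as well.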
Received on Tuesday, 14 June 2011 13:20:33 GMT
