Re: [css3-syntax] Reviving the spec, starting with the parser from Simon Sapin on 2012-04-12 (www-style@w3.org from April 2012)

From: Simon Sapin <simon.sapin@kozea.fr>
Date: Thu, 12 Apr 2012 18:08:30 +0200
To: "Tab Atkins Jr." <jackalmage@gmail.com>
CC: www-style list <www-style@w3.org>
Message-ID: <4F86FDFE.8050304@kozea.fr>

Le 12/04/2012 17:22, Tab Atkins Jr. a écrit :
>> >  I also suggest making the supported range implementation-dependent. The
>> >  current highest unicode codepoint is 0x10ffff, but some "broken" platforms
>> >  only support up to 0xffff (ie. only inside the BMP).
> CSS doesn't currently allow platforms to not support all of unicode.
> Do you have specific examples of platforms in use that are broken in
> this way that we should support?

CPython before 3.3 has a compile-time switch to make the internal
storage for codepoints UCS-4 instead of UCS-2. The sys.maxunicode
constant reflects that: it is either 1114111 (0x10ffff) or 65535
(0xffff). Calling chr(x) (unichr(x) on Python 2) with
x > sys.maxunicode raises an exception.
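
As a rough illustration of what an implementation-dependent range could
look like in practice, here is a minimal sketch in Python 3 syntax. The
safe_chr name and the U+FFFD fallback are my own assumptions for the
example, not anything agreed in this thread.

```python
import sys

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER (assumed fallback)


def safe_chr(codepoint):
    """Return the character for codepoint, or U+FFFD if this build
    cannot represent it (hypothetical helper)."""
    # sys.maxunicode is 0xFFFF on a narrow build and 0x10FFFF otherwise.
    if 0 <= codepoint <= sys.maxunicode:
        return chr(codepoint)
    return REPLACEMENT
```

On a build where sys.maxunicode is 65535, safe_chr(0x1F600) would take
the fallback branch instead of raising.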

Decoding a non-BMP character from bytes on a UCS-2 build creates two
codepoints for the surrogate pair. This is wrong (e.g. slicing can split
the pairs) but kind of works out when encoding back to bytes.
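
Narrow builds are gone since CPython 3.3, so the sketch below only
simulates one by working on lists of UTF-16 code units. The helper names
are made up for the example.

```python
import struct


def to_code_units(text):
    """Split text into 16-bit UTF-16 code units, roughly what a
    UCS-2 build would store (simulation only)."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def from_code_units(units):
    """Encode the code units back to bytes and decode them as UTF-16."""
    data = struct.pack("<%dH" % len(units), *units)
    return data.decode("utf-16-le")


# One non-BMP character (U+1F600) between two ASCII letters:
# three characters, but four code units.
units = to_code_units("a\U0001F600b")
```

Here from_code_units(units) gives the original text back, which is the
"kind of works out" case. Slicing with units[:2] keeps only the first
half of the pair, and from_code_units(units[:2]) raises
UnicodeDecodeError.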

Although I’m not as familiar with the details, I think that Java and
JavaScript have similar issues, due to pretending that all of Unicode
is still 16 bits and that UTF-16 is the same as UCS-2.

Depending on what is done with the parsed stylesheet, decoding a single 
hex escape to a surrogate pair of codepoints might "work" (as in, use 
the right glyph if displayed on a screen eventually). Is this behavior 
acceptable? (Maybe it does not matter for CSS?)
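
To make the question concrete, this is a sketch of the arithmetic a
16-bit platform would do for one hex escape. The function name is
hypothetical, and raising ValueError above 0x10ffff is only a
placeholder, since what to do there is part of what is being discussed.

```python
def escape_to_code_units(hex_digits):
    """Turn the hex digits of a CSS escape into UTF-16 code units
    (hypothetical helper, for illustration only)."""
    codepoint = int(hex_digits, 16)
    if codepoint > 0x10FFFF:
        # Placeholder behavior, not a claim about what CSS requires.
        raise ValueError("outside the Unicode range")
    if codepoint <= 0xFFFF:
        # Inside the BMP: a single code unit.
        return [codepoint]
    # Outside the BMP: split into a high and a low surrogate.
    offset = codepoint - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]
```

For example, the escape \1F600 becomes the two code units 0xD83D and
0xDE00, while \41 stays a single 0x41.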


>> >  Also, \0 sometimes has a special meaning and cannot be used in the middle
>> >  of a string. This could be expressed by having the supported range start at
>> >  U+0001 instead of U+0000.
> I just tested Chrome, Firefox, and IE 8, and only Chrome handles a \0
> in a string correctly.  Firefox bails and pretends I was trying to
> escape a '0', and IE is just *weird* - it emits a replacement
> character and then turns the remainder of the string into replacement
> characters too.

I was thinking of Firefox’s behavior, which seemed like a weird
workaround, but I’ll let Mozilla people speak on that.

-- 
Simon Sapin

Received on Thursday, 12 April 2012 16:09:01 UTC