Re: [css3-syntax] Reviving the spec, starting with the parser from Tab Atkins Jr. on 2012-04-12 (www-style@w3.org from April 2012)

From: Tab Atkins Jr. <jackalmage@gmail.com>
Date: Thu, 12 Apr 2012 09:28:19 -0700
To: Simon Sapin <simon.sapin@kozea.fr>
Cc: www-style list <www-style@w3.org>
Message-ID: <CAAWBYDAfy6fsmuw9X+eOh1T9fT4D0zj2PG7+q5RvXb+yq+qk2Q@mail.gmail.com>

On Thu, Apr 12, 2012 at 9:08 AM, Simon Sapin <simon.sapin@kozea.fr> wrote:
> On 12/04/2012 17:22, Tab Atkins Jr. wrote:
>>> I also suggest making the supported range implementation-dependent. The
>>> current highest unicode codepoint is 0x10ffff, but some "broken" platforms
>>> only support up to 0xffff (ie. only inside the BMP).
>>
>> CSS doesn't currently allow platforms to not support all of unicode.
>> Do you have specific examples of platforms in use that are broken in
>> this way that we should support?
>
>
> CPython before 3.3 has a compile-time switch to make the internal storage
> for codepoints UCS-4 instead of UCS-2. The sys.maxunicode constant reflects
> that. (It is either 1114111 or 65535). Calling chr(x) with x >
> sys.maxunicode raises an exception.
>
> Decoding a non-BMP character from bytes on a UCS-2 build creates two
> codepoints for the surrogate pair. This is wrong (eg. slicing can split the
> pairs) but kind of works out when encoding back to bytes.
>
> Although I’m not as familiar with the details, I think that Java and
> Javascript have similar issues. (Due to pretending that all of Unicode is
> still 16 bits and UTF-16 is the same as UCS-2.)

JavaScript (and, I assume, Python and Java) just need extra work to
make this work correctly.  It's an inconvenience, not a fundamental
limitation.
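
As a rough illustration of what that extra work could look like, here is a minimal Python sketch of the standard UTF-16 surrogate arithmetic. The function names are hypothetical, chosen for this example only; they do not come from any spec or library.

```python
# Minimal sketch of UTF-16 surrogate arithmetic.
# to_surrogate_pair and from_surrogate_pair are hypothetical names.

def to_surrogate_pair(codepoint):
    """Split a non-BMP code point into (high, low) surrogate code units."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("code point is outside the non-BMP range")
    offset = codepoint - 0x10000
    # The top 10 bits go into the high surrogate, the bottom 10 into the low.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, to_surrogate_pair(0x1F600) gives (0xD83D, 0xDE00), and from_surrogate_pair(0xD83D, 0xDE00) gives back 0x1F600.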

> Depending on what is done with the parsed stylesheet, decoding a single hex
> escape to a surrogate pair of codepoints might "work" (as in, use the right
> glyph if displayed on a screen eventually). Is this behavior acceptable?
> (Maybe it does not matter for CSS?)

Whether or not it "works" depends on the exact details of what you're
doing, but it will at least have predictable behavior - both halves
will fall into the "non-ASCII character" bucket and get processed
fairly normally.  In JS, emitting a string with a surrogate pair will
work correctly.
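
A small sketch of the case discussed above, assuming a platform whose strings only hold code units up to 0xFFFF. The decode_hex_escape helper and its max_codepoint parameter are hypothetical and only illustrate the idea; this is not the parsing algorithm from the spec.

```python
# Sketch: decode the hex digits of a CSS escape into code units.
# decode_hex_escape and max_codepoint are hypothetical, for illustration.

def decode_hex_escape(hex_digits, max_codepoint=0x10FFFF):
    """Return a list of code units for the escaped code point.

    max_codepoint is assumed to be either 0x10FFFF (full Unicode)
    or 0xFFFF (a platform limited to 16-bit code units).
    """
    codepoint = int(hex_digits, 16)
    if codepoint > 0x10FFFF:
        raise ValueError("escape is beyond the Unicode range")
    if codepoint <= max_codepoint:
        return [codepoint]
    # The platform cannot hold this value in one unit: emit a surrogate pair.
    offset = codepoint - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


def is_non_ascii(unit):
    """True for any code unit outside the ASCII range."""
    return unit >= 0x80


units = decode_hex_escape("1F600", max_codepoint=0xFFFF)
both_halves_non_ascii = all(is_non_ascii(u) for u in units)
```

Here units holds the two surrogate halves, and both_halves_non_ascii is True, so each half would land in the "non-ASCII character" bucket.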

~TJ

Received on Thursday, 12 April 2012 16:29:12 UTC