Re: [css-syntax] Defining "character" from Simon Sapin on 2013-08-12 (www-style@w3.org from August 2013)

From: Simon Sapin <simon.sapin@exyr.org>
Date: Mon, 12 Aug 2013 17:59:25 +0100
To: www-style@w3.org
Message-ID: <5209146D.8030208@exyr.org>

On 12/08/2013 17:25, Zack Weinberg wrote:
> On Mon, Aug 12, 2013 at 7:35 AM, Simon Sapin <simon.sapin@exyr.org> wrote:
>> data:text/html,<style>body:before{}</style><script>document.styleSheets[0].cssRules[0].style.content="'-\ud834\udd1e-'"</script>
>
> That JavaScript strings expose surrogate pairs to the programmer is a
> (unfixable due to backward compatibility) specification bug in
> JavaScript, which should not infect CSS; the behavior on our side
> should IMHO be as-if the surrogate pair is converted to the
> corresponding code point before tokenization, such that the modified
> style sheet is indistinguishable from the one produced by
>
> data:text/html,<style>body:before{content:'-\01d11e -'}</style>

Yes. That’s fine: surrogate pairs are how you’re supposed to represent
non-BMP code points in JavaScript. The trouble is with unpaired surrogates:

data:text/html,<style>body:before{}</style><script>document.styleSheets[0].cssRules[0].style.content="'-\ud834-\udd1e-'"</script>
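
To make the difference between the two test cases concrete, here is a
minimal sketch in plain JavaScript that lists the positions of surrogate
code units that have no partner. The helper name findLoneSurrogates is
made up for illustration; it is not part of any spec or browser API.

```javascript
// Hypothetical helper: return the indices of unpaired surrogate code units.
function findLoneSurrogates(str) {
  const lone = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    const isLead = unit >= 0xD800 && unit <= 0xDBFF;
    const isTrail = unit >= 0xDC00 && unit <= 0xDFFF;
    if (isLead) {
      // A lead surrogate is only valid when a trail surrogate follows it.
      const next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // well-formed pair: skip over the trail surrogate
      } else {
        lone.push(i);
      }
    } else if (isTrail) {
      // A trail surrogate reached here was not preceded by a lead.
      lone.push(i);
    }
  }
  return lone;
}

console.log(findLoneSurrogates("'-\ud834\udd1e-'"));  // []
console.log(findLoneSurrogates("'-\ud834-\udd1e-'")); // [ 2, 4 ]
```

The first string is the well-formed one from my earlier message; the
second is the one above, where the "-" between the two halves leaves both
surrogates unpaired.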


> I think this is the behavior already specified, and at most we need to
> make a note that regardless of how the parser is invoked, UTF-16 input
> counts as a "byte stream" requiring decoding per section 3.2 before
> tokenization, perhaps giving this case as an example.

The UTF-16 decoder emits U+FFFD for unpaired surrogates. This sounds
fine, except that it’s not what implementations do. They all seem to let
unpaired surrogate code points through unchanged.
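
For reference, the decoder behavior I mean can be sketched roughly as
follows. This is an illustration only: decodeUtf16 is a name I made up,
and it works on an array of 16-bit code units rather than on bytes, so it
is not the spec's algorithm verbatim.

```javascript
// Sketch: turn UTF-16 code units into code points, replacing every
// unpaired surrogate with U+FFFD REPLACEMENT CHARACTER.
function decodeUtf16(codeUnits) {
  const codePoints = [];
  for (let i = 0; i < codeUnits.length; i++) {
    const unit = codeUnits[i];
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      const next = codeUnits[i + 1];
      if (next !== undefined && next >= 0xDC00 && next <= 0xDFFF) {
        // Well-formed pair: combine lead and trail into one code point.
        codePoints.push(0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00));
        i++; // consume the trail surrogate as well
      } else {
        codePoints.push(0xFFFD); // lead surrogate with no trail
      }
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      codePoints.push(0xFFFD); // trail surrogate with no lead
    } else {
      codePoints.push(unit); // ordinary BMP code unit
    }
  }
  return codePoints;
}

const hex = (cps) => cps.map((cp) => cp.toString(16));
console.log(hex(decodeUtf16([0x2D, 0xD834, 0xDD1E, 0x2D])));
// [ '2d', '1d11e', '2d' ]
console.log(hex(decodeUtf16([0x2D, 0xD834, 0x2D, 0xDD1E, 0x2D])));
// [ '2d', 'fffd', '2d', 'fffd', '2d' ]
```

With a decoder like this in front of the tokenizer, the second test case
would render two U+FFFD characters, which is not what browsers seem to
produce today.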

-- 
Simon Sapin

Received on Monday, 12 August 2013 16:59:48 UTC