
Re: [css-syntax] Defining "character"

From: Simon Sapin <simon.sapin@exyr.org>
Date: Mon, 12 Aug 2013 17:59:25 +0100
Message-ID: <5209146D.8030208@exyr.org>
To: www-style@w3.org
On 12/08/2013 17:25, Zack Weinberg wrote:
> On Mon, Aug 12, 2013 at 7:35 AM, Simon Sapin <simon.sapin@exyr.org> wrote:
>> data:text/html,<style>body:before{}</style><script>document.styleSheets[0].cssRules[0].style.content="'-\ud834\udd1e-'"</script>
> That JavaScript strings expose surrogate pairs to the programmer is a
> (unfixable due to backward compatibility) specification bug in
> JavaScript, which should not infect CSS; the behavior on our side
> should IMHO be as-if the surrogate pair is converted to the
> corresponding code point before tokenization, such that the modified
> style sheet is indistinguishable from the one produced by
> data:text/html,<style>body:before{content:'-\01d11e -'}</style>

Yes. That’s fine: surrogate pairs are how you’re supposed to represent
non-BMP code points in JavaScript. The trouble is with unpaired surrogates:


> I think this is the behavior already specified, and at most we need to
> make a note that regardless of how the parser is invoked, UTF-16 input
> counts as a "byte stream" requiring decoding per section 3.2 before
> tokenization, perhaps giving this case as an example.

The UTF-16 decoder emits U+FFFD for unpaired surrogates. This sounds
fine, except that it’s not what implementations do: they all seem to let
unpaired surrogate code points through.
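
To make the decoder behavior concrete, here is a minimal sketch of decoding UTF-16 code units into code points, with unpaired surrogates replaced by U+FFFD. The function name decodeUtf16 is hypothetical and used only for illustration; this is not taken from any spec text or implementation, and it operates on a JavaScript string rather than on a byte stream.

```javascript
// Hypothetical helper (illustration only): turn the UTF-16 code units of a
// JavaScript string into an array of code points. A valid surrogate pair
// becomes one non-BMP code point; an unpaired surrogate becomes U+FFFD.
function decodeUtf16(str) {
  const codePoints = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    const isLead = unit >= 0xd800 && unit <= 0xdbff;
    const isTrail = unit >= 0xdc00 && unit <= 0xdfff;
    if (isLead && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        // Valid pair: combine both halves and skip the trail unit.
        codePoints.push(0x10000 + ((unit - 0xd800) << 10) + (next - 0xdc00));
        i++;
        continue;
      }
    }
    // Lone lead or lone trail surrogate: emit the replacement character.
    codePoints.push(isLead || isTrail ? 0xfffd : unit);
  }
  return codePoints;
}

// Paired surrogates decode to U+1D11E.
console.log(decodeUtf16("-\ud834\udd1e-").map((c) => c.toString(16)));
// → [ '2d', '1d11e', '2d' ]

// An unpaired lead surrogate decodes to U+FFFD.
console.log(decodeUtf16("-\ud834-").map((c) => c.toString(16)));
// → [ '2d', 'fffd', '2d' ]
```

With this behavior, a string containing a lone \ud834 would reach the tokenizer as U+FFFD instead of as a surrogate code point.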

Simon Sapin
Received on Monday, 12 August 2013 16:59:48 UTC
