Re: [css-syntax] Defining "character" from Zack Weinberg on 2013-08-12 (www-style@w3.org from August 2013)

From: Zack Weinberg <zackw@panix.com>
Date: Mon, 12 Aug 2013 09:25:00 -0700
To: Simon Sapin <simon.sapin@exyr.org>
Cc: Bjoern Hoehrmann <derhoermi@gmx.net>, www-style list <www-style@w3.org>
Message-ID: <CAKCAbMi0wQS8yi8Mui6W42r2KriXNVymMquCzswqYcWm3tb8dQ@mail.gmail.com>

On Mon, Aug 12, 2013 at 7:35 AM, Simon Sapin <simon.sapin@exyr.org> wrote:
> data:text/html,<style>body:before{}</style><script>document.styleSheets[0].cssRules[0].style.content="'-\ud834\udd1e-'"</script>

The fact that JavaScript strings expose surrogate pairs to the programmer
is a specification bug in JavaScript (unfixable because of backward
compatibility), and it should not infect CSS. The behavior on our side
should IMHO be as if the surrogate pair were converted to the
corresponding code point before tokenization, so that the modified
style sheet is indistinguishable from the one produced by

data:text/html,<style>body:before{content:'-\01d11e -'}</style>
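
To make the equivalence concrete, here is a rough sketch of the arithmetic that maps the pair \ud834\udd1e to the code point written as \01d11e above. The helper names surrogatePairToCodePoint and toCssEscape are made up for illustration; they are not taken from any spec or implementation.

```javascript
// Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
function surrogatePairToCodePoint(high, low) {
  if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF) {
    throw new RangeError("not a valid surrogate pair");
  }
  // Each surrogate carries 10 bits; the result is offset by 0x10000.
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

// Hypothetical helper: write a code point as a six-digit CSS escape,
// followed by the single space that terminates the escape.
function toCssEscape(codePoint) {
  return "\\" + codePoint.toString(16).padStart(6, "0") + " ";
}

const codePoint = surrogatePairToCodePoint(0xD834, 0xDD1E);
console.log(codePoint.toString(16));               // 1d11e
console.log("'-" + toCssEscape(codePoint) + "-'"); // '-\01d11e -'
```

The built-in "\ud834\udd1e".codePointAt(0) gives the same value, so the helper only spells out what the language already does when asked for code points rather than code units.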

I think this is the behavior already specified, and at most we need to
make a note that regardless of how the parser is invoked, UTF-16 input
counts as a "byte stream" requiring decoding per section 3.2 before
tokenization, perhaps giving this case as an example.
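
A minimal sketch of that reading, assuming the UTF-16 input arrives as little-endian bytes without a BOM: decode first, then hand code points (not code units) to the tokenizer. The function name decodeUtf16ToCodePoints is hypothetical, and the use of TextDecoder is my choice for illustration, not something the draft prescribes.

```javascript
// Hypothetical pre-tokenization step: decode UTF-16LE bytes to code points.
function decodeUtf16ToCodePoints(bytes) {
  const text = new TextDecoder("utf-16le").decode(bytes);
  // Array.from iterates a string by code point, so a surrogate pair
  // arrives as one element rather than two code units.
  return Array.from(text, (ch) => ch.codePointAt(0));
}

// The bytes of "-", U+1D11E (as the pair D834 DD1E), "-" in UTF-16LE.
const input = new Uint8Array([0x2d, 0x00, 0x34, 0xd8, 0x1e, 0xdd, 0x2d, 0x00]);
const codePoints = decodeUtf16ToCodePoints(input);
console.log(codePoints.map((cp) => "U+" + cp.toString(16).toUpperCase()));
```

With this step in front, the tokenizer would see three code points for that input, the same three it would see for the escaped form of the string.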

zw

Received on Monday, 12 August 2013 16:25:27 UTC