Re: [css3-syntax] CSS escape sequences from Jonathan Kew on 2012-01-12 (www-style@w3.org from January 2012)

From: Jonathan Kew <jonathan@jfkew.plus.com>
Date: Thu, 12 Jan 2012 14:00:45 +0000
To: www-style Style <www-style@w3.org>
Message-Id: <FFC47596-8039-4BD3-9712-7EE63911DB79@jfkew.plus.com>

On 12 Jan 2012, at 13:41, Mathias Bynens wrote:

> On Thu, Jan 12, 2012 at 2:01 PM, Bjoern Hoehrmann <derhoermi@gmx.net> wrote:
>> I see no reason why it should be.
> 
> Well, all existing engines (except Gecko) already support this, and
> have for years. Pave The Cowpaths, etc.
> 
>> This looks suspiciously like an (inadvertent?) artifact of the use of UTF-16 as the encoding form for strings within the browser.
> 
> It does, but does that matter?
> 
>>> Should the spec be changed to reflect reality?
>> 
>> CSS backslash-hexadecimal character escapes are supposed to represent ISO 10646 character codes, *NOT* UTF-16 code units.
>> 
>> As such, I think interpreting "\d834\df06" as the character U+1D306 should be considered a bug, and the spec should perhaps be clarified with a note explicitly prohibiting this behavior.
> 
> As it stands, it probably is a bug, but rather than dismiss it we
> could embrace it and spec it in a way that is backwards compatible
> with these implementations — remember, we’re talking about *all
> browsers except Firefox* here. Something like: “If a UTF-16 surrogate
> pair is found, decode it as such; else, proceed as usual.”

What if an unpaired UTF-16 surrogate codepoint is found? ("Proceed as usual", I suppose. What do the various browsers do with this currently?)
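
To make the question concrete, here is a rough Python sketch of the proposed rule as I read it. The helper name `combine_surrogates` is made up for illustration, and letting an unpaired surrogate pass through unchanged is only my guess at what "proceed as usual" would mean; I am not claiming any browser does exactly this. It works on the numeric values of consecutive hex escapes and ignores the rest of CSS escape parsing.

```python
def combine_surrogates(code_points):
    """Merge <high, low> surrogate pairs; leave everything else untouched.

    Hypothetical helper: takes the numeric values of consecutive CSS hex
    escapes and returns the resulting list of code points.
    """
    out = []
    i = 0
    while i < len(code_points):
        cp = code_points[i]
        nxt = code_points[i + 1] if i + 1 < len(code_points) else None
        if 0xD800 <= cp <= 0xDBFF and nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
            # Standard UTF-16 arithmetic for a <high, low> surrogate pair.
            out.append(0x10000 + ((cp - 0xD800) << 10) + (nxt - 0xDC00))
            i += 2
        else:
            # "Proceed as usual" (assumption): an unpaired surrogate is kept as-is.
            out.append(cp)
            i += 1
    return out
```

With this, `combine_surrogates([0xD834, 0xDF06])` gives `[0x1D306]`, while a lone `[0xD834]` comes back unchanged, which is exactly the case that would still need a defined behavior.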

My preference would be to explicitly disallow character escapes in the range \d800..\dfff. That seems cleaner, simpler to understand, and easier to implement than special-casing pairs of <high surrogate, low surrogate> and then deciding how to deal with unpaired surrogates sensibly.
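
As a sketch of how small that check could be (Python, with a hypothetical `check_escape` helper; raising an error is just a stand-in here, since what "disallow" should mean in practice, a parse error or something else, would be for the spec to decide):

```python
def check_escape(code_point):
    """Hypothetical validator: reject any escape in the surrogate range."""
    if 0xD800 <= code_point <= 0xDFFF:
        # Covers both high (D800..DBFF) and low (DC00..DFFF) surrogates,
        # so no pairing logic is needed at all.
        raise ValueError(f"surrogate escape \\{code_point:x} is not allowed")
    return code_point
```

Under this rule "\1d306" is accepted as-is, and both "\d834" and "\df06" are rejected regardless of what follows them.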

JK

Received on Thursday, 12 January 2012 14:03:03 UTC