Re: [css3-syntax] CSS escape sequences from Jonathan Kew on 2012-01-12 (www-style@w3.org from January 2012)

From: Jonathan Kew <jonathan@jfkew.plus.com>
Date: Thu, 12 Jan 2012 13:17:44 +0000
To: www-style <www-style@w3.org>
Message-Id: <7D68184F-986D-49E8-9DC2-CA6E666DC8A8@jfkew.plus.com>

On 12 Jan 2012, at 10:35, Mathias Bynens wrote:

> http://www.w3.org/TR/css3-syntax/#characters defines CSS escape
> sequences of the form `\000026` or `\26 `, both of which decode to
> `&`.
> 
> WebKit browsers don’t support this syntax for characters outside the
> BMP (https://bugs.webkit.org/show_bug.cgi?id=76152). For example,
> `\1d306 ` or `\01d306` are supposed to be escape sequences for the
> “tetragram for centre” symbol (U+1D306), but they don’t work in
> WebKit.
> 
> There seems to be another way to escape these characters, namely by
> breaking them up into UTF-16 code units: `\d834\df06 `. All browsers
> except Gecko (https://bugzilla.mozilla.org/show_bug.cgi?id=717529)
> seem to support this, even though this isn’t mentioned in the spec.

This looks suspiciously like an (inadvertent?) artifact of the use of UTF-16 as the encoding form for strings within the browser. Suppose a browser happened to use UTF-8 as its internal string format; should it then treat "\f0\9d\8c\86" as meaning U+1D306? (Of course not. But that would be analogous to treating "\d834\df06" that way just because the browser happens to use UTF-16 internally.)
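
To make the analogy concrete, here is a minimal Python sketch that lists both encoding forms of U+1D306. Python and the helper names `utf16_code_units` and `utf8_bytes` are chosen purely for illustration and appear nowhere in the thread or the spec. Neither list contains the code point itself; both are just serializations of it.

```python
def utf16_code_units(ch):
    """Return the UTF-16 code units of `ch` as a list of integers."""
    data = ch.encode("utf-16-be")
    # Each UTF-16 code unit is two bytes in the big-endian serialization.
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def utf8_bytes(ch):
    """Return the UTF-8 bytes of `ch` as a list of integers."""
    return list(ch.encode("utf-8"))


tetragram = chr(0x1D306)  # the character under discussion, as a code point

print([f"{u:04x}" for u in utf16_code_units(tetragram)])  # ['d834', 'df06']
print([f"{b:02x}" for b in utf8_bytes(tetragram)])        # ['f0', '9d', '8c', '86']
```

The first list matches the `\d834\df06` form and the second matches the hypothetical "\f0\9d\8c\86" form; the two stand in exactly the same relation to U+1D306.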

> Should the spec be changed to reflect reality?

CSS backslash-hexadecimal character escapes are supposed to represent ISO 10646 character codes (i.e. Unicode code points), *NOT* UTF-16 code units.

As such, I think interpreting "\d834\df06" as the character U+1D306 should be considered a bug, and the spec should perhaps be clarified with a note explicitly prohibiting this behavior.
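
As a rough illustration of that reading, the following Python sketch decodes each hexadecimal escape as one code point and never joins two escapes into a surrogate pair. The function name `decode_css_hex_escapes` is hypothetical. Substituting U+FFFD for surrogate-range, zero, or out-of-range values is an assumption of this sketch, one possible policy rather than something the message prescribes. Only hexadecimal escapes are handled; other backslash escapes are left untouched.

```python
import re

# A backslash, one to six hex digits, then optionally one whitespace
# character (CRLF counts as a single whitespace) that ends the escape.
_HEX_ESCAPE = re.compile(r"\\([0-9a-fA-F]{1,6})(?:\r\n|[ \t\n\r\f])?")

REPLACEMENT = "\ufffd"  # assumption: stand-in for values that are not valid characters


def decode_css_hex_escapes(text):
    """Decode CSS hex escapes, treating each one as a single code point."""

    def _replace(match):
        code_point = int(match.group(1), 16)
        is_surrogate = 0xD800 <= code_point <= 0xDFFF
        if code_point == 0 or is_surrogate or code_point > 0x10FFFF:
            # Surrogates are UTF-16 code units, not characters, so they
            # are never paired up here (sketch policy: replace them).
            return REPLACEMENT
        return chr(code_point)

    return _HEX_ESCAPE.sub(_replace, text)


print(decode_css_hex_escapes(r"\26 ") == "&")                    # True
print(decode_css_hex_escapes(r"\1d306 ") == "\U0001D306")        # True
print(decode_css_hex_escapes(r"\d834\df06 ") == "\ufffd\ufffd")  # True
```

Under this policy `\1d306 ` and `\01d306` both yield U+1D306, while `\d834\df06 ` yields two replacement characters instead of the tetragram.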

JK

Received on Thursday, 12 January 2012 13:23:22 UTC