[csswg-drafts] [cssom][all specs defining IDL] Consider USVString instead of DOMString, replacing surrogates with U+FFFD

SimonSapin has just created a new issue for https://github.com/w3c/csswg-drafts:

== [cssom][all specs defining IDL] Consider USVString instead of DOMString, replacing surrogates with U+FFFD ==
CSSOM uses WebIDL’s `DOMString` type for all string parameters and return values. It corresponds to JavaScript strings: arbitrary sequences of 16-bit code units. These are usually interpreted as UTF-16, but they’re not necessarily well-formed in UTF-16: they can contain [unpaired surrogate code units](https://simonsapin.github.io/wtf-8/#surrogates-code-units). I sometimes call this encoding WTF-16.

(Character encoding decoders never emit surrogates when decoding bytes from the network, even when decoding UTF-16BE or UTF-16LE. So surrogates can’t end up in a string that way, only through JS.)

WebIDL also defines `USVString`, which is a Unicode string: a sequence of Unicode scalar values, which excludes surrogate code points. When [converting to it from a JavaScript string](https://heycam.github.io/webidl/#es-USVString), unpaired surrogates are [replaced with the replacement character U+FFFD](https://heycam.github.io/webidl/#dfn-obtain-unicode).
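
To make the conversion concrete, here is a minimal sketch of that replacement step in plain JavaScript. The function name `toScalarValueString` is made up for this illustration; it is not part of WebIDL or of any browser API, and real engines do this in native code.

```javascript
// Hypothetical helper: approximates the WebIDL conversion from a
// DOMString (16-bit code units) to a USVString (scalar values only).
function toScalarValueString(input) {
  let result = "";
  for (let i = 0; i < input.length; i++) {
    const unit = input.charCodeAt(i);
    if (unit < 0xD800 || unit > 0xDFFF) {
      // Not a surrogate: copy the code unit unchanged.
      result += input[i];
    } else if (unit >= 0xDC00) {
      // Trail surrogate with no preceding lead surrogate: replace it.
      result += "\uFFFD";
    } else if (i + 1 < input.length) {
      const next = input.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        // Well-formed surrogate pair: keep both code units.
        result += input[i] + input[i + 1];
        i++;
      } else {
        // Lead surrogate followed by something else: replace it.
        result += "\uFFFD";
      }
    } else {
      // Lead surrogate at the very end of the string: replace it.
      result += "\uFFFD";
    }
  }
  return result;
}
```

Note that two different unpaired surrogates become indistinguishable after this step: `toScalarValueString('\uD800')` and `toScalarValueString('\uD801')` both yield `'\uFFFD'`, while a well-formed pair such as `'\uD83D\uDE00'` passes through unchanged.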

As far as I know all major browser engines currently use WTF-16 internally, so they preserve unpaired surrogates "by default" when strings go through various browser components where no code is actively looking for those.

In Firefox, we’re working on a new style system ([Stylo, a.k.a. Quantum CSS](https://wiki.mozilla.org/Quantum/Stylo)) where strings are represented with Rust’s native `&str` type. `&str` uses UTF-8 bytes for its in-memory representation of Unicode and guarantees (as part of the type’s contract) that these bytes are well-formed in UTF-8. Unicode designed UTF-8 to specifically exclude surrogate code points, in order to be compatible with (well-formed) UTF-16. As a consequence, well-formed UTF-8 (and `&str`) cannot represent all JavaScript strings without some sort of escape sequence mechanism.
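
The same limitation can be observed from script with the standard `TextEncoder` and `TextDecoder` APIs, which only produce and consume well-formed UTF-8. This is only a sketch of the general problem, not Stylo's actual conversion code; the variable names are illustrative.

```javascript
const encoder = new TextEncoder();
const decoder = new TextDecoder();

// A well-formed surrogate pair (U+1F600) encodes to four UTF-8 bytes.
const pairedBytes = Array.from(encoder.encode("\uD83D\uDE00"));
// [0xF0, 0x9F, 0x98, 0x80]

// An unpaired surrogate has no UTF-8 encoding, so the encoder emits
// the bytes for U+FFFD instead.
const loneSurrogate = "\uD800";
const loneBytes = Array.from(encoder.encode(loneSurrogate));
// [0xEF, 0xBF, 0xBD]

// Round-tripping through UTF-8 therefore does not give back the
// original JavaScript string.
const roundTripped = decoder.decode(encoder.encode(loneSurrogate));
const survivesRoundTrip = roundTripped === loneSurrogate; // false
```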

Stylo currently replaces unpaired surrogates with U+FFFD when converting JS strings to UTF-8. This is equivalent to defining WebIDL interfaces with `USVString` instead of `DOMString`. This is a deviation from specified and currently-interoperable behavior.

It would be possible to make Stylo preserve surrogates (for example by moving everything to [WTF-8](https://simonsapin.github.io/wtf-8/)). However, we’re inclined not to. Preserving surrogates is a historical accident, not a feature. I argue that any occurrence of surrogates in a JS string is likely an error, and that any example where not preserving them in CSSOM makes an observable difference has to be extremely convoluted. For example:

http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=5012
```html
<!DOCTYPE html>
<style></style>
<script>
document.documentElement.classList.add('\uD800');
document.styleSheets[0].insertRule('.\uD800:before { content: "Surrogates can be used in class names." }', 0);
document.styleSheets[0].insertRule('.\uD801:before { content: "Surrogates seem to be mapped to U+FFFD." }', 1);
</script>
```

So I would like to propose changing CSSOM and other CSS specifications that declare WebIDL interfaces to use `USVString` instead of `DOMString`. This makes CSS syntax “Unicode-clean” and enables implementations to use UTF-8 internally.

CSSWG discussed and rejected [in 2014](https://lists.w3.org/Archives/Public/www-style/2014Jun/0060.html) a proposal that was effectively the same. However, neither `USVString` nor Stylo existed at the time. What has changed is that WebIDL now gives us the tool to easily specify this change, and one major implementation is on a path likely to make this change.

Please view or discuss this issue at https://github.com/w3c/csswg-drafts/issues/1217 using your GitHub account

Received on Thursday, 13 April 2017 09:33:22 UTC