Re: [csswg-drafts] [cssom][all specs defining IDL] Consider USVString instead of DOMString, replacing surrogates with U+FFFD

The CSS Working Group just discussed Consider using USVString instead of DOMString, and agreed to the following resolutions:

```
RESOLVED: CSSOM can use either USVString or DOMString
```

<details><summary>The full IRC log of that discussion</summary>

```
<TabAtkins> Topic: Consider using USVString instead of DOMString
<fantasai> ScribeNick: fantasai
<fantasai> SimonSapin: In JS, strings are made of a sequence of 16-bit integers
<fantasai> SimonSapin: Can be arbitrary sequence
<fantasai> SimonSapin: Usually interpreted as UTF-16
<fantasai> SimonSapin: But don't have to be well-formed UTF-16
<fantasai> SimonSapin: In particular, range of values that are called surrogates
<fantasai> SimonSapin: If you have a leading surrogate plus trailing surrogate, i.e. 2 UTF-16 ints, that forms a single Unicode codepoint
<fantasai> SimonSapin: But in JS, nothing stops surrogates from appearing in the wrong order, or a single one by itself
<fantasai> SimonSapin: This is invalid Unicode
<fantasai> SimonSapin: But you can do it in JS
<fantasai> SimonSapin: If you want to convert that string to UTF-8, UTF-8 is designed to exclude surrogate codepoints to align with set of valid UTF-16 strings
<fantasai> SimonSapin: So not all JS strings can be represented in UTF-8 without losing data or using escaping mechanism
<fantasai> SimonSapin: Wasn't an issue, because every browser internally uses same type of string as JS
<dbaron> Github topic: https://github.com/w3c/csswg-drafts/issues/1217
<fantasai> SimonSapin: So if you have CSSOM string that has unpaired surrogate, e.g. in an ident or content property string
<fantasai> SimonSapin: it's ok
<fantasai> SimonSapin: What's changing now is that in Firefox, we have a project called Stylo, which is to import Servo style system into Gecko
<fantasai> SimonSapin: That style system is using Rust str type for all strings
<fantasai> SimonSapin: which is based on UTF-8, so it cannot represent unpaired surrogates
<fantasai> SimonSapin: what that means is in practice, whenever a string comes from CSSOM and goes into the style system in Servo and in the future in Firefox, we convert to UTF-8 and in that process, any unpaired surrogate is replaced with U+FFFD REPLACEMENT CHARACTER
<fantasai> SimonSapin: So there is some data loss
<fantasai> SimonSapin: However, I think this kind of situation only happens accidentally
<fantasai> SimonSapin: Fact that JS strings are this way is not a feature, it's a historical accident
<fantasai> SimonSapin: I don't think there is a real compat risk with shipping Firefox this way
<fantasai> SimonSapin: Still, it's a deviation from current interoperable behavior, so wanted to bring it up
<fantasai> Florian: Proposal?
<fantasai> SimonSapin: In WebIDL which we use to define interfaces for JS, there are two string types. DOMString corresponds to JS strings with arbitrary 16-bit integers
<fantasai> SimonSapin: There is USVString, Unicode Scalar Value String, which has no unpaired surrogates, only well-formed Unicode
<fantasai> SimonSapin: When you convert DOMString to that, you get the same behavior as in Servo, replacing lone surrogates with U+FFFD
<fantasai> SimonSapin: If we want to keep this interoperable, then I propose to use USVString for all of CSSOM
<fantasai> ChrisL_: Seems like a good idea, since unpaired surrogates are only an error
<fantasai> ChrisL_: Only used for binary data, and can't imagine that in CSSOM
<fantasai> TabAtkins: USVString is supposed to be avoided in WebIDL
<fantasai> TabAtkins: Currently only used in networking protocols that use scalar values
<fantasai> TabAtkins: Requires extra processing compared to UTF-16 strings
<fremy> ChrisL_: maybe in custom properties though, people would want to store binary data; they should encode it to avoid syntax issues though so no big deal
<fantasai> dbaron: Anne disagrees with advice in WebIDL spec, btw
<tantek> s/Anne/Annevk
<fantasai> dbaron: There's a github issue against WebIDL spec to give coherent advice, but ppl disagree on what that should be
<fantasai> dino: Appreciate that you want to use Rust string type, but we all have to use our own string types
<fantasai> dino: Maybe resolution is all DOM strings should be that way
<fantasai> TabAtkins: No, that would break a lot of things
<fantasai> TabAtkins: ppl smuggle binary data in JS strings
<fantasai> TabAtkins: But for things that talk text, could do it
<fantasai> dino: Everything, not just CSSOM
<fantasai> Florian: Would it be reasonable for implementations that don't do Rust strings internally?
<fantasai> myles_: If we don't know perf impact, can't agree to do this
<dbaron> myles_: so somebody somewhere has to try it first before we agree to it
<fantasai> SimonSapin: Tab, ?
<iank_> q+
<fantasai> TabAtkins: Some DOM APIs have to be 16-bit, e.g. Fetch ...
<fantasai> rbyers: It's not in Chrome
<fantasai> TabAtkins muses
<SimonSapin> s/?/did you mean changing JS or DOM would break things?/
<fantasai> till: It's not entirely out of the question that it would be Web-compatible enough that we could change it in JS itself
<fantasai> fantasai: For JS itself, Tab was saying it's not doable, but for DOM APIs more likely to be possible
<fantasai> iank_: Need to check with architecture folks about this
<fantasai> iank_: our architecture folks in charge of bindings and string types and stuff
<fantasai> iank_: Looked for code where we switch to USVStrings, and that's very expensive for us it looks like
<fantasai> iank_: Might be perf problems
<fantasai> fantasai: My take is that this is a very weird edge case with no real use... lone surrogates in the CSSOM.
<fantasai> fantasai: So I would say, let's spec you can use either, and we don't care.
<fantasai> myles_: Every string would have to get transcoded, that's crazy
<fantasai> TabAtkins: ....
<fantasai> iank_: Would have to guarantee that that block internally is clean
<fantasai> TabAtkins: Move to UTF-8 clean internally
<fantasai> iank_: Sounds non-trivial
<fantasai> Florian: Spec that either is Okay then it's not any work
<fantasai> Florian: If we can't spec that, then it means web depends on it, so Servo will have to bite the bullet
<fantasai> TabAtkins: I'm okay with doing that, put a note that we'd like to move to USVString
<fantasai> shane: If there's a web compat problem, then it's a problem
<fantasai> TabAtkins: That means someone is injecting lone surrogates into the CSSOM. Can't come out of the parser
<fantasai> TabAtkins: In that case probably buggy anyway
<fantasai> shane: If ppl notice a problem, they'll file bugs
<fantasai> eae: It's very hard to get into the situation except intentionally
<fantasai> myles_: Does Servo have to translate between JS string and USVString all the time?
<fantasai> SimonSapin: Yes
<fantasai> SimonSapin: We have optimizations, e.g. if ASCII then stored in one byte per unit, skip UTF-8 conversion
<fantasai> s/USVString/UTF-8 String/
<fantasai> Florian: Can we just resolve on both and if it's a problem, come back and we'll change the spec?
<fantasai> fantasai: I think interop in this very very weird case is not worth any effort, so it should allow both
<fantasai> till: It's not servo-specific, others might want UTF-8 codepaths
<fantasai> myles_: I believe that you believe that.
<fantasai> Rossen: Any objections?
<fantasai> rbyers: We should rediscuss if we find web compat issues
<fantasai> RESOLVED: CSSOM can use either USVString or DOMString
<fantasai> fantasai: We can always raise issues if they're found later.
<fantasai> SimonSapin: This also affects other specs with WebIDL interfaces, e.g. CSS Fonts defines @font-face interfaces
<fantasai> Florian: Should we define a CSSString?
<fantasai> ...
<fantasai> iank_: But if we do this later..
<fantasai> fantasai: We are literally deciding that you can do either, forever. Unless someone comes back and says "lone surrogates in CSSOM are an important use case and I need them"
<fantasai> Rossen discusses agenda items
```
</details>
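
For readers unfamiliar with the behavior discussed in the log, here is a minimal sketch of the DOMString to USVString conversion, in which each unpaired surrogate is replaced with U+FFFD while well-formed pairs pass through unchanged. The function name `toUSVString` is a hypothetical helper chosen for illustration; it is not taken from any browser or from Servo, and real engines may implement this conversion quite differently.

```typescript
// Hypothetical helper (not an engine API): returns a copy of `input`
// in which every unpaired surrogate code unit is replaced with U+FFFD.
function toUSVString(input: string): string {
  let out = "";
  for (let i = 0; i < input.length; i++) {
    const unit = input.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // Leading surrogate: keep it only if a trailing surrogate follows.
      const next = i + 1 < input.length ? input.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        out += input.charAt(i) + input.charAt(i + 1);
        i++; // the pair has been consumed
      } else {
        out += "\ufffd";
      }
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Trailing surrogate with no leading surrogate before it.
      out += "\ufffd";
    } else {
      out += input.charAt(i);
    }
  }
  return out;
}

// A lone leading surrogate between two ASCII letters is lost in conversion.
const lossy: string = toUSVString("a\ud83db"); // "a\ufffdb"
```

Under this sketch, a string such as `"\ud83d\ude00"` (a valid pair) round-trips unchanged, whereas the same two code units in reversed order become two U+FFFD characters. This is the data loss SimonSapin describes.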


-- 
GitHub Notification of comment by css-meeting-bot
Please view or discuss this issue at https://github.com/w3c/csswg-drafts/issues/1217#issuecomment-295053842 using your GitHub account

Received on Wednesday, 19 April 2017 03:08:35 UTC