Re: [csswg-drafts] [css-syntax] The tokenizer input should probably be a stream of scalar values, not codepoints (#3307)

(I can't attend this week's meeting, so consider this an Agenda+ for *next* week, unless someone feels like presenting it based on the below.)

Details:

* per <https://drafts.csswg.org/css-syntax/#input-byte-stream>, the decode operation is used to turn bytes into codepoints; this algorithm will only return scalar values (never surrogate codepoints)
* per <https://drafts.csswg.org/css-syntax/#consume-escaped-code-point>, using a codepoint escape will always produce scalar values (never surrogate codepoints)
* the only way, currently, to produce a surrogate codepoint is by directly assigning a DOMString with one in it via an OM operation
* Firefox doesn't allow this: presumably because it uses USVString for its CSSOMString, assigning a string with a surrogate in it replaces the surrogate with U+FFFD.
* Syntax already has *some* codepoint filtering built in: <https://drafts.csswg.org/css-syntax/#input-preprocessing> replaces newlines with a canonical newline, and replaces NULL with U+FFFD (same as trying to escape a null).

So, my proposal is that we add another codepoint-filtering rule that turns surrogates into U+FFFD, effectively meaning that all the algorithms take streams of scalar values. This will match Firefox's behavior, and make the various entry points agree on which codepoints can be used.
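
As a rough illustration only (not spec text), the filtering could look something like the sketch below. The function name `preprocess` and the constant `REPLACEMENT` are made up for this example; the newline and NULL handling is meant to mirror the existing input-preprocessing step, and the surrogate check is the proposed addition.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def preprocess(code_points: str) -> str:
    """Hypothetical sketch of input preprocessing with the proposed rule.

    Existing rules: CR LF, lone CR, and FF become LF; NULL becomes U+FFFD.
    Proposed rule: any surrogate (U+D800..U+DFFF) also becomes U+FFFD,
    so the output contains only scalar values.
    """
    # Normalize newlines. CR LF must be handled before lone CR.
    normalized = (
        code_points.replace("\r\n", "\n").replace("\r", "\n").replace("\f", "\n")
    )
    # Replace NULL and surrogates with U+FFFD; pass everything else through.
    return "".join(
        REPLACEMENT if ch == "\x00" or 0xD800 <= ord(ch) <= 0xDFFF else ch
        for ch in normalized
    )
```

With this in place, a lone surrogate that arrives through an OM assignment ends up as U+FFFD before the tokenizer sees it, while astral characters that are already single scalar values pass through untouched.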

-- 
GitHub Notification of comment by tabatkins
Please view or discuss this issue at https://github.com/w3c/csswg-drafts/issues/3307#issuecomment-442198338 using your GitHub account

Received on Tuesday, 27 November 2018 20:05:58 UTC