[csswg-drafts] [css-syntax-3] Input stream processing can calculate wrong encoding (#4126) from Mark Rogers via GitHub on 2019-07-18 (public-css-archive@w3.org from July 2019)

From: Mark Rogers via GitHub <sysbot+gh@w3.org>
Date: Thu, 18 Jul 2019 21:03:44 +0000
To: public-css-archive@w3.org
Message-ID: <issues.opened-469988272-1563483823-sysbot+gh@w3.org>

dd8 has just created a new issue for https://github.com/w3c/csswg-drafts:

== [css-syntax-3] Input stream processing can calculate wrong encoding ==
There's a difference between the encoding calculated by the css-syntax-3 spec and the CSS 2.1/2.2 spec, demonstrated by this file:
http://test.csswg.org/suites/css2.1/20110323/html4/support/at-charset-001.css

It's served with Content-Type: text/css; charset=shift_jis, and it starts with a Shift_JIS byte sequence that happens to match the UTF-8 BOM (great test case):

ef bb bf 2e e5 b9 b3 e5 92 8c 0d 0a 7b 0d 0a 20 |............{.. |

CSS 2.1/2.2 specifies that Content-Type wins over any BOM:
https://drafts.csswg.org/css2/syndata.html#charset

css-syntax-3 uses the 'Decode' algorithm and notes that it gives precedence to a byte order mark (BOM), using the fallback encoding only when no BOM is found:
https://drafts.csswg.org/css-syntax/#input-byte-stream

The CSS 2.x algorithm gets the correct encoding for the test file (Shift_JIS), but the css-syntax-3 algorithm gets the wrong one (UTF-8). Chrome and Firefox both seem to use the 2.x method of calculating the encoding for CSS.
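
To make the disagreement concrete, here is a rough Python sketch of the two precedence orders applied to the 16 bytes from the hex dump above. It is only an illustration under simplifying assumptions: the function names are made up for this example, the `@charset` rule and the environment/referrer encoding steps of both specs are left out, and it only picks an encoding label without decoding anything.

```python
import codecs

# First 16 bytes of at-charset-001.css, copied from the hex dump above.
SAMPLE = bytes.fromhex("ef bb bf 2e e5 b9 b3 e5 92 8c 0d 0a 7b 0d 0a 20")


def sniff_bom(data):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16le"
    return None


def encoding_bom_first(data, fallback):
    """Hypothetical helper: BOM wins, fallback is used only when no BOM is found.

    This mirrors the behaviour described for css-syntax-3, where the
    Content-Type charset only feeds the fallback encoding.
    """
    return sniff_bom(data) or fallback


def encoding_transport_first(data, http_charset):
    """Hypothetical helper: Content-Type charset wins over any BOM.

    This mirrors the CSS 2.x ordering in simplified form; UTF-8 is the
    last-resort default when neither source gives an answer.
    """
    return http_charset or sniff_bom(data) or "utf-8"


if __name__ == "__main__":
    print("BOM first:      ", encoding_bom_first(SAMPLE, "shift_jis"))
    print("transport first:", encoding_transport_first(SAMPLE, "shift_jis"))
```

With the Content-Type charset passed in as `shift_jis`, the BOM-first order picks UTF-8 while the transport-first order keeps Shift_JIS, which is the mismatch reported here.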

FWIW, I think the 'Decode' algorithm works well with HTML (and XML) because those documents are most likely to begin with `<!DOCTYPE`, `<!--comment`, `<html`, or whitespace, so they can't accidentally match the BOM in an ASCII-compatible encoding. I think 'Decode' works less well for CSS, which can start with arbitrary non-ASCII code points as part of a CSS selector.


Please view or discuss this issue at https://github.com/w3c/csswg-drafts/issues/4126 using your GitHub account

Received on Thursday, 18 July 2019 21:03:51 UTC