Re: [csswg-drafts] [css-syntax-3] Input stream processing can calculate wrong encoding (#4126) from Mark Rogers via GitHub on 2019-12-13 (public-css-archive@w3.org from December 2019)

From: Mark Rogers via GitHub <sysbot+gh@w3.org>
Date: Fri, 13 Dec 2019 17:24:28 +0000
To: public-css-archive@w3.org
Message-ID: <issue_comment.created-565529695-1576257867-sysbot+gh@w3.org>

> It's served as Content-Type: text/css; charset=shift_jis. It also starts with a Shift_JIS byte sequence that happens to match the UTF-8 BOM (great test case)
`ef bb bf 2e e5 b9 b3 e5 92 8c 0d 0a 7b 0d 0a 20 |............{.. |`

>> Circling back around: do you have any evidence of pages breaking due to this behavior?

None other than the test case, but I don't see enough CSS files written in scripts where this could be a problem to provide data either way.

There are likely to be encodings other than Shift_JIS where `ef bb bf 2e` can appear as valid characters at the start of a file.

I guess my concerns are two-fold:

1) there's a coupling between BOM sniffing and the syntax of the document being sniffed: it's more reliable with some document types because their syntax makes it unlikely or impossible to have non-ASCII characters at offset zero

2) this coupling means sniffing can become less reliable due to syntax changes unrelated to CSS. For example, HTML introducing custom element names means `ef bb bf 2e` is more likely to appear at offset zero in CSS as the name of an element style rule.
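
To make the coupling concrete, here is a minimal sketch of the precedence being discussed, assuming a BOM match at offset zero always overrides the charset declared in `Content-Type`. The function name `sniff_encoding` and the `BOMS` table are hypothetical illustration names, not taken from any spec or implementation; the input is the first 16 bytes from the hex dump quoted above.

```python
# BOM signatures checked at offset zero (assumed set: UTF-8, UTF-16BE, UTF-16LE).
BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16be"),
    (b"\xff\xfe", "utf-16le"),
)


def sniff_encoding(data: bytes, declared: str) -> str:
    """Return the encoding of a matching BOM, else the declared label.

    Hypothetical helper: it only models the precedence (BOM first,
    declared charset second), not a full decode algorithm.
    """
    for signature, encoding in BOMS:
        if data.startswith(signature):
            return encoding
    return declared


# First bytes of the test stylesheet, copied from the quoted hex dump.
head = bytes.fromhex("ef bb bf 2e e5 b9 b3 e5 92 8c 0d 0a 7b 0d 0a 20")

# The leading ef bb bf matches the UTF-8 BOM, so the declared label is ignored.
print(sniff_encoding(head, "shift_jis"))

# Without those three bytes nothing matches and the declared label is used.
print(sniff_encoding(head[3:], "shift_jis"))
```

The sniffer never looks at what the bytes mean in the declared encoding, so any document syntax that allows non-ASCII characters at offset zero can trigger the override.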



-- 
GitHub Notification of comment by dd8
Please view or discuss this issue at https://github.com/w3c/csswg-drafts/issues/4126#issuecomment-565529695 using your GitHub account

Received on Friday, 13 December 2019 17:24:29 UTC