- From: Tab Atkins Jr. <jackalmage@gmail.com>
- Date: Thu, 23 Jan 2014 10:18:13 -0800
- To: Asmus Freytag <asmusf@ix.netcom.com>
- Cc: Simon Sapin <simon.sapin@exyr.org>, "Phillips, Addison" <addison@lab126.com>, Anne van Kesteren <annevk@annevk.nl>, Richard Ishida <ishida@w3.org>, Zack Weinberg <zackw@panix.com>, www-style list <www-style@w3.org>, www International <www-international@w3.org>
On Thu, Jan 23, 2014 at 9:32 AM, Asmus Freytag <asmusf@ix.netcom.com> wrote:
> Presumably the @charset continues to exist because there are situations
> where it's needed.

Yup, legacy stylesheets.

> If the only conceivably situations are parsing of already existing
> stylesheets, then an argument that any change that affects the
> interpretation of any of these stylesheets is inherently risky and, in a
> strict sense would violate the backwards compatibility, does have some
> merit. (But remember, we are talking about making them behave as declared,
> not behave in an unusual new way.)
>
> On the other hand, if there are any situations where someone needs to create
> new stylesheets that are not in UTF-8 (for whatever reason) and if the
> intent is to support newly created stylesheet that are not UTF-8 in such
> cases, then having a syntax that is a "trap" is counter productive. Those
> for whom the support of new non-UTF-8 stylesheets is perpetuated would be
> likely to fail in correctly availing themselves of that feature, because
> they happen to fall into the trap.

There is no reason to create a new stylesheet in any encoding other
than utf-8. We need to get out of the trap of thinking that encodings
are in any way valuable. They're a legacy pain, and we've fixed the
situation in practice by standardizing on a single encoding.

> If the intent is to force people to use UTF-8, then the way to do that would
> be to disregard the @charset declaration for any stylesheet that uses a
> feature not present in legacy stylesheets. That would be a clean solution.

No, that makes upgrading a legacy stylesheet fraught with danger. The
presence of a "new" feature is not a version indicator, and CSS
doesn't have explicit version indicators to rely on either.

> Simply leaving a "gotcha" strikes me as suboptimal. It's not about "making
> it more convenient", but about making the process more robust, by separating
> the status of the feature (more or less deprecated) from accidents in
> syntax.

If an accident causes people to be frustrated and just write their
stylesheets in utf-8, that's a win. ^_^

More importantly, the @charset rule has *always* worked this way, with
the precise requirements on syntax. There are likely legacy
stylesheets in the wild containing incorrectly written @charset rules,
which are depending on it not working. (For example, they may have
pasted in some unicode characters, and then fiddled with their text
editor until it worked, ending up accidentally saving it in utf-8.)
Changing the syntax thus has potential (and realistic) compatibility
constraints.

Note as well that the suggested changes (allowing any number of spaces
between "@charset" and the string, and allowing either single or
double quotes) still do *not* make it sufficiently robust. It's still
trivial to construct an @charset rule that won't give any encoding
information:

    @charset /* comment! */ "windows-1252";

Since comments can be an arbitrary length, accommodating this means
that the encoding scanner has to look an *arbitrary distance* into the
document in the general case, before returning to the beginning and
parsing the stylesheet for real. This is not the type of thing we
want to encourage.

But that's not all! CSS is even more flexible than this! The syntax
of an at-keyword token is basically just "@" followed by an ident. In
particular, you can escape characters in the ident part, and it's
equivalent to the non-escaped version. That means that the following
are all *exactly equivalent* as far as the CSSOM and grammar validity
are concerned:

    @charset
    @\63harset
    @\63 harset  (To avoid ambiguous situations, an optional space is
                  allowed after the escape!)
    @\63\68\61\72s\65\74

The same applies to the contents of strings - you can use CSS escapes
there, too, including an *unlimited number* of escaped newlines. The
following rules are completely equivalent as far as the CSS parser is
concerned:

    @charset "utf-8";
    @charset "\
    \75\
    \74\
    \66\
    \2d\
    \38\
    ";

And let's not forget the part between the at-keyword and the string.
Note that grammar definitions automatically allow any amount of
whitespace to be put between tokens, *including no whitespace at all*
(unless otherwise specified). The only purpose whitespace serves is
to separate the tokens, and comments do the same job. The following
rules are functionally equivalent as far as the CSS parser is
concerned:

    @charset "utf-8";
    @charset/**/"utf-8";
    @charset      "utf-8";  (pretend that these spaces are a mixture of
                             spaces and tabs)
    @charset
    "utf-8";  (newlines are valid whitespace!)

CSS parsing is much more complex than you realize, and maintaining
parity with the full parser is a rather difficult job. Invoking all
of this complexity for the purpose of determining an encoding is *not*
a good idea. The most reasonable (and most likely to be correct!)
implementation strategy becomes to just assume the stylesheet is in
utf-8, parse it, check for a @charset rule, and re-parse if your
assumption was wrong. Since this is ridiculously inefficient, we'll
instead get impls manually implementing *some* of the flexibility that
CSS allows, so authors doing something a little weird (as authors are
wont to do - there's a *lot* of them) will get their encoding detected
in some UAs and not others. While I can't think of any attack
scenarios off the top of my head, encoding confusion often leads to
unforeseen vulnerabilities.

So, in sum:

1. Nobody should be using @charset in the first place. We only retain
   it for legacy purposes, and new stylesheets should just be done in
   utf-8.
2. There is a realistic concern that we're already under legacy
   constraints to not loosen the syntax.
3. CSS parsing allows for *far* more variation than just "more spaces
   and either type of quote".
4. UAs are very unlikely to implement the full flexibility of CSS
   parsing just for encoding detection.
5. If we specify only a subset of allowed variation, the original goal
   of making encoding detection aligned with valid @charset rules is
   still not satisfied.

For all these reasons, I strongly reject any proposal to change the
current specification regarding the strictness of the encoding
declaration syntax.

~TJ
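
To make the strictness argued for in this message concrete, here is a minimal Python sketch of a byte-level encoding scanner. It is an illustration under stated assumptions, not any UA's actual code: it assumes the only accepted form is the literal bytes `@charset "`, then a label, then `";`, at the very start of the file, and that the scan is capped at 1024 bytes. The name `sniff_charset` and both constants are hypothetical, and mapping the label to a real encoding is deliberately left out.

```python
from typing import Optional

# Assumption: the declaration must begin with exactly these bytes
# (one space, double quote), with nothing before them.
PREFIX = b'@charset "'
# Assumption: the scanner never looks further than this many bytes.
SCAN_LIMIT = 1024


def sniff_charset(data: bytes) -> Optional[str]:
    """Return the label of a strictly written @charset rule, else None.

    This is a plain byte comparison. No CSS tokenizing happens, so
    escapes, comments, and extra whitespace simply fail to match.
    """
    head = data[:SCAN_LIMIT]
    if not head.startswith(PREFIX):
        return None
    # The label runs up to the next double quote...
    end = head.find(b'"', len(PREFIX))
    # ...which must be followed immediately by a semicolon.
    if end == -1 or head[end + 1:end + 2] != b';':
        return None
    try:
        return head[len(PREFIX):end].decode('ascii')
    except UnicodeDecodeError:
        return None


if __name__ == '__main__':
    samples = [
        b'@charset "windows-1252";',
        b'@charset /* comment! */ "windows-1252";',
        b'@\\63harset "utf-8";',
        b'@charset/**/"utf-8";',
        b"@charset 'utf-8';",
    ]
    for sample in samples:
        print(sample, '->', sniff_charset(sample))
```

Only the first sample yields a label. The other variants are valid to the CSS parser but return None here, so a scanner of this shape would fall back to whatever other encoding information is available, with no need to look an arbitrary distance into the file.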
Received on Thursday, 23 January 2014 18:19:04 UTC