Re: [css-syntax] ISSUE-329: @charset has no effect on stylesheet?? from Tab Atkins Jr. on 2014-01-23 (www-style@w3.org from January 2014)

From: Tab Atkins Jr. <jackalmage@gmail.com>
Date: Thu, 23 Jan 2014 10:18:13 -0800
To: Asmus Freytag <asmusf@ix.netcom.com>
Cc: Simon Sapin <simon.sapin@exyr.org>, "Phillips, Addison" <addison@lab126.com>, Anne van Kesteren <annevk@annevk.nl>, Richard Ishida <ishida@w3.org>, Zack Weinberg <zackw@panix.com>, www-style list <www-style@w3.org>, www International <www-international@w3.org>
Message-ID: <CAAWBYDButYs5Suy8uaJ7rbp8h6YXVrJXSn6CzavQjZmf6VgkTg@mail.gmail.com>

On Thu, Jan 23, 2014 at 9:32 AM, Asmus Freytag <asmusf@ix.netcom.com> wrote:
> Presumably the @charset continues to exist because there are situations
> where it's needed.

Yup, legacy stylesheets.

> If the only conceivable situations are parsing of already existing
> stylesheets, then an argument that any change that affects the
> interpretation of any of these stylesheets is inherently risky and, in a
> strict sense would violate the backwards compatibility, does have some
> merit. (But remember, we are talking about making them behave as declared,
> not behave in an unusual new way.)
>
> On the other hand, if there are any situations where someone needs to create
> new stylesheets that are not in UTF-8 (for whatever reason) and if the
> intent is to support newly created stylesheets that are not UTF-8 in such
> cases, then having a syntax that is a "trap" is counterproductive. Those
> for whom the support of new non-UTF-8 stylesheets is perpetuated would be
> likely to fail in correctly availing themselves of that feature, because
> they happen to fall into the trap.

There is no reason to create a new stylesheet in any encoding other
than utf-8.  We need to get out of the trap of thinking that encodings
are in any way valuable.  They're a legacy pain, and we've fixed the
situation in practice by standardizing on a single encoding.

> If the intent is to force people to use UTF-8, then the way to do that would
> be to disregard the @charset declaration for any stylesheet that uses a
> feature not present in legacy stylesheets. That would be a clean solution.

No, that makes upgrading a legacy stylesheet fraught with danger.  The
presence of a "new" feature is not a version indicator, and CSS
doesn't have explicit version indicators to rely on either.

> Simply leaving a "gotcha" strikes me as suboptimal. It's not about "making
> it more convenient", but about making the process more robust, by separating
> the status of the feature (more or less deprecated) from accidents in
> syntax.

If an accident causes people to be frustrated and just write their
stylesheets in utf-8, that's a win. ^_^

More importantly, the @charset rule has *always* worked this way, with
the precise requirements on syntax.  There are likely legacy
stylesheets in the wild containing incorrectly written @charset rules,
which are depending on it not working.  (For example, they may have
pasted in some unicode characters, and then fiddled with their text
editor until it worked, ending up accidentally saving it in utf-8.)
Changing the syntax thus carries potential (and realistic) compatibility
risks.

Note as well that the suggested changes (allowing any number of spaces
between "@charset" and the string, and allowing either single or
double quotes) still do *not* make it sufficiently robust.  It's
still trivial to construct an @charset rule that won't give any
encoding information:

@charset /* comment! */ "windows-1252";

Since comments can be of arbitrary length, accommodating this means
that the encoding scanner has to look an *arbitrary distance* into the
document in the general case, before returning to the beginning and
parsing the stylesheet for real. This is not the type of thing we want
to encourage.
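
For contrast, a strict sniff is easy to bound. The following is only a
sketch in Python, assuming the accepted form is exactly the bytes
`@charset "`, a label, then `";` at the very start of the file. The
function name and the 1024-byte cap are illustrative assumptions, not
text quoted from the spec:

```python
import re

# Assumed strict form: "@charset", ONE space, a double-quoted label, then ";",
# anchored at byte 0. Anything else yields no encoding information.
_STRICT = re.compile(rb'@charset "([^"]*)";')


def sniff_charset(data: bytes, limit: int = 1024):
    """Return the declared label, or None if the strict form is absent."""
    m = _STRICT.match(data[:limit])
    return m.group(1).decode("ascii", "replace") if m else None


print(sniff_charset(b'@charset "windows-1252";'))
print(sniff_charset(b'@charset /* comment! */ "windows-1252";'))
```

The second call returns None: the scanner never looks past a fixed
prefix, so the commented variant gives no encoding information.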

But that's not all!  CSS is even more flexible than this!  The syntax
of an at-keyword token is basically just "@" followed by an ident.  In
particular, you can escape characters in the ident part, and it's
equivalent to the non-escaped version.  That means that the following
are all *exactly equivalent* as far as the CSSOM and grammar validity
are concerned:

@charset
@\63harset
@\63 harset (To avoid ambiguous situations, an optional space is
allowed after the escape!)
@\63\68\61\72s\65\74
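
To see why these are equivalent, here is a rough Python sketch of
escape resolution in identifiers. It is a simplification (the helper
name and the regex are mine, and edge cases such as EOF after a
backslash are ignored), not the tokenizer any UA actually ships:

```python
import re

# A backslash followed by 1-6 hex digits plus ONE optional whitespace character,
# or a backslash followed by any single non-newline character.
_ESCAPE = re.compile(r'\\(?:([0-9a-fA-F]{1,6})[ \t\n]?|([^\n]))')


def unescape_ident(raw: str) -> str:
    """Resolve CSS escapes in an identifier (simplified sketch)."""
    def repl(m):
        if m.group(1) is not None:
            cp = int(m.group(1), 16)
            # Out-of-range or zero code points become U+FFFD in this sketch.
            return chr(cp) if 0 < cp <= 0x10FFFF else "\ufffd"
        return m.group(2)
    return _ESCAPE.sub(repl, raw)


forms = ["charset", r"\63harset", r"\63 harset", r"\63\68\61\72s\65\74"]
print([unescape_ident(f) for f in forms])
```

All four forms resolve to the same identifier, which is exactly what a
byte-level encoding scanner would have to replicate.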

The same applies to the contents of strings - you can use CSS escapes
there, too, including an *unlimited number* of escaped newlines.  The
following rules are completely equivalent as far as the CSS parser is
concerned:

@charset "utf-8";
@charset "\
\75\
\74\
\66\
\2d\
\38\
";

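A sketch of the same idea for string contents, again in Python and
again simplified (the helper name is hypothetical). The one addition
over identifiers is that a backslash followed by a newline contributes
nothing to the string's value:

```python
import re

# Order matters: escaped newline first, then hex escapes, then "any other char".
_STRING_ESCAPE = re.compile(
    r'\\(?:(\r\n|[\n\r\f])|([0-9a-fA-F]{1,6})[ \t\n]?|(.))', re.S
)


def unescape_string(raw: str) -> str:
    """Resolve CSS escapes inside a quoted string (simplified sketch)."""
    def repl(m):
        if m.group(1) is not None:      # escaped newline: dropped entirely
            return ""
        if m.group(2) is not None:      # hex escape such as \75
            cp = int(m.group(2), 16)
            return chr(cp) if 0 < cp <= 0x10FFFF else "\ufffd"
        return m.group(3)               # any other escaped character is itself
    return _STRING_ESCAPE.sub(repl, raw)


# Build: backslash-newline, then \75 \74 \66 \2d \38 (u, t, f, -, 8),
# each followed by another backslash-newline.
escaped = "\\\n" + "".join(f"\\{ord(c):x}\\\n" for c in "utf-8")
print(unescape_string(escaped))
```

Prepending thousands of extra backslash-newline pairs leaves the result
unchanged, so there is no fixed bound on how far a scanner must read.
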
And let's not forget the part between the at-keyword and the string.
Note that grammar definitions automatically allow any amount of
whitespace to be put between tokens, *including no whitespace at all*
(unless otherwise specified).  The only purpose whitespace serves is
to separate the tokens, and comments do the same job.  The following
rules are functionally equivalent as far as the CSS parser is
concerned:

@charset "utf-8";
@charset/**/"utf-8";
@charset        "utf-8"; (pretend that these spaces are a mixture of
spaces and tabs)
@charset


"utf-8";  (newlines are valid whitespace!)

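As an illustration, this is roughly the loop a scanner would need just
to get from the at-keyword to the string. It is a Python sketch with a
hypothetical helper name, and it assumes the keyword itself was written
without escapes:

```python
def skip_trivia(css: str, pos: int) -> int:
    """Advance past any run of whitespace and /* ... */ comments."""
    while pos < len(css):
        if css[pos] in " \t\n\r\f":
            pos += 1
        elif css.startswith("/*", pos):
            end = css.find("*/", pos + 2)
            if end == -1:               # unterminated comment runs to end of input
                return len(css)
            pos = end + 2
        else:
            break
    return pos


KEYWORD = "@charset"
variants = [
    '@charset "utf-8";',
    '@charset/**/"utf-8";',
    '@charset \t  \t "utf-8";',
    '@charset\n\n\n"utf-8";',
]
for v in variants:
    start = skip_trivia(v, len(KEYWORD))
    print(start, v[start:])

# The distance to the string is unbounded: pad the comment and it grows with it.
padded = "@charset/*" + "x" * 100_000 + '*/"utf-8";'
print(skip_trivia(padded, len(KEYWORD)))
```

Every variant ends up at the same `"utf-8";` remainder, but the offset
at which it starts depends entirely on what the author put in between.
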
CSS parsing is much more complex than you realize, and maintaining
parity with the full parser is a rather difficult job.  Invoking all
of this complexity for the purpose of determining an encoding is *not*
a good idea.  The most reasonable (and most likely to be correct!)
implementation strategy becomes to just assume the stylesheet is in
utf-8, parse it, check for an @charset rule, and re-parse if your
assumption was wrong.  Since this is ridiculously inefficient, we'll
instead get impls manually implementing *some* of the flexibility that
CSS allows, so authors doing something a little weird (as authors are
wont to do - there's a *lot* of them) will get their encoding detected
in some UAs and not others.  While I can't think of any attack
scenarios off the top of my head, encoding confusion often leads to
unforeseen vulnerabilities.
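
The assume-then-reparse strategy can be sketched in a few lines of
Python. Everything here is an assumption for illustration: the regex is
a hypothetical stand-in for a real CSS parser (it tolerates whitespace
and comments but not escapes), and Python's codec names are not the
same thing as the encoding labels a browser recognizes:

```python
import codecs
import re

# Hypothetical stand-in for "parse it, check for an @charset rule".
_RULE = re.compile(r'@charset(?:\s|/\*.*?\*/)*(["\'])(.*?)\1\s*;', re.S)


def decode_stylesheet(data: bytes):
    """Return (text, encoding, passes) using assume-utf-8-then-reparse."""
    text = data.decode("utf-8", errors="replace")       # pass 1: assume utf-8
    m = _RULE.match(text)
    if m is None:
        return text, "utf-8", 1
    try:
        name = codecs.lookup(m.group(2)).name           # normalize the label
    except LookupError:                                 # unknown label: keep utf-8
        return text, "utf-8", 1
    if name == "utf-8":
        return text, "utf-8", 1
    # Assumption was wrong: throw the first pass away and do it all again.
    return data.decode(name, errors="replace"), name, 2


sheet = '@charset/**/"windows-1252";\n.a::before { content: "\u00e9"; }'
text, encoding, passes = decode_stylesheet(sheet.encode("cp1252"))
print(encoding, passes)
```

Any stylesheet that declares a non-utf-8 encoding pays for two full
passes, which is the inefficiency described above.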

So, in sum:

1. Nobody should be using @charset in the first place. We only retain
it for legacy purposes, and new stylesheets should just be done in
utf-8.
2. There is a realistic concern that we're already under legacy
constraints to not loosen the syntax.
3. CSS parsing allows for *far* more variation than just "more spaces
and either type of quote".
4. UAs are very unlikely to implement the full flexibility of CSS
parsing just for encoding detection.
5. If we specify only a subset of allowed variation, the original goal
of making encoding detection aligned with valid @charset rules is
still not satisfied.

For all these reasons, I strongly reject any proposal to change the
current specification regarding the strictness of the encoding
declaration syntax.

~TJ
Received on Thursday, 23 January 2014 18:19:04 UTC