Re: [css-syntax] ISSUE-329: @charset has no effect on stylesheet?? from Zack Weinberg on 2014-01-24 (www-style@w3.org from January 2014)

From: Zack Weinberg <zackw@panix.com>
Date: Fri, 24 Jan 2014 10:12:16 -0500
To: Simon Pieters <simonp@opera.com>, www-style list <www-style@w3.org>, www International <www-international@w3.org>, "Tab Atkins Jr." <jackalmage@gmail.com>
Message-ID: <52E282D0.7010608@panix.com>
On 2014-01-24 3:58 AM, Simon Pieters wrote:
> On Thu, 23 Jan 2014 21:35:28 +0100, Zack Weinberg <zackw@panix.com>
> wrote:
>
>> #### Summary of how style sheet encoding is determined
>>
>>> This section is non-normative.
>>
>> [UTF-8]() is the default character encoding for CSS.
>
> I think this is a confusing statement. It sounds like if you don't
> specify an encoding, you get utf-8.

That is technically true but you're right that it's misleading - there's
basically always going to be an environment encoding, so the ultimate
default being UTF-8 hardly ever matters in practice.

... What I wrote perhaps suffers from me not actually believing Tab's
argument that new stylesheets can just use UTF-8 and not bother with
@charset.  Assuming for this discussion that the "can just use UTF-8"
part is not a problem, leaving out the @charset only works if every
referring document also uses UTF-8, or if the server is configured to
send Content-Type directives with an encoding annotation.  It has been
my experience that neither of these can actually be relied on in
practice (you would not *believe* how many web developers have told me
that HTTP headers are completely out of their control!)  So what I
actually think is that encoding should always be annotated in-band, even
if it is "the" encoding.  (The Encoding Standard seems to have gotten a
bit ... evangelical about UTF-8.)

But if our advice-to-authors is that they should put '@charset "utf-8";'
on all new stylesheets -- and that *is* what I think we should be
advising -- then I am a good deal more sympathetic to the I18N group's
concerns about the restricted syntax of @charset than I previously
indicated.  To this point, I would love to see some statistics on
@charset from a major-search-engine-scale crawl of the public 'net.  I
think that's the only way we can make an informed decision about whether
legacy compatibility constraints preclude changing it.

(I suppose a position consistent with all of the above would be 'for new 
content, use utf-8 and put a byte order mark on it' but I reject this 
because I don't like invisible in-band metadata any more than I like 
out-of-band metadata.)

To some of your other comments:

> Why is [HTTP metadata] the preferred mechanism for utf-8 (but not for other
> encodings?)?

I have the impression it's just preferred in general.  By implementors, 
anyway.  Not at all by authors.

>> * The referring document provides, explicitly or implicitly, an
>> [environment encoding]() which is assumed to apply to the style
>> sheet if neither of the above mechanisms provide an encoding.
>> Relying on the environment encoding is discouraged.
>
> Why is it discouraged?

Suppose you have a carefully written, UTF-8-encoded stylesheet that gets 
applied to all content on your site, and you use non-ASCII characters in 
it somewhere that they'll show up in rendered pages (perhaps in the 
quotes: or content: properties).  And it works great in the test 
environment, which contains only UTF-8 documents, and you believe Tab 
and you don't bother with @charset.  But then it gets uploaded to 
production, and the production CMS has hundreds of legacy documents that 
no one dares touch, and they're all Windows-1252.  Boom, all your nice 
quotation marks or whatever are now mojibake.
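
A minimal sketch of that failure, in Python. The selector and the choice of guillemets are arbitrary illustrations, and Python's `cp1252` codec is only an approximation of the windows-1252 decoder a browser would use:

```python
# A UTF-8 encoded style sheet with non-ASCII characters in the quotes: property.
css_bytes = 'q { quotes: "\u00ab" "\u00bb"; }'.encode("utf-8")

# Referring document is UTF-8: the environment encoding happens to match.
as_utf8 = css_bytes.decode("utf-8")

# Referring document is Windows-1252 and nothing else declares an encoding:
# each two-byte UTF-8 sequence is read as two separate characters.
as_cp1252 = css_bytes.decode("cp1252")

print(as_utf8)
print(as_cp1252)
```

The bytes are identical in both cases; only the inherited environment encoding differs.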

> Why is [UTF-16] more strongly discouraged than other non-utf-8 encodings?
> Since utf-8 is already must, I think it doesn't make sense to
> discourage other specific encodings.

I think I'm going to take the "mandated for new content by [ENCODING]" 
part back out.  But even if we keep that, I would want to deprecate 
UTF-16 more strongly than other legacy encodings, because it is 
ASCII-incompatible.

----
Revised proposal:
----

### The input byte stream

When parsing a stylesheet that was not embedded in some larger
document, the stream of Unicode [code points]() that comprises the
input to the tokenization stage may be initially seen by the user
agent as a stream of bytes (typically coming over the network or from
the local file system). To decode the stream of bytes into a stream of
[code points](), UAs must use the [decode]() algorithm defined in
[[ENCODING]](). This algorithm detects some encodings itself and
relies on contextual information in other cases.

Use of [UTF-8]() for new style sheets is strongly encouraged.  With the 
exception of [UTF-16](), use of which is strongly *discouraged*, the 
encoding of a style sheet must be [ASCII-compatible]() as defined in 
[[ENCODING]]().

#### Summary of how style sheet encoding is determined

 > This section is non-normative.

This section summarizes the algorithm that UAs use to determine the 
encoding of an input byte stream.  The following list is in descending 
order of priority: earlier items on the list override later items, if 
both are present.

  * If present, a [byte order mark]() can indicate encoding in
    either form of UTF-16 (big- or little-endian) or UTF-8.
    As specified in [[ENCODING]](), a byte order mark overrides
    all other information about the encoding of a style sheet.

  * The network protocol may provide the encoding of the style
    sheet as metadata (e.g. via the Content-Type response header for
    HTTP).

  * ASCII-compatible encodings may be indicated in-band by use of
    an [@charset directive]().  This directive is only honored if
    there is neither a byte order mark nor an encoding provided by
    metadata.

    > Warning: Although an [@charset directive]() textually resembles
    > an [at-rule](), it is not parsed as an at-rule; only a specific
    > byte sequence, beginning with the very first byte in the style
    > sheet, will be effective.  Variations, even those that would be
    > valid for a normal at-rule with the same syntax, are silently
    > ignored.

  * The referring document may provide, explicitly or implicitly, an
    [environment encoding]() which is assumed to apply to the style
    sheet if none of the above mechanisms provide an encoding.
    Relying on the environment encoding is discouraged.

  * If there is no [environment encoding](), the ultimate default is
    UTF-8.
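
The highest-priority item in the list, the byte order mark check, might be sketched as follows. The function name `sniff_bom` is hypothetical, and this is only an illustration of the three signatures, not the full [decode]() algorithm:

```python
def sniff_bom(data: bytes):
    """Return the encoding indicated by a leading byte order mark, or None."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if data.startswith(b"\xfe\xff"):
        return "utf-16be"
    if data.startswith(b"\xff\xfe"):
        return "utf-16le"
    # No BOM: the lower-priority mechanisms in the list get their turn.
    return None
```

A result other than `None` would override everything else on the list, including HTTP metadata and any @charset directive.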

#### Algorithm for determining the fallback encoding

The [decode]() algorithm takes as input a <dfn>fallback
encoding</dfn>, which UAs shall determine as follows:

 > Note: The [decode]() algorithm uses the [fallback encoding]() only
 > when no [byte order mark]() is present in the input.

1. If HTTP or an equivalent protocol defines an encoding (e.g. via the
    charset parameter of the Content-Type header), [get an encoding]()
    [[ENCODING]]() for the specified value. If that does not return
    failure, use the return value as the fallback encoding.

1. Otherwise, check for a <dfn>@charset directive</dfn>.  If the
    initial sequence of bytes in the byte stream, beginning with the
    very first byte, matches the hex sequence

         40 63 68 61 72 73 65 74 20 22 LL* 22 3B

    where each `LL` byte must have a value between `23` and `7E`
    hexadecimal, inclusive, then [get an encoding]() [[ENCODING]]() for
    the sequence of `LL` bytes, interpreted as ASCII.

    > Note: This byte sequence, when decoded as ASCII, is the string
    > ‘`@charset "…";`’ where the "…" is the sequence of `LL` bytes
    > specifying the encoding’s label.

    > Note: UAs may impose an arbitrary limit upon the number of `LL`
    > bytes scanned, as long as it is large enough to encompass all of
    > the [labels]() defined in [[ENCODING]](); presently these are all
    > 19 or fewer bytes long.

    If the [get an encoding]() algorithm returns `utf-16be` or
    `utf-16le`, use `utf-8` as the fallback encoding.  If it returns
    anything else except failure, use the return value as the fallback
    encoding.

    > Note: `utf-16be` and `utf-16le` cannot possibly be correct when
    > returned by the [get an encoding]() algorithm in this context,
    > because they are ASCII-incompatible and the [@charset directive]()
    > is only recognized when encoded compatibly with ASCII.
    > This mimics the behavior of HTML `<meta>` elements when used to
    > declare an encoding in-band.

1. Otherwise, if an [environment encoding]() is provided by the
    referring document, use that as the fallback encoding.

1. Otherwise, use `utf-8` as the fallback encoding.
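
The four steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: `LABELS` is a deliberately tiny, hypothetical stand-in for the label table in [[ENCODING]](), and the names `get_an_encoding`, `fallback_encoding`, `http_charset` and `environment_encoding` are invented for the example.

```python
import re

# Hypothetical, incomplete stand-in for the label table in the Encoding Standard.
LABELS = {
    "utf-8": "utf-8", "utf8": "utf-8",
    "windows-1252": "windows-1252", "iso-8859-1": "windows-1252",
    "latin1": "windows-1252",
    "utf-16": "utf-16le", "utf-16le": "utf-16le", "utf-16be": "utf-16be",
}

def get_an_encoding(label):
    """Approximate 'get an encoding': trim ASCII whitespace, match
    case-insensitively, and return None to signal failure."""
    return LABELS.get(label.strip(" \t\n\f\r").lower())

# 40 63 68 61 72 73 65 74 20 22 LL* 22 3B, with each LL in 0x23..0x7E.
CHARSET_RE = re.compile(rb'@charset "([\x23-\x7e]*)";')

def fallback_encoding(data, http_charset=None, environment_encoding=None):
    # Step 1: encoding supplied by HTTP or an equivalent protocol.
    if http_charset is not None:
        enc = get_an_encoding(http_charset)
        if enc is not None:
            return enc
    # Step 2: @charset directive, matched from the very first byte only.
    m = CHARSET_RE.match(data)
    if m:
        enc = get_an_encoding(m.group(1).decode("ascii"))
        if enc in ("utf-16be", "utf-16le"):
            return "utf-8"  # cannot be correct here; see the note above
        if enc is not None:
            return enc
    # Step 3: environment encoding from the referring document.
    if environment_encoding is not None:
        return environment_encoding
    # Step 4: ultimate default.
    return "utf-8"

print(fallback_encoding(b'@charset "iso-8859-1"; q { quotes: none }'))
```

Because the pattern is anchored at the first byte and matched literally, variations such as a leading space, `@CHARSET`, or single quotes fall through to the environment encoding, as the warning in the summary describes.
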
Received on Friday, 24 January 2014 15:12:51 UTC