Re: [css-syntax] ISSUE-329: @charset has no effect on stylesheet?? from Zack Weinberg on 2014-01-23 (www-international@w3.org from January to March 2014)

From: Zack Weinberg <zackw@panix.com>
Date: Thu, 23 Jan 2014 15:35:28 -0500
To: www-style list <www-style@w3.org>, www International <www-international@w3.org>, "Tab Atkins Jr." <jackalmage@gmail.com>
Message-ID: <CAKCAbMjmbP=0_txD0tUoM_BmOGy2f70mNwMCYvs9xN5tY+A7Hw@mail.gmail.com>
On Thu, Jan 23, 2014 at 2:21 PM, Asmus Freytag <asmusf@ix.netcom.com> wrote:
> On 1/23/2014 10:18 AM, Tab Atkins Jr. wrote:
>> For all these reasons, I strongly reject any proposal to change the
>> current specification regarding the strictness of the encoding
>> declaration syntax.
>
> Will the spec be written accordingly?

If you're an implementor, the logic is already clear, although you
have to read both css-syntax and the Encoding Standard to see it.  We
could probably use some advice targeted at authors, and I think the
definition of the @charset directive could also be clearer.  Tab, what
do you think of this rewrite of section 3.2? (pseudo-Markdown)  The
most important bit is the new non-normative summary, which I think is
in terms that will be much clearer to authors.  I also tweaked the
wording of the actual algorithm a bit.

### The input byte stream

When parsing a stylesheet that was not embedded in some larger
document, the stream of Unicode [code points]() that comprises the
input to the tokenization stage may be initially seen by the user
agent as a stream of bytes (typically coming over the network or from
the local file system). To decode the stream of bytes into a stream of
[code points](), UAs must use the [decode]() algorithm defined in
[[ENCODING]](). This algorithm detects some encodings itself and
relies on contextual information in other cases.

#### Summary of how style sheet encoding is determined

> This section is non-normative.

[UTF-8]() is the default character encoding for CSS.  The use of
[UTF-8]() for new style sheets is mandated by [[ENCODING]]().  When
legacy requirements dictate the use of some other encoding, either for
the style sheet or some or all of its referring documents, authors may
set the encoding as follows:

 * The network protocol (e.g. HTTP) may supply an encoding for the
   style sheet as metadata; when available, use of this mechanism
   is preferred.  New content encoded in [UTF-8]() should be marked as
   such using this mechanism.

 * ASCII-compatible encodings may also be declared in-band by use of
   an [@charset directive]().  This directive is ignored if the
   network protocol supplies an encoding as metadata.

   > Warning: Although an [@charset directive]() textually resembles
   > an [at-rule](), it is not parsed as an at-rule; only a specific
   > byte sequence, beginning with the very first byte in the style
   > sheet, is accepted.

 * The referring document provides, explicitly or implicitly, an
   [environment encoding]() which is assumed to apply to the style
   sheet if neither of the above mechanisms provides an encoding.
   Relying on the environment encoding is discouraged.

 * [UTF-16]() encoding, which is not ASCII-compatible, may be declared
   out-of-band with network metadata or in-band with a [byte order mark](),
   but not with an [@charset directive]().  The use of [UTF-16]() is
   **strongly discouraged**.

   When present, a [byte order mark]() overrides any encoding set by
   network metadata, as specified in [[ENCODING]]().

 * ASCII-incompatible encodings other than [UTF-16]() may not be
   used, as specified in [[ENCODING]]().
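
To illustrate the precedence described in the summary, here is a minimal Python sketch of how a byte order mark takes priority over whatever encoding was chosen by other means. The function names `sniff_bom` and `effective_encoding` are hypothetical and exist only for this illustration; the three signatures checked are the UTF-8, UTF-16BE and UTF-16LE byte order marks. This is a sketch under those assumptions, not the [decode]() algorithm itself.

```python
from typing import Optional


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading byte order mark, if any."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if data.startswith(b"\xfe\xff"):
        return "utf-16be"
    if data.startswith(b"\xff\xfe"):
        return "utf-16le"
    return None


def effective_encoding(data: bytes, fallback: str) -> str:
    """Pick the encoding to decode with (hypothetical helper).

    `fallback` stands for the encoding chosen from network metadata,
    an @charset directive, or the environment encoding; a byte order
    mark, when present, overrides it.
    """
    return sniff_bom(data) or fallback
```

For example, a style sheet served with `charset=windows-1252` but beginning with the bytes `FF FE` would be treated as UTF-16LE in this sketch.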

#### Algorithm for determining the fallback encoding

The [decode]() algorithm takes as input a <dfn>fallback
encoding</dfn>, which UAs shall determine as follows:

> Note: The [decode]() algorithm uses the [fallback encoding]() only
> when no [byte order mark]() is present in the input.

1. If HTTP or an equivalent protocol defines an encoding (e.g. via the
   charset parameter of the Content-Type header), [get an encoding]()
   [[ENCODING]]() for the specified value. If that does not return
   failure, use the return value as the fallback encoding.

1. Otherwise, check for a <dfn>@charset directive</dfn>.  If the
   initial sequence of bytes in the byte stream, beginning with the
   very first byte, matches the hex sequence

        40 63 68 61 72 73 65 74 20 22 LL* 22 3B

   where each `LL` byte must have a value between `23` and `7E`
   hexadecimal, inclusive, then [get an encoding]() [[ENCODING]]() for
   the sequence of `LL` bytes, interpreted as ASCII.

   > Note: This byte sequence, when decoded as ASCII, is the string
   > ‘`@charset "…";`’ where the "…" is the sequence of `LL` bytes
   > specifying the encoding’s label.

   > Note: UAs may impose an arbitrary limit upon the number of `LL`
   > bytes scanned, as long as it is large enough to encompass all of
   > the [labels]() defined in [[ENCODING]](); presently these are all
   > 19 or fewer bytes long.

   If the [get an encoding]() algorithm returns `utf-16be` or
   `utf-16le`, use `utf-8` as the fallback encoding.  If it returns
   anything else except failure, use the return value as the fallback
   encoding.

   > Note: `utf-16be` and `utf-16le` cannot possibly be correct when
   > returned by the [get an encoding]() algorithm in this context,
   > because they are ASCII-incompatible and the [@charset directive]()
   > is only recognized when encoded compatibly with ASCII.
   > This mimics the behavior of HTML `<meta>` elements when used to
   > declare an encoding in-band.

1. Otherwise, if an [environment encoding]() is provided by the
   referring document, use that as the fallback encoding.

1. Otherwise, use `utf-8` as the fallback encoding.
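
The four steps above could be sketched in Python roughly as follows. Everything named here is an assumption made for illustration: `get_an_encoding` is a hypothetical stand-in for the [get an encoding]() algorithm backed by a deliberately tiny subset of the label table (with `None` standing for failure), and `fallback_encoding`, `protocol_charset` and `environment_encoding` are invented names. The sketch also reads the "Otherwise" chain as meaning that a step which ends in failure falls through to the next one.

```python
import re
from typing import Optional

# Bytes 40 63 68 61 72 73 65 74 20 22 LL* 22 3B, with each LL in 23..7E.
CHARSET_RE = re.compile(rb'@charset "([\x23-\x7e]*)";')

# Illustrative subset only; the real label table is much larger.
LABELS = {
    "utf-8": "utf-8",
    "utf8": "utf-8",
    "iso-8859-1": "windows-1252",
    "latin1": "windows-1252",
    "windows-1252": "windows-1252",
    "shift_jis": "shift_jis",
    "utf-16": "utf-16le",
    "utf-16le": "utf-16le",
    "utf-16be": "utf-16be",
}


def get_an_encoding(label: str) -> Optional[str]:
    """Hypothetical stand-in: map a label to an encoding, None on failure."""
    return LABELS.get(label.strip(" \t\n\x0c\r").lower())


def fallback_encoding(
    data: bytes,
    protocol_charset: Optional[str] = None,
    environment_encoding: Optional[str] = None,
) -> str:
    # Step 1: encoding supplied by HTTP or an equivalent protocol.
    if protocol_charset is not None:
        enc = get_an_encoding(protocol_charset)
        if enc is not None:
            return enc

    # Step 2: @charset directive, matched as bytes from the very first byte.
    m = CHARSET_RE.match(data)
    if m:
        enc = get_an_encoding(m.group(1).decode("ascii"))
        if enc in ("utf-16be", "utf-16le"):
            return "utf-8"  # cannot be right for an ASCII-compatible match
        if enc is not None:
            return enc

    # Step 3: environment encoding from the referring document.
    if environment_encoding is not None:
        return environment_encoding

    # Step 4: default.
    return "utf-8"
```

Because the match is byte-exact, a directive preceded by whitespace or written with single quotes is not recognized in this sketch, which is the behavior the warning in the summary describes.
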
Received on Thursday, 23 January 2014 20:35:54 UTC