Re: Two new encoding related articles for review from Zack Weinberg on 2014-03-14 (www-style@w3.org from March 2014)

From: Zack Weinberg <zackw@panix.com>
Date: Fri, 14 Mar 2014 12:13:14 -0400
To: Richard Ishida <ishida@w3.org>
Cc: www International <www-international@w3.org>, W3C Style <www-style@w3.org>, "HTML WG (public-html@w3.org)" <public-html@w3.org>
Message-ID: <CAKCAbMjE+4+JO--7TFDDwirL87TTyLrQ2aDn8+46VDA+xbi7ZA@mail.gmail.com>
I'd like to suggest that the "Avoid these encodings" section at the
bottom of the "Choosing and applying a character set" document should
be merged into the "Choosing an encoding" section at the top of that
document.  You are saying the same thing in two places but slightly
differently (leading to confusion), and the "Avoid these encodings"
section is (IMHO) one of the most important bits of the document - it
should be up front.

I'd write it like this:

## Choosing an encoding

Encode new content in UTF-8.  The present generation of Web
standards, servers, clients, and libraries is designed to work best
with UTF-8, and it allows you to use the same encoding for all of your
content regardless of language.  If you have a corpus of "legacy"
content in some other encoding, you are strongly encouraged to convert
it within your server and send clients UTF-8 anyway.
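
A minimal sketch of that server-side conversion, in Python.  The
function name `transcode_to_utf8` and the choice of windows-1252 as the
legacy encoding are assumptions made for illustration; substitute
whatever encoding the stored content actually uses.

```python
def transcode_to_utf8(legacy_bytes: bytes, source_encoding: str) -> bytes:
    """Decode stored legacy bytes and re-encode them as UTF-8 for the wire."""
    # The default error handler is "strict": invalid input raises
    # UnicodeDecodeError instead of being silently corrupted.
    text = legacy_bytes.decode(source_encoding)
    return text.encode("utf-8")


# Hypothetical legacy document stored on disk as windows-1252.
stored = "café".encode("windows-1252")
body = transcode_to_utf8(stored, "windows-1252")

# The response should then be labelled to match what is actually sent.
content_type = "text/html; charset=utf-8"
```

Failing loudly on undecodable input is a deliberate choice here: a
corpus that does not decode cleanly is probably mislabelled, and that
is better discovered on the server than by clients.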

If it is *impossible* for you to send UTF-8 over the network, you need
to be aware that many other historical encodings are poorly, or not at
all, supported by Web clients.  [The Encoding Standard] contains an
*exhaustive* list of "legacy" character encodings that are supported:
anything not in the list simply will not work.

Furthermore, UTF-32, UTF-16, JIS_C6226-1983, JIS_X0212-1990,
HZ-GB-2312, JOHAB (Windows code page 1361), CESU-8, UTF-7, BOCU-1,
SCSU, ISO-2022 (all varieties), and EBCDIC (all varieties) MUST NOT be
used.  These encodings are *ASCII-incompatible* -- that is, in these
encodings, octets with values 00 through 7F (hexadecimal) are not
always interpreted as Unicode code points U+0000 through U+007F.  This
has historically been a source of security vulnerabilities.
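
To make "ASCII-incompatible" concrete, here is a small Python sketch.
The helper name `ascii_octets_survive` is made up for this example; it
only checks whether a run of octets in the 00-7F range decodes to the
same code points under a given encoding as it does under ASCII.

```python
def ascii_octets_survive(octets: bytes, encoding: str) -> bool:
    """True if `octets` decode to the same code points under `encoding`
    as they do under plain ASCII."""
    try:
        return octets.decode(encoding) == octets.decode("ascii")
    except UnicodeDecodeError:
        return False


# Only ASCII octets, and no '<' or '>' among them.
harmless_looking = b"+ADw-script+AD4-"

print(ascii_octets_survive(harmless_looking, "utf-8"))  # True
print(harmless_looking.decode("utf-7"))                 # <script>
print(ascii_octets_survive(b"ab", "utf-16-le"))         # False
```

A filter that inspects the octets for '<' sees nothing to escape, yet
a client that decodes them as UTF-7 ends up with markup.  That is the
kind of mismatch behind the vulnerabilities mentioned above.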

The Big5 and EUC-JP encodings suffer from interoperability problems
due to the large number of incompatible variants "in the wild", and
should be avoided.  ISO-8859-8 ("visually ordered" Hebrew) should also
be avoided; if UTF-8 cannot be used for Hebrew, use ISO-8859-8-I,
which, like Unicode, is "logically ordered".

The "replacement" encoding, listed in the Encoding Standard, is not
actually an encoding; it is a fallback whose decoder discards its
input and emits U+FFFD REPLACEMENT CHARACTER.  Obviously, it is not
useful to transmit data
in this encoding.  The "x-user-defined" encoding is a single-byte
encoding whose lower half is ASCII and whose upper half is mapped into
the Unicode Private Use Area.  As with the PUA in general, this
encoding is best avoided on the public Internet.
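
As a rough illustration of the x-user-defined mapping, the Python
sketch below follows my reading of the Encoding Standard's decoder:
octets 00-7F map to themselves and octets 80-FF map to U+F780 through
U+F7FF.  Python ships no codec under this name, so the function
`decode_x_user_defined` is hypothetical and written out by hand.

```python
def decode_x_user_defined(octets: bytes) -> str:
    """Decode octets the way x-user-defined is understood to work:
    the lower half is ASCII, the upper half lands in the Private Use Area."""
    return "".join(
        chr(b) if b < 0x80 else chr(0xF780 + b - 0x80)
        for b in octets
    )


decoded = decode_x_user_defined(b"A\x80\xff")
print([f"U+{ord(ch):04X}" for ch in decoded])  # ['U+0041', 'U+F780', 'U+F7FF']
```

The upper-half results carry no agreed meaning, so a recipient cannot
render them sensibly without a private agreement with the sender.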

---

The other document ("Declaring character encodings in CSS") looks good
to me, except for one technical point that needs clarifying: If there
is a byte order mark, that means the '@' in '@charset' is not the
first byte of the stylesheet, and therefore the @charset directive is
ineffective.  (Unless the bit about IE 10 and 11 means that they skip
the BOM when looking for @charset?)
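
The byte-position issue can be shown with a short Python sketch.  It
models only the literal reading described above, where the rule counts
only if the stylesheet's very first bytes are `@charset "`; it makes
no claim about what any particular browser does.  The names
`CHARSET_PREFIX` and `charset_rule_at_byte_zero` are invented for the
example.

```python
import codecs

# The byte sequence a sniffer would look for at offset 0 (assumed here).
CHARSET_PREFIX = b'@charset "'


def charset_rule_at_byte_zero(stylesheet: bytes) -> bool:
    """True if the stylesheet begins, at its first byte, with @charset "."""
    return stylesheet.startswith(CHARSET_PREFIX)


plain = b'@charset "utf-8";\nbody { color: black }\n'
with_bom = codecs.BOM_UTF8 + plain  # EF BB BF precedes the '@'

print(charset_rule_at_byte_zero(plain))     # True
print(charset_rule_at_byte_zero(with_bom))  # False
```

With the BOM present, the '@' sits at offset 3, so a check anchored at
offset 0 never sees the rule.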

zw


On Fri, Mar 7, 2014 at 7:49 AM, Richard Ishida <ishida@w3.org> wrote:
> Following on from the revision of the i18n article about encoding
> declarations in HTML (that review period ends today), I have revised and
> updated two further articles:
>
> Choosing & applying a character encoding
> http://www.w3.org/International/questions/qa-choosing-encodings-new
>
> Declaring character encodings in CSS
> http://www.w3.org/International/questions/qa-css-charset-new
>
> Please take a look and send any comments to www-international@w3.org before
> 14th March.
>
> Thanks,
> RI
>
>
>
> On 28/02/2014 14:20, Richard Ishida wrote:
>> An updated version of Declaring character encodings in HTML[1] is out
>> for review at
>>
>>
>> http://www.w3.org/International/questions/qa-html-encoding-declarations-new
>>
>> We are looking for comments before 7 March. Please send comments to
>> www-international@w3.org.
>>
>> After the review period is over, this content will be copied to the same
>> location as the current version of the document, ie.
>>
>> http://www.w3.org/International/questions/qa-html-encoding-declarations
>>
>> and the URL of the updated version will cease to exist.
>>
>> The update brings the article in line with recent developments in HTML5,
>> and de-emphasizes information about legacy formats.
>>
>> An attempt was also made to organize the material so that readers can
>> find information more quickly, and also de-clutter the essential
>> information by moving edge topics, such as UTF-16 and charset links,
>> down the page. This led to the article being almost completely rewritten.
>>
>>
>>
>> RI
>
Received on Friday, 14 March 2014 16:13:46 UTC