Re: Two new encoding related articles for review from Richard Ishida on 2014-03-17 (www-international@w3.org from January to March 2014)

From: Richard Ishida <ishida@w3.org>
Date: Mon, 17 Mar 2014 14:25:58 +0000
To: Zack Weinberg <zackw@panix.com>
CC: www International <www-international@w3.org>, W3C Style <www-style@w3.org>, "HTML WG (public-html@w3.org)" <public-html@w3.org>
Message-ID: <532705F6.8010907@w3.org>

Zack,

Thanks for these suggestions.  I take your point about the problem with
the inexact duplication of lists of encodings at top and bottom of the
article. My preferred solution, however, is to do the opposite of what 
you suggest and remove the list from the top of the document. This is 
because that section is intended to be a short answer, strongly 
encouraging the use of UTF-8, and reinforcing the idea that use of 
legacy encodings is only for unusual situations.  I did, however, reword 
some of the text that points to the section lower down, to ensure that 
those who cannot use UTF-8 can find further information easily.

These guidelines have to help people with limited knowledge of HTML and 
no interest in the intricacies of encodings (quite a large percentage of 
content authors and developers) as well as those who are reasonably 
technical, and that's one reason for this organization of material. 
Another is that people typically want as quick an answer as possible 
when looking things up, with the opportunity to delve deeper only if 
needed. So we layer and signpost the information in an attempt to enable 
that.

I also used some of the text you wrote for the 'Avoid these encodings'
section, since it offers some useful explanations that I think are 
worthwhile.

Cheers,
RI

On 14/03/2014 16:13, Zack Weinberg wrote:
> I'd like to suggest that the "Avoid these encodings" section at the
> bottom of the "Choosing and applying a character set" document should
> be merged into the "Choosing an encoding" section at the top of that
> document.  You are saying the same thing in two places but slightly
> differently (leading to confusion), and the "Avoid these encodings"
> section is (IMHO) one of the most important bits of the document - it
> should be up front.
>
> I'd write it like this:
>
> ## Choosing an encoding
>
> Encode new content in UTF-8.  All of the present generation of Web
> standards, servers, clients, and libraries are designed to work best
> with UTF-8, and it allows you to use the same encoding for all of your
> content regardless of language.  If you have a corpus of "legacy"
> content in some other encoding, you are strongly encouraged to convert
> it within your server and send clients UTF-8 anyway.
>
> If it is *impossible* for you to send UTF-8 over the network, you need
> to be aware that many other historical encodings are poorly, or not at
> all, supported by Web clients.  [The Encoding Standard] contains an
> *exhaustive* list of "legacy" character encodings that are supported:
> anything not in the list simply will not work.
>
> Furthermore, UTF-32, UTF-16, JIS_C6226-1983, JIS_X0212-1990,
> HZ-GB-2312, JOHAB (Windows code page 1361), CESU-8, UTF-7, BOCU-1,
> SCSU, ISO-2022 (all varieties), and EBCDIC (all varieties) MUST NOT be
> used.  These encodings are *ASCII-incompatible* -- that is, in these
> encodings, octets with values 00 through 7F (hexadecimal) are not
> always interpreted as Unicode code points U+0000 through U+007F.  This
> has historically been a source of security vulnerabilities.
>
> The Big5 and EUC-JP encodings suffer from interoperability problems
> due to the large number of incompatible variants "in the wild", and
> should be avoided.  ISO-8859-8 ("visually ordered" Hebrew) should also
> be avoided; if UTF-8 cannot be used for Hebrew, use ISO-8859-8-i,
> which like Unicode is "logically ordered".
>
> The "replacement" encoding, listed in the Encoding Standard, is not
> actually an encoding; it is a fallback that maps every octet to U+FFFD
> REPLACEMENT CHARACTER.  Obviously, it is not useful to transmit data
> in this encoding.  The "x-user-defined" encoding is a single-byte
> encoding whose lower half is ASCII and whose upper half is mapped into
> the Unicode Private Use Area.  Like the PUA in general, using this
> encoding on the public Internet is best avoided.
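
As an illustration of two points in the quoted text, here is a minimal
Python sketch.  It is not part of either article.  The function names are
hypothetical, and the first check relies on Python's own codec
implementations, which are assumed to be close to, but are not identical
to, the decoders defined in the Encoding Standard.  The first function
tests the "ASCII-incompatible" definition directly (do octets 00 through
7F decode to U+0000 through U+007F?), and the second shows the
x-user-defined mapping of the upper half into the Private Use Area.

```python
def is_ascii_compatible(codec: str) -> bool:
    """True if octets 0x00-0x7F decode one-to-one to U+0000-U+007F.

    Uses Python's codec named `codec`; an unknown codec or a decoding
    error is treated as "not compatible".
    """
    ascii_octets = bytes(range(0x80))
    expected = "".join(chr(i) for i in range(0x80))
    try:
        return ascii_octets.decode(codec) == expected
    except (UnicodeDecodeError, LookupError):
        return False


def decode_x_user_defined(data: bytes) -> str:
    """Decode bytes the way the x-user-defined encoding is described:
    the lower half is ASCII, the upper half (0x80-0xFF) maps into the
    Private Use Area at U+F780-U+F7FF.
    """
    return "".join(
        chr(b) if b < 0x80 else chr(0xF780 + (b - 0x80)) for b in data
    )


if __name__ == "__main__":
    for name in ("utf-8", "latin-1", "utf-16-le", "utf-32-le", "cp037"):
        print(name, is_ascii_compatible(name))
    print([hex(ord(c)) for c in decode_x_user_defined(b"A\x80\xff")])
```

Here cp037 stands in for "EBCDIC (all varieties)"; it is just one EBCDIC
code page that Python happens to ship.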

Received on Monday, 17 March 2014 14:26:31 UTC