Re: Two new encoding related articles for review from Glenn Adams on 2014-03-15 (www-style@w3.org from March 2014)

From: Glenn Adams <glenn@skynav.com>
Date: Fri, 14 Mar 2014 20:42:16 -0600
To: Zack Weinberg <zackw@panix.com>
Cc: Richard Ishida <ishida@w3.org>, www International <www-international@w3.org>, W3C Style <www-style@w3.org>, "HTML WG (public-html@w3.org)" <public-html@w3.org>
Message-ID: <CACQ=j+dVrOM-XA_exuEGVUsnjg7jDSbKAR3iXtbUePL1S715Wg@mail.gmail.com>
On Fri, Mar 14, 2014 at 10:13 AM, Zack Weinberg <zackw@panix.com> wrote:

> I'd like to suggest that the "Avoid these encodings" section at the
> bottom of the "Choosing and applying a character set" document should
> be merged into the "Choosing an encoding" section at the top of that
> document.  You are saying the same thing in two places but slightly
> differently (leading to confusion), and the "Avoid these encodings"
> section is (IMHO) one of the most important bits of the document - it
> should be up front.
>
> I'd write it like this:
>
> ## Choosing an encoding
>
> Encode new content in UTF-8.  All of the present generation of Web
> standards, servers, clients, and libraries are designed to work best
> with UTF-8, and it allows you to use the same encoding for all of your
> content regardless of language.  If you have a corpus of "legacy"
> content in some other encoding, you are strongly encouraged to convert
> it within your server and send clients UTF-8 anyway.
>
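
As an illustration of the "convert it within your server" advice above, here is a minimal sketch of transcoding legacy bytes to UTF-8 before sending them. The helper name `transcode_to_utf8` and the choice of windows-1252 as the legacy encoding are assumptions made for the example, not something the articles prescribe.

```python
def transcode_to_utf8(raw: bytes, legacy_encoding: str) -> bytes:
    """Decode bytes stored in a legacy encoding and re-encode them as UTF-8.

    Hypothetical helper: the caller must already know the legacy encoding.
    Invalid input raises UnicodeDecodeError rather than being silently altered.
    """
    text = raw.decode(legacy_encoding)
    return text.encode("utf-8")


# Example: "café" as it would be stored in windows-1252 (assumed legacy encoding).
legacy_bytes = "caf\u00e9".encode("windows-1252")
utf8_bytes = transcode_to_utf8(legacy_bytes, "windows-1252")
```

The response would then be labelled as UTF-8 (for example in the Content-Type header), so the declared and actual encodings agree.
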
> If it is *impossible* for you to send UTF-8 over the network, you need
> to be aware that many other historical encodings are poorly, or not at
> all, supported by Web clients.  [The Encoding Standard] contains an
> *exhaustive* list of "legacy" character encodings that are supported:
> anything not in the list simply will not work.
>
> Furthermore, UTF-32, UTF-16, JIS_C6226-1983, JIS_X0212-1990,
> HZ-GB-2312, JOHAB (Windows code page 1361), CESU-8, UTF-7, BOCU-1,
> SCSU, ISO-2022 (all varieties), and EBCDIC (all varieties) MUST NOT be
> used.  These encodings are *ASCII-incompatible* -- that is, in these
> encodings, octets with values 00 through 7F (hexadecimal) are not
> always interpreted as Unicode code points U+0000 through U+007F.  This
> has historically been a source of security vulnerabilities.
>
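
To make the "ASCII-incompatible" point above concrete, here is a small sketch using Python's built-in codecs. It shows a byte string made only of octets in the 00-7F range that reads as harmless text when treated as ASCII but decodes to markup under UTF-7, plus the UTF-16 form of a single ASCII letter. The sample strings are chosen for illustration only.

```python
# Every octet here is in the range 00-7F.
ascii_looking = b"+ADw-script+AD4-"

# Read as ASCII, the bytes are just punctuation and letters.
as_ascii = ascii_looking.decode("ascii")

# Read as UTF-7, the same bytes contain angle brackets.
as_utf7 = ascii_looking.decode("utf-7")

# In UTF-16 (little-endian, no BOM) an ASCII letter occupies two octets,
# one of which is a NUL byte.
utf16_bytes = "A".encode("utf-16-le")

print(repr(as_ascii))
print(repr(as_utf7))
print(utf16_bytes)
```

A filter that inspects the raw octets for "<" would find none in the first example, yet a client that decodes the content as UTF-7 would see a tag. That mismatch is the kind of problem the quoted paragraph refers to.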

It seems strange for a guideline to say "MUST NOT". I would suggest SHOULD
NOT is more appropriate. In any case, we shouldn't be in the business of
telling content authors what they can or can't do. If they want to use an
encoding that isn't well supported, then the risk is theirs.


>
> The Big5 and EUC-JP encodings suffer from interoperability problems
> due to the large number of incompatible variants "in the wild", and
> should be avoided.  ISO-8859-8 ("visually ordered" Hebrew) should also
> be avoided; if UTF-8 cannot be used for Hebrew, use ISO-8859-8-i,
> which like Unicode is "logically ordered".
>
> The "replacement" encoding, listed in the Encoding Standard, is not
> actually an encoding; it is a fallback that maps every octet to U+FFFD
> REPLACEMENT CHARACTER.  Obviously, it is not useful to transmit data
> in this encoding.  The "x-user-defined" encoding is a single-byte
> encoding whose lower half is ASCII and whose upper half is mapped into
> the Unicode Private Use Area.  Like the PUA in general, using this
> encoding on the public Internet is best avoided.
>
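
The description of "x-user-defined" above can be sketched in a few lines. The function name is hypothetical, and the specific offset used for the upper half (U+F780 through U+F7FF) is my reading of the Encoding Standard; treat it as an assumption and check the specification before relying on it.

```python
def decode_x_user_defined(data: bytes) -> str:
    """Sketch of an x-user-defined decoder.

    Lower half (00-7F): same as ASCII.
    Upper half (80-FF): mapped into the Private Use Area, assumed here to be
    U+F780..U+F7FF per the Encoding Standard.
    """
    chars = []
    for byte in data:
        if byte < 0x80:
            chars.append(chr(byte))
        else:
            chars.append(chr(0xF780 + byte - 0x80))
    return "".join(chars)


decoded = decode_x_user_defined(b"A\x80\xff")
```

Because the upper-half results are private-use code points, they carry no agreed meaning between sender and receiver, which is why the quoted text advises against the encoding on the public Internet.
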
> ---
>
> The other document ("Declaring character encodings in CSS") looks good
> to me, except for one technical point that needs to be clarified: If there
> is a byte order mark, that means the '@' in '@charset' is not the
> first byte of the stylesheet, and therefore the @charset directive is
> ineffective.  (Unless the bit about IE 10 and 11 means that they skip
> the BOM when looking for @charset?)
>
> zw
>
>
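
The BOM question raised above can be modelled with a short sketch. It takes the strict reading described in the quoted text, namely that an @charset rule only counts when it begins at the very first byte of the stylesheet. The function name and regular expression are illustrative assumptions, and the sketch makes no claim about what any particular browser does.

```python
import re

UTF8_BOM = b"\xef\xbb\xbf"

# Strict form: '@charset "', a label without quotes in it, then '";'.
CHARSET_RULE = re.compile(rb'@charset "([^"]*)";')


def charset_from_first_bytes(stylesheet: bytes):
    """Return the @charset label only if the rule starts at byte 0, else None."""
    match = CHARSET_RULE.match(stylesheet)  # match() anchors at the start
    if match is None:
        return None
    return match.group(1).decode("ascii", "replace")


plain = b'@charset "utf-8";\nbody { color: black }'
with_bom = UTF8_BOM + plain

print(charset_from_first_bytes(plain))     # label is found
print(charset_from_first_bytes(with_bom))  # BOM pushes '@' off byte 0
```

Under this reading the rule in the BOM-prefixed stylesheet is never seen, so the BOM itself would have to determine the encoding.
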
> On Fri, Mar 7, 2014 at 7:49 AM, Richard Ishida <ishida@w3.org> wrote:
> > Following on from the revision of the i18n article about encoding
> > declarations in HTML (that review period ends today), I have revised and
> > updated two further articles:
> >
> > Choosing & applying a character encoding
> > http://www.w3.org/International/questions/qa-choosing-encodings-new
> >
> > Declaring character encodings in CSS
> > http://www.w3.org/International/questions/qa-css-charset-new
> >
> > Please take a look and send any comments to www-international@w3.org before
> > 14th March.
> >
> > Thanks,
> > RI
> >
> >
> >
> > On 28/02/2014 14:20, Richard Ishida wrote:
> >> An updated version of Declaring character encodings in HTML[1] is out
> >> for review at
> >>
> >> http://www.w3.org/International/questions/qa-html-encoding-declarations-new
> >>
> >> We are looking for comments before 7 March. Please send comments to
> >> www-international@w3.org.
> >>
> >> After the review period is over, this content will be copied to the same
> >> location as the current version of the document, ie.
> >>
> >> http://www.w3.org/International/questions/qa-html-encoding-declarations
> >>
> >> and the URL of the updated version will cease to exist.
> >>
> >> The update brings the article in line with recent developments in HTML5,
> >> and de-emphasizes information about legacy formats.
> >>
> >> An attempt was also made to organize the material so that readers can
> >> find information more quickly, and also de-clutter the essential
> >> information by moving edge topics, such as UTF-16 and charset links,
> >> down the page. This led to the article being almost completely
> >> rewritten.
> >>
> >>
> >>
> >> RI
> >
>
>
Received on Saturday, 15 March 2014 02:43:05 UTC