Re: Two new encoding related articles for review from Andrew Cunningham on 2014-03-15 (public-html@w3.org from March 2014)

From: Andrew Cunningham <acunningham@slv.vic.gov.au>
Date: Sat, 15 Mar 2014 16:17:54 +1100
To: Glenn Adams <glenn@skynav.com>
Cc: "HTML WG (public-html@w3.org)" <public-html@w3.org>, W3C Style <www-style@w3.org>, Richard Ishida <ishida@w3.org>, www International <www-international@w3.org>, Zack Weinberg <zackw@panix.com>
Message-ID: <CAOUP6Kk1AMXcig5BYwS1y-sNOWSHxXF7=DG8tEZQwxjL2RKgUQ@mail.gmail.com>

There are times when ASCII-incompatible legacy encodings are the only
workable choice.

Andrew
On 15/03/2014 1:42 PM, "Glenn Adams" <glenn@skynav.com> wrote:

>
>
>
> On Fri, Mar 14, 2014 at 10:13 AM, Zack Weinberg <zackw@panix.com> wrote:
>
>> I'd like to suggest that the "Avoid these encodings" section at the
>> bottom of the "Choosing and applying a character set" document should
>> be merged into the "Choosing an encoding" section at the top of that
>> document.  You are saying the same thing in two places but slightly
>> differently (leading to confusion), and the "Avoid these encodings"
>> section is (IMHO) one of the most important bits of the document - it
>> should be up front.
>>
>> I'd write it like this:
>>
>> ## Choosing an encoding
>>
>> Encode new content in UTF-8.  All of the present generation of Web
>> standards, servers, clients, and libraries are designed to work best
>> with UTF-8, and it allows you to use the same encoding for all of your
>> content regardless of language.  If you have a corpus of "legacy"
>> content in some other encoding, you are strongly encouraged to convert
>> it within your server and send clients UTF-8 anyway.
>>
>> If it is *impossible* for you to send UTF-8 over the network, you need
>> to be aware that many other historical encodings are poorly, or not at
>> all, supported by Web clients.  [The Encoding Standard] contains an
>> *exhaustive* list of "legacy" character encodings that are supported:
>> anything not in the list simply will not work.
>>
>> Furthermore, UTF-32, UTF-16, JIS_C6226-1983, JIS_X0212-1990,
>> HZ-GB-2312, JOHAB (Windows code page 1361), CESU-8, UTF-7, BOCU-1,
>> SCSU, ISO-2022 (all varieties), and EBCDIC (all varieties) MUST NOT be
>> used.  These encodings are *ASCII-incompatible* -- that is, in these
>> encodings, octets with values 00 through 7F (hexadecimal) are not
>> always interpreted as Unicode code points U+0000 through U+007F.  This
>> has historically been a source of security vulnerabilities.
>>
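As an illustration of the ASCII-incompatibility point quoted above, here is a
minimal Python sketch. It assumes Python's built-in `utf-7` and `utf-16-le`
codecs as stand-ins for two of the listed encodings. The helper name
`is_ascii_transparent` is made up for this example, and it only checks one
byte string at a time, so treat it as a demonstration rather than a general
test of an encoding.

```python
def is_ascii_transparent(octets: bytes, encoding: str) -> bool:
    """Check whether these 00-7F octets decode to U+0000-U+007F one for one.

    Hypothetical helper for illustration: it compares the result of decoding
    with `encoding` against a plain ASCII decode of the same octets.
    """
    if any(b > 0x7F for b in octets):
        raise ValueError("only octets in the range 00-7F are meaningful here")
    try:
        return octets.decode(encoding) == octets.decode("ascii")
    except UnicodeDecodeError:
        # The octets are not even valid in this encoding.
        return False


# Every octet below is in the ASCII range.
payload = b"+ADw-script+AD4-"

as_ascii = payload.decode("ascii")  # the literal characters "+ADw-script+AD4-"
as_utf8 = payload.decode("utf-8")   # identical to the ASCII reading
as_utf7 = payload.decode("utf-7")   # "<script>"

# Two ASCII-range octets become a single non-ASCII code point in UTF-16.
as_utf16 = b"AB".decode("utf-16-le")  # U+4241

print(is_ascii_transparent(payload, "utf-8"))    # True
print(is_ascii_transparent(payload, "utf-7"))    # False
print(is_ascii_transparent(b"AB", "utf-16-le"))  # False
```

The UTF-7 case shows the general shape of the security problem: a filter that
reads the octets as ASCII sees no `<`, while a client that decodes them as
UTF-7 sees markup.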
>
> It seems strange for a guideline to say "MUST NOT". I would suggest SHOULD
> NOT is more appropriate. In any case, we shouldn't be in the business of
> telling content authors what they can or can't do. If they want to use an
> encoding that isn't well supported, then the risk is theirs.
>
>
>>
>> The Big5 and EUC-JP encodings suffer from interoperability problems
>> due to the large number of incompatible variants "in the wild", and
>> should be avoided.  ISO-8859-8 ("visually ordered" Hebrew) should also
>> be avoided; if UTF-8 cannot be used for Hebrew, use ISO-8859-8-i,
>> which like Unicode is "logically ordered".
>>
>> The "replacement" encoding, listed in the Encoding Standard, is not
>> actually an encoding; it is a fallback that maps every octet to U+FFFD
>> REPLACEMENT CHARACTER.  Obviously, it is not useful to transmit data
>> in this encoding.  The "x-user-defined" encoding is a single-byte
>> encoding whose lower half is ASCII and whose upper half is mapped into
>> the Unicode Private Use Area.  Like the PUA in general, using this
>> encoding on the public Internet is best avoided.
>>
>> ---
>>
>> The other document ("Declaring character encodings in CSS") looks good
>> to me, except for one technical point that needs clarifying: If there
>> is a byte order mark, that means the '@' in '@charset' is not the
>> first byte of the stylesheet, and therefore the @charset directive is
>> ineffective.  (Unless the bit about IE 10 and 11 means that they skip
>> the BOM when looking for @charset?)
>>
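The byte-order-mark question quoted above can be made concrete with a small
Python sketch. It assumes a deliberately naive sniffer that only recognizes an
`@charset` rule starting at byte 0; the function name `sniff_charset_rule` and
the sample stylesheet are hypothetical, and this is not a description of what
any particular browser does. Whether a client strips the BOM before looking is
exactly the open question.

```python
import codecs
import re
from typing import Optional

# Byte pattern for a leading rule of the form: @charset "<name>";
# The name is restricted to ASCII-range octets other than the double quote.
CHARSET_RULE = re.compile(rb'@charset "([\x00-\x21\x23-\x7f]*)";')


def sniff_charset_rule(stylesheet: bytes) -> Optional[str]:
    """Return the declared encoding name only if the rule starts at byte 0."""
    match = CHARSET_RULE.match(stylesheet)  # match() anchors at the first byte
    return match.group(1).decode("ascii") if match else None


css = b'@charset "utf-8";\nbody { color: black }'
css_with_bom = codecs.BOM_UTF8 + css  # EF BB BF now precedes the '@'

print(sniff_charset_rule(css))           # utf-8
print(sniff_charset_rule(css_with_bom))  # None
```

With this naive check the declaration is found in the first stylesheet and
missed in the second, because the three BOM octets push the '@' to offset 3.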
>> zw
>>
>>
>> On Fri, Mar 7, 2014 at 7:49 AM, Richard Ishida <ishida@w3.org> wrote:
>> > Following on from the revision of the i18n article about encoding
>> > declarations in HTML (that review period ends today), I have revised and
>> > updated two further articles:
>> >
>> > Choosing & applying a character encoding
>> > http://www.w3.org/International/questions/qa-choosing-encodings-new
>> >
>> > Declaring character encodings in CSS
>> > http://www.w3.org/International/questions/qa-css-charset-new
>> >
>> > Please take a look and send any comments to www-international@w3.org
>> > before 14th March.
>> >
>> > Thanks,
>> > RI
>> >
>> >
>> >
>> > On 28/02/2014 14:20, Richard Ishida wrote:
>> >> An updated version of Declaring character encodings in HTML[1] is out
>> >> for review at
>> >>
>> >> http://www.w3.org/International/questions/qa-html-encoding-declarations-new
>> >>
>> >> We are looking for comments before 7 March. Please send comments to
>> >> www-international@w3.org.
>> >>
>> >> After the review period is over, this content will be copied to the
>> >> same location as the current version of the document, ie.
>> >>
>> >> http://www.w3.org/International/questions/qa-html-encoding-declarations
>> >>
>> >> and the URL of the updated version will cease to exist.
>> >>
>> >> The update brings the article in line with recent developments in
>> >> HTML5, and de-emphasizes information about legacy formats.
>> >>
>> >> An attempt was also made to organize the material so that readers can
>> >> find information more quickly, and also de-clutter the essential
>> >> information by moving edge topics, such as UTF-16 and charset links,
>> >> down the page. This led to the article being almost completely
>> >> rewritten.
>> >>
>> >>
>> >>
>> >> RI
>> >
>>
>>
>
Received on Saturday, 15 March 2014 05:18:26 UTC