- From: Zack Weinberg <zackw@panix.com>
- Date: Fri, 14 Mar 2014 12:13:14 -0400
- To: Richard Ishida <ishida@w3.org>
- Cc: www International <www-international@w3.org>, W3C Style <www-style@w3.org>, "HTML WG (public-html@w3.org)" <public-html@w3.org>

I'd like to suggest that the "Avoid these encodings" section at the
bottom of the "Choosing and applying a character set" document should
be merged into the "Choosing an encoding" section at the top of that
document. You are saying the same thing in two places but slightly
differently (leading to confusion), and the "Avoid these encodings"
section is (IMHO) one of the most important bits of the document - it
should be up front.

I'd write it like this:

## Choosing an encoding

Encode new content in UTF-8. All of the present generation of Web
standards, servers, clients, and libraries are designed to work best
with UTF-8, and it allows you to use the same encoding for all of your
content regardless of language. If you have a corpus of "legacy"
content in some other encoding, you are strongly encouraged to convert
it within your server and send clients UTF-8 anyway.

If it is *impossible* for you to send UTF-8 over the network, you need
to be aware that many other historical encodings are poorly, or not at
all, supported by Web clients. [The Encoding Standard] contains an
*exhaustive* list of "legacy" character encodings that are supported:
anything not in the list simply will not work.

Furthermore, UTF-32, UTF-16, JIS_C6226-1983, JIS_X0212-1990,
HZ-GB-2312, JOHAB (Windows code page 1361), CESU-8, UTF-7, BOCU-1,
SCSU, ISO-2022 (all varieties), and EBCDIC (all varieties) MUST NOT be
used. These encodings are *ASCII-incompatible* -- that is, in these
encodings, octets with values 00 through 7F (hexadecimal) are not
always interpreted as Unicode code points U+0000 through U+007F. This
has historically been a source of security vulnerabilities.

The Big5 and EUC-JP encodings suffer from interoperability problems
due to the large number of incompatible variants "in the wild", and
should be avoided.

ISO-8859-8 ("visually ordered" Hebrew) should also be avoided; if
UTF-8 cannot be used for Hebrew, use ISO-8859-8-i, which like Unicode
is "logically ordered".
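
To make the ASCII-incompatibility point concrete, here is a rough
Python sketch (an illustration only, not part of the proposed text).
The helper name `is_ascii_compatible` is hypothetical, and checking
each octet in isolation is a simplification: stateful encodings such
as the ISO-2022 family would need whole sequences to be checked.

```python
# Sketch only: is_ascii_compatible is a hypothetical helper, not something
# defined by the Encoding Standard. It tests each octet on its own, which
# is a simplification for stateful encodings.

def is_ascii_compatible(encoding: str) -> bool:
    """True if every octet 0x00-0x7F, decoded alone, gives the same code point."""
    for octet in range(0x80):
        try:
            decoded = bytes([octet]).decode(encoding)
        except UnicodeDecodeError:
            # The octet is not even a complete character in this encoding.
            return False
        if decoded != chr(octet):
            return False
    return True


# A UTF-7 byte string that contains no "<" or ">" octets at all.
payload = b"+ADw-script+AD4-"

if __name__ == "__main__":
    for name in ("utf-8", "cp1252", "utf-16", "utf-32", "utf-7", "cp037"):
        print(name, is_ascii_compatible(name))
    # A filter scanning the raw octets for "<" finds nothing here,
    # but a client that decodes the data as UTF-7 sees markup.
    print(payload.decode("utf-7"))
```

The payload decodes to `<script>` under UTF-7 even though a byte-level
scan for `<` comes up empty, which is the kind of mismatch behind the
vulnerabilities mentioned above.
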

The "replacement" encoding, listed in the Encoding Standard, is not
actually an encoding; it is a fallback that maps every octet to U+FFFD
REPLACEMENT CHARACTER. Obviously, it is not useful to transmit data in
this encoding.

The "x-user-defined" encoding is a single-byte encoding whose lower
half is ASCII and whose upper half is mapped into the Unicode Private
Use Area. Like the PUA in general, using this encoding on the public
Internet is best avoided.

---

The other document ("Declaring character encodings in CSS") looks good
to me, except for one technical point that needs to be clarified: If
there is a byte order mark, that means the '@' in '@charset' is not the
first byte of the stylesheet, and therefore the @charset directive is
ineffective. (Unless the bit about IE 10 and 11 means that they skip
the BOM when looking for @charset?)

zw

On Fri, Mar 7, 2014 at 7:49 AM, Richard Ishida <ishida@w3.org> wrote:
> Following on from the revision of the i18n article about encoding
> declarations in HTML (that review period ends today), I have revised and
> updated two further articles:
>
> Choosing & applying a character encoding
> http://www.w3.org/International/questions/qa-choosing-encodings-new
>
> Declaring character encodings in CSS
> http://www.w3.org/International/questions/qa-css-charset-new
>
> Please take a look and send any comments to www-international@w3.org before
> 14th March.
>
> Thanks,
> RI
>
>
>
> On 28/02/2014 14:20, Richard Ishida wrote:
>> An updated version of Declaring character encodings in HTML[1] is out
>> for review at
>>
>>
>> http://www.w3.org/International/questions/qa-html-encoding-declarations-new
>>
>> We are looking for comments before 7 March. Please send comments to
>> www-international@w3.org.
>>
>> After the review period is over, this content will be copied to the same
>> location as the current version of the document, ie.
>>
>> http://www.w3.org/International/questions/qa-html-encoding-declarations
>>
>> and the URL of the updated version will cease to exist.
>>
>> The update brings the article in line with recent developments in HTML5,
>> and de-emphasizes information about legacy formats.
>>
>> An attempt was also made to organize the material so that readers can
>> find information more quickly, and also de-clutter the essential
>> information by moving edge topics, such as UTF-16 and charset links,
>> down the page. This led to the article being almost completely rewritten.
>>
>>
>>
>> RI
>
Received on Friday, 14 March 2014 16:13:54 UTC