Re: Two new encoding related articles for review from Joshua Cranmer on 2014-03-16 (www-style@w3.org from March 2014)

From: Joshua Cranmer <Pidgeot18@verizon.net>
Date: Sun, 16 Mar 2014 17:48:11 -0500
To: www-style@w3.org
Message-id: <53262A2B.30708@verizon.net>

On 3/14/2014 10:15 PM, Zack Weinberg wrote:
> On Fri, Mar 14, 2014 at 10:42 PM, Glenn Adams <glenn@skynav.com> wrote:
>> On Fri, Mar 14, 2014 at 10:13 AM, Zack Weinberg <zackw@panix.com> wrote:
> ...
>>> Furthermore, UTF-32, UTF-16, JIS_C6226-1983, JIS_X0212-1990,
>>> HZ-GB-2312, JOHAB (Windows code page 1361), CESU-8, UTF-7, BOCU-1,
>>> SCSU, ISO-2022 (all varieties), and EBCDIC (all varieties) MUST NOT be
>>> used.  These encodings are *ASCII-incompatible* -- that is, in these
>>> encodings, octets with values 00 through 7F (hexadecimal) are not
>>> always interpreted as Unicode code points U+0000 through U+007F.  This
>>> has historically been a source of security vulnerabilities.
>> It seems strange for a guideline to say "MUST NOT". I would suggest SHOULD
>> NOT is more appropriate. In any case, we shouldn't be in the business of
>> telling content authors what they can or can't do. If they want to use an
>> encoding that isn't well supported, then the risk is theirs.
> You can tell I'm used to writing normative specs, huh?  How's this instead?
>
> "UTF-32, UTF-16, (etcetera) are especially unlikely to work: HTML5 and
> the Encoding Standard forbid Web clients from accepting most of them.
> (These encodings are *ASCII-incompatible* -- octets with values 00
> through 7F (hexadecimal) do not always encode U+0000 through U+007F --
> which has historically been a source of security vulnerabilities.)"
>
Strictly speaking, it's not completely true. UTF-16, HZ-GB-2312, and 
ISO-2022-JP are all permitted by the encoding standard. CESU-8, UTF-7, 
BOCU-1, and SCSU are explicitly prohibited by HTML5 (although email 
clients need to support UTF-7, unfortunately). EBCDIC and UTF-32 are 
"especially discouraged" (to the point that HTML5 doesn't attempt to 
support them, like it does UTF-16 via BOM detection). The JIS_* and 
JOHAB standards are mentioned by neither the encoding standard nor 
HTML5. ISO-2022-CN and ISO-2022-KR are mapped to the replacement 
encoding and so are effectively banned by the encoding standard (AIUI).
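
For readers who want to see what "ASCII-incompatible" means in practice, here is a minimal Python sketch. The helper name decodes_like_ascii and the sample bytes are illustrative assumptions, not taken from HTML5 or the encoding standard; the codec names are the ones Python's standard library uses. It checks whether an ASCII-only byte string decodes to the same text under a given encoding as it does under ASCII.

```python
def decodes_like_ascii(data: bytes, encoding: str) -> bool:
    """Return True if `data` (ASCII-only bytes) decodes under `encoding`
    to the same text it would produce under plain ASCII."""
    try:
        return data.decode(encoding) == data.decode("ascii")
    except UnicodeDecodeError:
        # Bytes that are not even valid in `encoding` are certainly
        # not being read as ASCII.
        return False


# Every octet here is in the range 00..7F, so an ASCII-compatible
# encoding must read it as the literal text "+ADw-script+AD4-".
sample = b"+ADw-script+AD4-"

for name in ("utf-8", "windows-1252", "utf-7", "utf-16-le", "cp037"):
    print(name, decodes_like_ascii(sample, name))

# Under UTF-7 the same octets turn into markup, which is the kind of
# surprise behind the security problems mentioned in the quoted text.
print(sample.decode("utf-7"))
```

UTF-8 and windows-1252 pass the check, while UTF-7, UTF-16 and EBCDIC (cp037 here) do not: a filter that scans the raw octets for "<" sees nothing suspicious, yet a client that decodes them as UTF-7 ends up with "<script>".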

-- 
Beware of bugs in the above code; I have only proved it correct, not tried it. -- Donald E. Knuth

Received on Sunday, 16 March 2014 22:49:03 UTC