Re: Encoding: Referring people to a list of labels from Henri Sivonen on 2014-01-27 (www-international@w3.org from January to March 2014)

From: Henri Sivonen <hsivonen@hsivonen.fi>
Date: Mon, 27 Jan 2014 11:05:37 +0200
To: "www-international@w3.org" <www-international@w3.org>
Message-ID: <CANXqsR+myEQ5St3sRnoDkTopxnq27D-_dKPGpTNSHhc-OgJfSQ@mail.gmail.com>

On Fri, Jan 24, 2014 at 3:55 PM, Richard Ishida <ishida@w3.org> wrote:
> I'm thinking that we should be pointing them to the Encoding spec, rather
> than the IANA list.

Good idea.

> We could point at http://encoding.spec.whatwg.org/#concept-encoding-get,
> although there are two issues with that:
>
> 1. that table isn't really intended to provide a list of labels people
> should use, it maps labels to encodings
>
> 2. the most commonly used label for an encoding, where there are more than
> one per encoding, is generally not at the top of the list (although it is
> used for the name of the encoding).

3. The encodings don't have equal status:

 * Apart from UTF-8, GB18030 is the only other encoding that can be
used for form submissions without data loss.

 * x-user-defined must not be used except in overrideMimeType() in XHR
in browser versions that don't support obtaining the response bytes as
an ArrayBuffer. (Publishers who use intentionally mis-encoded fonts
with @font-face, which of course no one should do, are better off
declaring windows-1252 even if that means they are polluting search
data for everyone else.)

 * The labels that map to the replacement encoding must not be used
and it makes no sense to use them.

 * UTF-16BE, UTF-16LE (including the UTF-16 label), HZ-GB-2312 and
ISO-2022-JP are dangerous and authors should expect browser vendors to
take varying levels of countermeasures against these, which makes it
a bad idea to use these. (In particular, if telemetry data permits, I
intend to map HZ-GB-2312, which is *really* scary, to the replacement
encoding in Gecko.) Even if browsers don't take countermeasures, it's
still a bad idea to use these, because they are dangerous (especially
for encoding user-supplied content). Consider these as "must not use".

 * The implementation status of Big5 is sad. Would-be users of Big5
should migrate to UTF-8 even more hastily than users of the other
legacy encodings.

 * There are interoperability issues with the parts of EUC-JP that an
Encoding Standard-compliant *encoder* never outputs. Would-be users of
EUC-JP should migrate to UTF-8 even more hastily than the users of
other legacy encodings.

 * One shouldn't expect the current state of the Encoding Standard to
be the last word on ibm866, x-mac-cyrillic and koi8-u. Don't use them.

 * Don't use iso-8859-8 (Visual Hebrew). Support may be going away in
the future. Always use the logical order for Hebrew.
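
Two of the points above can be illustrated with a minimal Python
sketch. It assumes Python's built-in codecs as a stand-in for the
Encoding Standard's encoders (they differ in details), and the sample
string and helper names are made up for illustration. The first helper
checks whether text survives an encode/decode round trip, which is the
"without data loss" property claimed for UTF-8 and GB18030. The second
shows one commonly cited reason (an assumption here, not spelled out
in the list above) why UTF-16 is risky for user-supplied content:
markup encoded as UTF-16 is invisible to a filter that scans for ASCII
bytes.

```python
# Hypothetical sample mixing Latin, CJK, emoji and Hebrew text.
SAMPLE = "na\u00efve \u4e2d\u6587 \U0001F600 \u05e9\u05dc\u05d5\u05dd"


def survives_round_trip(text: str, encoding: str) -> bool:
    """Return True if text encodes and decodes back unchanged."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        # The encoding has no representation for some character.
        return False


def hidden_from_ascii_filter(markup: str, encoding: str) -> bool:
    """Return True if the ASCII bytes of markup do not appear in its
    encoded form, so a byte-level ASCII scan would miss it."""
    return markup.encode("ascii") not in markup.encode(encoding)


if __name__ == "__main__":
    for enc in ("utf-8", "gb18030", "windows-1252", "euc-jp", "big5"):
        print(enc, survives_round_trip(SAMPLE, enc))
    print("utf-16-le hides <script>:",
          hidden_from_ascii_filter("<script>", "utf-16-le"))
```

In this sketch an unencodable character raises an error. In a real
form submission the browser does not fail but substitutes something
else for the character, so the loss is silent.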

So, really, people should only use one encoding, UTF-8, and the list
of labels they should use should have one item only: "UTF-8".

-- 
Henri Sivonen
hsivonen@hsivonen.fi
https://hsivonen.fi/

Received on Monday, 27 January 2014 09:06:08 UTC