Feedback on http://www.w3.org/International/questions/qa-html-encoding-declarations-new from Henri Sivonen on 2014-02-28 (www-international@w3.org from January to March 2014)

From: Henri Sivonen <hsivonen@hsivonen.fi>
Date: Fri, 28 Feb 2014 17:03:45 +0200
To: "www-international@w3.org" <www-international@w3.org>
Message-ID: <CANXqsRL1agGeShtEC7Knia=3vLCpi-XGP59x13Qjy0trLkynsQ@mail.gmail.com>

As written, the Quick Answer is misleading if you only read that part
and skip the Details. The Quick Answer says "If you have access to the
server settings, you should also consider whether it makes sense to
use the HTTP header." Instead, it should emphasize that HTTP overrides
<meta>, so if you don't have access to the server settings and the
server is sending a charset parameter in the Content-Type header, the
Quick Answer won't work for you.
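
To make the precedence concrete, here is a minimal sketch of how a consumer might pick the winning declaration. The function name and regular expressions are illustrative assumptions, not the HTML specification's encoding sniffing algorithm, and the sketch ignores byte order marks and comments.

```python
import re

# Hypothetical helpers for illustration only; real user agents follow the
# HTML specification's encoding sniffing algorithm, which is more involved.
CHARSET_PARAM = re.compile(r'charset\s*=\s*["\']?([^\s;"\']+)', re.IGNORECASE)
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)

def declared_encoding(content_type, body):
    """Return (label, source) for the encoding declaration that wins."""
    # A charset parameter in the HTTP Content-Type header is checked first,
    # so it overrides anything declared inside the document.
    if content_type:
        m = CHARSET_PARAM.search(content_type)
        if m:
            return m.group(1).lower(), "http"
    # Only without an HTTP-level charset does <meta> get consulted
    # (looked for in the first 1024 bytes of the document).
    m = META_CHARSET.search(body[:1024])
    if m:
        return m.group(1).decode("ascii").lower(), "meta"
    return None, "none"

page = b'<!DOCTYPE html><meta charset="utf-8"><title>Test</title>'
print(declared_encoding("text/html; charset=ISO-8859-1", page))
print(declared_encoding("text/html", page))
```

In the first call the server's charset parameter wins even though the page declares UTF-8 in <meta>, which is the situation an author without access to the server settings cannot fix from inside the document.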

The document links to http://www.w3.org/International/O-HTTP-charset
which doesn't cover nginx configuration. nginx behavior is worth
mentioning, since nginx configuration is a bit surprising: You have to
use the charset directive and can't use add_header, because the latter
appends *another* Content-Type header and, therefore, must not be used
to attempt to refine headers that nginx already adds by other means.
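
One way to notice this misconfiguration is to check a response for more than one Content-Type field. The sketch below is only an illustration: the function name and the header list are hypothetical and stand in for whatever an HTTP client returns.

```python
def content_type_fields(headers):
    """Collect every Content-Type value from a list of (name, value) pairs."""
    # Header field names are case-insensitive, so compare in lower case.
    return [value for name, value in headers if name.lower() == "content-type"]

# Hypothetical response headers (made up for this example) from a server
# where a second Content-Type was appended on top of the generated one.
headers = [
    ("Server", "nginx"),
    ("Content-Type", "text/html"),
    ("Content-Type", "text/html; charset=utf-8"),
]

values = content_type_fields(headers)
if len(values) > 1:
    print("duplicate Content-Type fields:", values)
```

A response configured through the charset directive would be expected to carry a single Content-Type field that already includes the charset parameter.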

Back to qa-html-encoding-declarations-new:
The document says: "Intermediate servers that transcode the data (ie.
convert to a different encoding) sometimes take advantage of this to
change the encoding of a document before sending it on to small
devices that only recognize a few encodings. Because the HTTP header
information has precedence over any in-document declaration,
transcoders typically do not change the internal encoding
declarations, just the document encoding and the declaration in the
HTTP headers."

Is there documented proof that that's actually true?

"User agents can easily find the character encoding information when
it is sent in the HTTP header."

I suggest saying that they find it sooner. Any non-bogus user agent
has to be able to handle the level of difficulty of finding it in
<meta>.

I think the section "Working with polyglot and XML formats", if
retained at all, should go under "Obscure details you should not need
to know".

Please delete "It is possible to invent your own encoding names
preceded by x-, but this is not usually a good idea since it limits
interoperability." It has no relevance to authoring documents that
will be viewed in Web browsers.

The section "The charset attribute on a link" fails to mention that if
browsers supported the attribute (without special additional rules),
it would be an XSS attack vector, which is a good reason not to
support it.
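
A small sketch of why a linker-controlled encoding is dangerous. UTF-7 is my choice of illustration here (the point above does not name an encoding); it was historically used for this kind of attack. The payload is a made-up example.

```python
# Bytes that a filter assuming ASCII or UTF-8 would treat as harmless text:
# they contain no "<" or ">" characters at all.
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

# Interpreted with the encoding the target site intended.
as_utf8 = payload.decode("utf-8")

# Interpreted with an encoding chosen by whoever wrote the link.
as_utf7 = payload.decode("utf-7")

print("<" in as_utf8)
print(as_utf7)
```

The same bytes pass a markup filter under one encoding and become a script element under another, so letting the linking page choose the encoding would hand that choice to an attacker.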

The document also links to
http://www.w3.org/International/questions/qa-choosing-encodings .
While that document correctly advises against the use of ISO-2022-*,
HZ, etc., it fails to warn about interoperability problems among
EUC-JP implementations on the one hand and among Big5 implementations
on the other. I.e. authors are safer also avoiding EUC-JP and Big5
(including and especially Big5-HKSCS).

-- 
Henri Sivonen
hsivonen@hsivonen.fi
https://hsivonen.fi/

Received on Friday, 28 February 2014 15:04:14 UTC