Encoding interaction of HTTP response header and META tag from Wayne Pollock on 2011-03-03 (public-html-comments@w3.org from March 2011)

From: Wayne Pollock <pollock@acm.org>
Date: Thu, 03 Mar 2011 15:38:07 -0500
To: public-html-comments@w3.org
Message-ID: <4D6FFC2F.90209@acm.org>
In HTML 4 and 5, there is a glaring, annoying error.  Or at least
it seems that way to me.  (This is related to, but not identical
to, issue 148.)

The standard says (implicitly in 5.2.2, explicitly in 8.2.2.1, and
in 10.2.2.1 from the WHATWG HTML 5 document) that if the web server sets
the character encoding in the HTTP response header, then that is used,
and the rest of the encoding sniffing algorithm, e.g., the BOM and then
the author's META tag of (say):

  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

or the HTML 5 version:

  <meta charset="UTF-8">

is ignored.  However, this is backward.  If document authors
go to the trouble of stating the charset in the HEAD of their document,
then that should override any default set by the web server.

The rationale for this is given in section 5.2.2:

"Some servers examine the first few bytes of the document, or check against a
database of known files and encodings. Many modern servers give Web masters more
control over charset configuration than old servers do. Web masters should use these
mechanisms to send out a "charset" parameter whenever possible, but should take care
not to identify a document with the wrong "charset" parameter value."

Here's why I think this is wrong and should be changed:

In today's world a single website may have multiple web pages written by
multiple authors.  Each could be using a different charset.  A web
server typically has a setting to return the charset in the HTTP
response header, by looking only at the file extension.  This is what
Apache does, for instance.  It is a huge burden to webmasters everywhere
to have to manually set the charset for every update to their website.
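
To illustrate the extension-only behavior described above, here is a
minimal Python sketch.  It is not Apache's implementation; the names
DEFAULT_CHARSET and content_type_for are hypothetical, and ISO-8859-1 is
only an assumed site-wide default.

```python
import mimetypes

# Hypothetical site-wide default; a real server reads this from its config.
DEFAULT_CHARSET = "ISO-8859-1"


def content_type_for(path: str) -> str:
    """Build a Content-Type header value from the file name alone.

    The file's bytes are never inspected, so any META tag inside the
    document has no influence on the charset that gets announced.
    """
    mime, _encoding = mimetypes.guess_type(path)
    mime = mime or "application/octet-stream"
    if mime.startswith("text/"):
        return f"{mime}; charset={DEFAULT_CHARSET}"
    return mime


header = content_type_for("article.html")
```

Every ".html" file gets the same charset here, whatever encoding its
author actually saved it in.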

TO OVERRIDE THE DEFAULT CHARSET RETURNED BY APACHE, A PER-FILE DIRECTIVE MUST
BE USED TO SPECIFY EACH FILE'S CHARSET.  Such overriding is possible, but
to give web authors the ability to do so, per-directory settings must be
enabled (the ".htaccess" files).  Doing so severely impacts server performance
and many sites simply can't do so, so web pages WILL be sent with the wrong
charset.  I'm sure some other web servers are similar and do not sniff the
MIME type or charset.  However, most browsers do.

The alternative is to not have the web server return any such header,
allowing the browser to examine the document for a BOM and then a META tag
that specifies the charset used.

But allowing web page authors to override the (default) charset sent by
a web server with the appropriate META tag seems entirely reasonable to me.

How many times have you visited some web page only to find curly quotes,
bullets, etc., don't render correctly because despite a correct META tag,
the CMS used sent a default value in the HTTP response header?  This is
a problem that need not exist.
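
The garbling described above can be reproduced with a short Python
sketch.  It assumes the page was saved as UTF-8 while the HTTP header
announced ISO-8859-1; the variable names are illustrative only.

```python
# Curly quotes and a bullet, as an author might type them in a UTF-8 page.
text = "\u201cHello\u201d \u2022 world"
utf8_bytes = text.encode("utf-8")

# What the reader gets when the header's charset (assumed ISO-8859-1 here)
# wins over a correct META tag: each UTF-8 byte becomes its own character.
rendered_wrong = utf8_bytes.decode("iso-8859-1")

# What the reader gets when the author's declared charset is honored.
rendered_right = utf8_bytes.decode("utf-8")

print(repr(rendered_wrong))
print(repr(rendered_right))
```

Each curly quote and the bullet occupy three bytes in UTF-8, so under
the wrong charset each one shows up as three unrelated characters.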

Part of the problem seems to be that very early on there was no way for document
authors to include charset info in their documents, so web browsers evolved
to use what the web server said.  When the META tag to allow web authors
to set the charset was added, the rules were written not to break existing
practice.  HTML 5 seems to be following this vicious cycle: browsers follow
the standard, and the standard follows the browser practice.

This should be a simple fix.  The issue was raised on the WHATWG list and
elsewhere, and nobody could think of an objection to this proposal.  (The
only web pages that could "break" with this change were already broken.)

Summary:  Change existing determination of charset by moving step two
  2. If the transport layer specifies an encoding, and it is supported,
     return that encoding with the confidence certain, and abort these steps.
to follow (existing) step 5.
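
The effect of the reordering can be sketched in Python.  This is a
simplified model, not the specification's algorithm: it assumes only
three sources (transport layer, BOM, META tag), skips the "is it
supported" check, and uses a crude regular expression in place of the
real META prescan.  All function and variable names are hypothetical,
and the fallback default is an assumption.

```python
import re
from typing import Callable, List, Optional

# A source takes (charset from HTTP header, document bytes) and may return an encoding.
Source = Callable[[Optional[str], bytes], Optional[str]]

META_RE = re.compile(rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def from_transport(http_charset: Optional[str], body: bytes) -> Optional[str]:
    return http_charset.lower() if http_charset else None


def from_bom(http_charset: Optional[str], body: bytes) -> Optional[str]:
    if body.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if body.startswith(b"\xfe\xff"):
        return "utf-16be"
    if body.startswith(b"\xff\xfe"):
        return "utf-16le"
    return None


def from_meta(http_charset: Optional[str], body: bytes) -> Optional[str]:
    # Only look near the start of the document.
    match = META_RE.search(body[:1024])
    return match.group(1).decode("ascii").lower() if match else None


CURRENT_ORDER: List[Source] = [from_transport, from_bom, from_meta]
PROPOSED_ORDER: List[Source] = [from_bom, from_meta, from_transport]


def determine(order: List[Source], http_charset: Optional[str], body: bytes,
              default: str = "windows-1252") -> str:
    """Return the first encoding any source yields, else an assumed default."""
    for source in order:
        encoding = source(http_charset, body)
        if encoding:
            return encoding
    return default


page = b'<html><head><meta charset="UTF-8"></head><body>...</body></html>'
current = determine(CURRENT_ORDER, "ISO-8859-1", page)
proposed = determine(PROPOSED_ORDER, "ISO-8859-1", page)
```

With the current order the header's ISO-8859-1 wins; with the proposed
order the author's META tag wins, and the header still applies to any
page that declares nothing itself.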

===================================================================

On a related note, the new structural tags that denote articles and such
should allow an optional CHARSET attribute.  A web page with ARTICLEs
etc. may be (and may likely be) composed of content from many sources,
e.g., a "mash-up".  While CMS and blogging software could force a single
charset so there is only one charset per web page, that seems an
unnecessary restriction (and I don't know that most blogging software
works that way).

From what I know of how modern web browsers work, it would be relatively
easy to allow.  However, I know this one isn't an oversight (like my main
point above) but a deliberate decision; the A tag's CHARSET attribute has
been deprecated.  However, a CHARSET attribute was *added* to the SCRIPT tag,
so what is the rationale for not allowing it on different parts of a document
that clearly are intended to represent content from different sources?

I believe a CHARSET attribute should be allowed on some block level tags,
including DIV, ARTICLE, and possibly SECTION, IFRAME, and INPUT.  If my main
suggestion is followed, there is probably no need for a CHARSET attribute on
an A tag.

-- 
Wayne Pollock
Received on Thursday, 3 March 2011 20:38:36 UTC