Contradiction wrt Defaulting Characters Sets (HTML 4.01) from David Woolley on 2003-05-08 (www-html@w3.org from May 2003)

From: David Woolley <david@djwhome.demon.co.uk>
Date: Thu, 8 May 2003 20:52:07 +0100 (BST)
To: www-html@w3.org
Message-Id: <200305081952.h48Jq7d06492@djwhome.demon.co.uk>

It's been pointed out to me that the HTML 4.01 specification overrides the
HTTP 1.1 one by not allowing user agents to assume a default transfer
character set.  However, it also allows user agents to use heuristics
to determine the character set.  To me it seems that assuming a default
of ISO 8859-1 is one possible heuristic, so one part is saying that one
absolutely must not do that and another part is saying that one may do
so!
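
To make the tension concrete, here is a rough sketch of the lookup order
as I read HTML 4.01 section 5.2.2.  The function and parameter names are
invented for illustration and come from no real library, and treating a
fixed default as "just another heuristic" is exactly the reading in
question, not something the specification states.

```python
import re

# Hypothetical helper names; only the priority order is taken from
# HTML 4.01 section 5.2.2.
_CHARSET_RE = re.compile(r'charset\s*=\s*["\']?([A-Za-z0-9._:-]+)',
                         re.IGNORECASE)


def charset_from_content_type(value):
    """Return the lower-cased charset parameter of a Content-Type value,
    or None if the value is missing or carries no charset parameter."""
    if not value:
        return None
    match = _CHARSET_RE.search(value)
    return match.group(1).lower() if match else None


def resolve_charset(http_content_type=None, meta_content=None,
                    link_charset=None, heuristic=None):
    """Return a (source, charset) pair.

    Explicit labels are tried from highest to lowest priority: the HTTP
    header, then a META http-equiv declaration, then the charset
    attribute of the linking element.  Only when all are absent is the
    optional heuristic callable consulted.  No default is built in.
    """
    labelled = (
        ("http", charset_from_content_type(http_content_type)),
        ("meta", charset_from_content_type(meta_content)),
        ("link", link_charset.lower() if link_charset else None),
    )
    for source, value in labelled:
        if value:
            return source, value
    if heuristic is not None:
        guess = heuristic()
        if guess:
            return "heuristic", guess
    return "unknown", None
```

Passing heuristic=lambda: "iso-8859-1" gives the same observable result
as a hard-wired ISO 8859-1 default, which is why the two statements look
contradictory to me.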

I understand why HTTP 1.1 is being overridden: the original HTML
was only properly defined for ISO 8859-1, and early browsers weren't
adequately aware of character encodings (either not aware at all, or
aware of a much narrower range of character sets than those in active
use).  As a result, language and OS communities got used to treating the
default as though it were their preferred character set (e.g. Big5 for
the Taiwanese Chinese language community and Windows-1252 for the
Windows OS community).  The next generation of browsers then had to cope
with unlabelled documents that weren't in the official default character
set.

The current wording is an attempt to bring the specification into line
with what browsers were actually doing, although it doesn't even
completely do that, as it doesn't allow them to use heuristics when
they find a numeric character reference purporting to be in the
(illegal) C1 control character range.
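
As an illustration of that last point, the following sketch shows the
kind of recovery browsers apply to a reference such as &#150;.  The
function name and the "lenient" flag are made up for this example, and
returning U+FFFD in the strict case is my assumption about a reasonable
fallback, not behaviour the specification prescribes.

```python
def resolve_numeric_reference(code_point, lenient=True):
    """Map the number from a numeric character reference to a character.

    Code points 0x80-0x9F are C1 controls and are not valid in HTML.
    In lenient mode they are reinterpreted as Windows-1252 bytes, which
    is what authors writing such references usually meant.
    """
    if 0x80 <= code_point <= 0x9F:
        if not lenient:
            # Assumed strict fallback: the Unicode replacement character.
            return "\ufffd"
        try:
            return bytes([code_point]).decode("cp1252")
        except UnicodeDecodeError:
            # A few positions (e.g. 0x81) are unassigned in Windows-1252.
            return "\ufffd"
    return chr(code_point)
```

With this, 150 comes out as an en dash (U+2013) in lenient mode, whereas
a strict reading has nothing sensible to display.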

However, I think it puts a MUST NOT condition on the user agent, when it
really should be saying that authors SHOULD explicitly specify a
character set whenever, in ISO terms, a machine-readable HTML file is
involved in "interchange".

Received on Thursday, 8 May 2003 15:52:50 UTC