- From: David Woolley <david@djwhome.demon.co.uk>
- Date: Thu, 8 May 2003 20:52:07 +0100 (BST)
- To: www-html@w3.org
It's been pointed out to me that the HTML 4.01 specification overrides the HTTP 1.1 one by not allowing user agents to assume a default transfer character set. However, it also allows user agents to use heuristics to determine the character set. To me it seems that assuming a default of ISO 8859/1 is one possible heuristic, so one part is saying that one absolutely must not do that and another part is saying that one may do so!

I understand why HTTP 1.1 is being overridden: the original HTML was only properly defined for ISO 8859/1, and early browsers weren't adequately character-coding aware (either not aware at all, or aware only of a rather more limited range of character sets than those in active use). As a result, language and OS communities got used to treating the default as though it were their preferred character set (e.g. Big5 for the Taiwanese Chinese community and Windows-1252 for the Windows community). The next generation of browsers then had to cope with unlabelled documents which weren't in the official default character set.

The current wording is an attempt to bring the specification into line with what browsers were actually doing, although it doesn't even completely do that, as it doesn't allow them to use heuristics when they find a numeric entity purporting to be in the (illegal) C1 control character range. However, I think it puts a MUST NOT condition on the user agent, when it really should be saying that authors SHOULD explicitly specify a character set whenever, in ISO terms, a machine-readable HTML file is involved in "interchange".
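By way of illustration, here is a rough sketch (in Python, with a function name of my own invention) of the sort of heuristic I mean for the C1 case: treat decimal character references in the 128-159 range as evidence that the author was really working in Windows-1252, since those code points are illegal as characters but are exactly what escaping the raw byte values of Windows-1252 punctuation produces. Nothing in the specification mandates this; it is only meant to make the kind of inference concrete.

    import re

    NUMERIC_REF = re.compile(r'&#([0-9]+);')

    def looks_like_windows_1252(markup):
        """True if the markup contains decimal character references in the
        C1 control range (128-159), which suggests the author assumed
        Windows-1252 rather than ISO 8859/1 or Unicode."""
        return any(128 <= int(m.group(1)) <= 159
                   for m in NUMERIC_REF.finditer(markup))

    # &#147; and &#148; are the escaped Windows-1252 curly quotes;
    # &#8220; and &#8221; are the correct Unicode references.
    print(looks_like_windows_1252('He said &#147;hello&#148;.'))    # True
    print(looks_like_windows_1252('He said &#8220;hello&#8221;.'))  # False

A real user agent would presumably go further and remap such references to the corresponding Windows-1252 characters, but that is exactly the sort of behaviour the current MUST NOT wording appears to rule out.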
Received on Thursday, 8 May 2003 15:52:50 UTC