Re: Wrong character encoding detect for XHTML

Karl Ove Hufthammer <huftis@bigfoot.com> wrote:

> Testcase:
> <URL: http://home.no.net/huftis/kritikk/false-encoding.html >
> 
> This document is an XHTML 1.1 document with no XML declaration.
> No 'charset' parameter is sent by HTTP, therefore, the document
> uses the character encoding 'UTF-8' (the default for all
> X(HT)ML documents).

No.  You used 'text/html' for your document, so RFC 2854 [1] applies.
The 'text/html' media type registration itself doesn't define
the default value for the charset parameter, and as noted in
"6. Charset default rules" of RFC 2854, RFC 2616 [2] section 3.7.1
defines that "media subtypes of the 'text' type are defined to
have a default charset value of 'ISO-8859-1'" (for better or worse).
Section 5.2.2 of the HTML 4 spec [3] further says that "[i]n practice,
this recommendation has proved useless ... Therefore, user agents must
not assume any default value for the "charset" parameter".

Note that even for 'text/xml', UTF-8 is not the default.  As defined
in section 3.1 of RFC 3023 [4], the default charset value for the
'text/xml' media type is US-ASCII.
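
To make the lookup order concrete, here is a rough Python sketch.  It
is only an illustration, not how any particular user agent behaves:
the names SPEC_DEFAULTS and charset_for are made up for this example,
and the table holds just the two formal defaults cited above.  An
explicit charset parameter always wins; the table is consulted only
when the parameter is absent.

```python
from email.message import Message

# Formal defaults named in the RFCs cited in this message.
# (Illustrative table; SPEC_DEFAULTS is a hypothetical name.)
SPEC_DEFAULTS = {
    "text/html": "ISO-8859-1",  # RFC 2616 section 3.7.1, 'text' subtypes
    "text/xml": "US-ASCII",     # RFC 3023 section 3.1
}


def charset_for(content_type):
    """Return (charset, source) for a Content-Type header value.

    source is "explicit" when the header carries a charset parameter,
    and "default" when the value comes from SPEC_DEFAULTS (it may be
    None for media types the table does not list).
    """
    msg = Message()
    msg["Content-Type"] = content_type
    explicit = msg.get_param("charset")
    if explicit is not None:
        return explicit, "explicit"
    return SPEC_DEFAULTS.get(msg.get_content_type()), "default"
```

Note that UTF-8 never comes out of this function unless the header
says so explicitly.  Also keep in mind that HTML 4 tells user agents
not to rely on the ISO-8859-1 default in practice, so a real user
agent may well ignore the 'text/html' entry.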

Both RFC 2854 and RFC 3023 name UTF-8 as a recommended (not a
default) value, but more importantly, both RFCs *strongly* recommend
adding an explicit charset parameter to avoid confusion.
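
On the sending side, following that recommendation could look like
the sketch below.  Again this is only an assumption-laden example:
ensure_charset is a hypothetical helper, and UTF-8 is used as the
fallback only because both RFCs recommend it.  The charset you
declare must of course match how the document is actually encoded.

```python
from email.message import Message


def ensure_charset(content_type, charset="UTF-8"):
    """Return content_type with a charset parameter.

    An existing charset parameter is left untouched; one is added
    only when the header value has none.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    if msg.get_param("charset") is None:
        msg.set_param("charset", charset)
    return msg["Content-Type"]
```

With this, a bare 'text/html' value gains a charset parameter, while
a value that already declares one is returned unchanged.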

[1] http://www.rfc-editor.org/rfc/rfc2854.txt
[2] http://www.rfc-editor.org/rfc/rfc2616.txt
[3] http://www.w3.org/TR/html4/charset.html#h-5.2.2
[4] http://www.rfc-editor.org/rfc/rfc3023.txt

Regards,
-- 
Masayasu Ishikawa / mimasa@w3.org
W3C - World Wide Web Consortium

Received on Friday, 7 December 2001 07:55:37 UTC