[whatwg] Internal character encoding declaration from Henri Sivonen on 2005-08-08 (public-whatwg-archive@w3.org from August 2005)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Mon, 8 Aug 2005 21:42:27 +0300
Message-ID: <9fa714f3d1818281c52125f4620b7240@iki.fi>
Quoting from WA1 draft section 2.2.5.1. Specifying and establishing the 
document's character encoding:

> The meta element may also be used, in HTML only (not in XHTML) to 
> provide UAs with character encoding information for the file. To do 
> this, the meta element must be the first element in the head element,

To cater for implementations that consume the byte stream only once in 
all cases and do not rewind the input and restart the parser upon 
discovering the meta, I think it would be beneficial to additionally 
stipulate that
1. The meta element-based character encoding information declaration is 
expected to work only if the Basic Latin range of characters maps to 
the same bytes as in the US-ASCII encoding.
2. If there is neither external character encoding information nor a 
BOM (see below), there MUST NOT be any non-ASCII bytes in the document byte 
stream before the end of the meta element that declares the character 
encoding. (In practice this would ban unescaped non-ASCII class names 
on the html and body elements and non-ASCII comments at the beginning 
of the document.)
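
To make stipulation 2 concrete, here is a minimal sketch of how a checker 
might test it. The function name and the regular expression are 
hypothetical, and the sketch assumes the first meta tag in the byte 
stream is the one carrying the encoding declaration.

```python
import re

# Hypothetical pattern: the first <meta ...> tag, matched on raw bytes and
# ASCII case-insensitively. Assumes that tag is the encoding declaration.
_META_TAG = re.compile(rb"<meta\b[^>]*>", re.IGNORECASE)


def meta_prefix_is_ascii(document: bytes) -> bool:
    """Return True if every byte up to the end of the first meta tag is ASCII.

    Returns False when no meta tag is found, because then there is no
    internal declaration for the rule to apply to.
    """
    match = _META_TAG.search(document)
    if match is None:
        return False
    prefix = document[:match.end()]
    return all(byte < 0x80 for byte in prefix)
```

Non-ASCII bytes after the meta tag (in the title, for instance) do not 
affect the result; only the prefix that a single-pass consumer would have 
to decode before knowing the encoding is examined.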

> it must have the http-equiv attribute set to the literal value 
> Content-Type,

I think case-insensitivity should be allowed in the string 
"Content-Type", because there is legacy precedent for that and HTTP 
defines header names as case-insensitive.

> and must have the content attribute set to the literal value 
> text/html; charset=

That string should be case-insensitive as well, because HTTP defines it 
as case-insensitive. Also, should zero or more white space characters be 
allowed before ';' and around '=' and should the space after ';' be one 
or more white space characters? HTTP-wise yes, but would it lead to 
real-world incompatibilities? (I have not tested.)
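
For illustration, the lenient matching asked about above could be 
sketched as follows. This is only one possible reading, not a settled 
rule: the function name is hypothetical, the set of characters accepted 
in an encoding name is an assumption, and the name is not checked against 
the IANA registry.

```python
import re

# Assumed lenient grammar: zero or more white space characters before ';'
# and around '=', one or more after ';', everything case-insensitive.
# The character class for the encoding name is a guess for this sketch.
_CONTENT = re.compile(
    r"^text/html\s*;\s+charset\s*=\s*([A-Za-z0-9._:+-]+)$",
    re.IGNORECASE,
)


def charset_from_meta(http_equiv: str, content: str):
    """Return the declared encoding name, or None if nothing matches."""
    # Compare the http-equiv value case-insensitively, as HTTP does for
    # header names.
    if http_equiv.lower() != "content-type":
        return None
    match = _CONTENT.match(content.strip())
    return match.group(1) if match else None
```

With this reading, "TEXT/HTML ;  Charset = UTF-8" yields UTF-8, whereas 
"text/html;charset=utf-8" is rejected because no white space follows the 
semicolon.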

> immediately followed by the character encoding, which must be a valid 
> character encoding name. [IANACHARSET] When the meta element is used 
> in this way, there must be no other attributes set on the element. 
> Other than for giving the document's character encoding in this way, 
> the http-equiv attribute must not be used.
>
> In XHTML, the XML declaration should be used for inline character 
> encoding information.

Excellent.

> Authors should avoid including inline character encoding information. 
> Character encoding information should instead be included at the 
> transport level (e.g. using the HTTP Content-Type header).

I disagree.

For HTML in contemporary UAs, there is no real harm in including the 
character encoding information both on the HTTP level and in the meta 
as long as the information is not contradictory. On the contrary, the 
author-provided internal information is actually useful when end users 
save pages to disk using UAs that do not reserialize with internal 
character encoding information.

With XML, there is a robust method for identifying the character 
encoding internally. When the encoding is explicit, the sniffing is 
also interoperably implemented. (Unfortunately, the BOMless implicit 
case is not; see http://bugzilla.opendarwin.org/show_bug.cgi?id=3809 . 
Gecko used to have the same bug.) RFC 3023's insistence on declaring 
the encoding authoritatively outside the XML byte stream itself is, in 
my opinion, as silly as insisting on declaring the compression method 
of a zip archive authoritatively on the HTTP level instead of using the 
information stored in the file.

The TAG has found "Thus there is no ambiguity when the charset is 
omitted, and the STRONGLY RECOMMENDED injunction [of RFC 3023] to use 
the charset is misplaced for application/xml and for non-text "+xml" 
types." (http://www.w3.org/2001/tag/2004/0430-mime.html#char-encoding).

> For HTML, user agents must use the following algorithm in determining 
> the character encoding of a document:
> 1. If the transport layer specifies an encoding, use that.

Shouldn't there be a BOM-sniffing step here? (UTF-16 and UTF-8 only; 
UTF-32 makes no practical sense for interchange on the Web.)
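
A BOM-sniffing step of the kind suggested here might look like the 
following minimal sketch. The function name and the returned labels are 
hypothetical; only the UTF-8 and UTF-16 BOMs are considered.

```python
import codecs


def sniff_bom(prefix: bytes):
    """Return an encoding label if the byte stream starts with a BOM.

    Only UTF-8 and UTF-16 are recognised. UTF-32 is deliberately ignored,
    so a UTF-32LE stream (FF FE 00 00) would be reported as UTF-16LE.
    """
    if prefix.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "UTF-8"
    if prefix.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "UTF-16BE"
    if prefix.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "UTF-16LE"
    return None  # no BOM; fall through to the next step of the algorithm
```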

> 2. Otherwise, if the user agent can find a meta element that specifies 
> character encoding information (as described above), then use that.

If a conformance checker has not determined the character encoding by 
now, what should it do? Should it report the document as non-conforming 
(my preferred choice)? Should it default to US-ASCII and report any 
non-ASCII bytes as conformance errors? Should it continue to the 
fuzzier steps like browsers would (hopefully not)?

> 3. Otherwise, if the user agent can autodetect the character encoding 
> from applying frequency analysis or other algorithms to the data 
> stream, then use that.
> 4. Otherwise, use an implementation-defined or user-specified default 
> character encoding (ISO-8859-1, windows-1252, and UTF-8 are 
> recommended as defaults, and can in many cases be identified by 
> inspection as they have different ranges of valid bytes).

I think it does not make sense to recommend ISO-8859-1, because 
windows-1252 is always a better guess in practice. In the context of 
HTML, UTF-8 looks like a weird default considering years of precedent 
with the de facto windows-1252 default. (Of course, if the UA is 
willing to examine the entire byte stream before parsing, UTF-8 can be 
detected very reliably.)
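
As a rough sketch of that parenthetical, a UA that buffers the entire 
byte stream could test for UTF-8 by strict decoding and otherwise fall 
back to windows-1252. The function name is hypothetical, and treating 
ASCII-only input as windows-1252 is an assumption made for this example.

```python
def guess_encoding(document: bytes) -> str:
    """Guess between UTF-8 and the de facto windows-1252 default.

    Assumes the whole byte stream is available before parsing starts.
    """
    if document.isascii():
        # No non-ASCII bytes, so there is no evidence either way; both
        # encodings decode such input identically.
        return "windows-1252"
    try:
        # Strict decoding fails on any malformed UTF-8 sequence, which
        # legacy single-byte text almost always contains.
        document.decode("utf-8")
    except UnicodeDecodeError:
        return "windows-1252"
    return "UTF-8"
```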

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/
Received on Monday, 8 August 2005 11:42:27 UTC