Re: some review of HTML 5 charset details w.r.t. W3C Character Model

From: James Graham <jg307@cam.ac.uk>
Date: Thu, 01 Nov 2007 00:18:16 +0000
Message-ID: <47291B48.5050600@cam.ac.uk>
To: duerst@it.aoyama.ac.jp, "public-html@w3.org WG" <public-html@w3.org>

> In 8.2.2.4, I have no idea what's the reason or purpose of point 1,
> which reads "If the new encoding is UTF-16, change it to UTF-8.".
> I suspect some misunderstanding.

AIUI, section 8.2.2.4 is only invoked if a <meta> element with a 
charset="" or content="" attribute is found that specifies an encoding 
different from the one that has been used to parse the file up to this 
point, /and/ the encoding used so far has come from one of the sources 
that provide "tentative" character encodings (in order of precedence):

* a prescan of the file for <meta> elements
* UA knowledge of the page's encoding (e.g. cached from a previous visit)
* chardet-like frequency analysis or some similar method for determining 
the encoding
* the default encoding

In particular, if the encoding is specified at the transport layer or a 
BOM is found, the encoding is "certain" rather than "tentative" and 
section 8.2.2.4 never applies.

Therefore, in order for the encoding of a file that is really UTF-16 to 
be changed to UTF-16 by the steps of 8.2.2.4, one of the sources of 
"tentative" character encodings above would have to report a non-UTF-16 
encoding that is nevertheless close enough to UTF-16 that the UA could 
still correctly interpret the <meta> element declaring the UTF-16 
encoding. AFAIK, this is not possible (but I am not a character encoding 
expert). That leaves two options on encountering a <meta> element 
declaring the file to be UTF-16: ignore it, or assume a typo and replace 
it with UTF-8. The spec currently takes the second of these options.
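To make the logic above concrete, here is a rough sketch (not spec text) of how I read the "change the encoding" steps around a <meta>-declared encoding. The function name, the confidence values, and the return convention are all mine, invented for illustration:

```python
def change_the_encoding(current, confidence, declared):
    """Return the encoding to continue parsing with, or None to mean
    "abandon the parse and restart with the declared encoding"."""
    if confidence == "certain":
        # Transport-layer information or a BOM already settled the
        # question; 8.2.2.4 is never invoked in this case.
        return current
    # Point 1: a <meta>-declared UTF-16 is treated as a typo for UTF-8,
    # since a parser that got this far cannot be reading real UTF-16.
    if declared.lower() in ("utf-16", "utf-16le", "utf-16be"):
        declared = "utf-8"
    if declared.lower() == current.lower():
        # The declaration agrees with the tentative guess: keep going.
        return current
    # Otherwise the declared encoding wins and parsing restarts.
    return None
```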

Actually, the situation is somewhat worse than the above, because the 
<meta> prescan can, in principle, return UTF-16 as the character 
encoding even though the prescan cannot work on a file that is actually 
UTF-16. This is not dealt with in the current spec, but it is a known 
issue [1]. The failure to deal with an inaccurate declaration of UTF-16 
inside a <meta> element has been reported as a bug in html5lib [2], 
which we have fixed by taking UTF-16 to mean UTF-8 when determining the 
encoding from <meta> elements, consistent with the above case in the 
current spec and in line with the behavior of the validator.nu parser 
(FWIW, Gecko and Opera appear to ignore all <meta> elements declaring a 
charset of UTF-16).

[1] 
http://canvex.lazyilluminati.com/misc/cgi/issues.cgi/message/<4B6BAB92-CC5A-4943-A92A-01F99C569761%40iki.fi>
[2] http://code.google.com/p/html5lib/issues/detail?id=55
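The html5lib fix described above can be sketched minimally, under the assumption that any UTF-16 label found by the <meta> prescan must be wrong (the prescan reads ASCII-compatible bytes, which a real UTF-16 file would not give it). The function name is mine, not html5lib's API:

```python
def encoding_from_meta(label):
    """Normalize an encoding label found during the <meta> prescan,
    mapping the impossible UTF-16 flavours to UTF-8."""
    label = label.strip().lower()
    if label in ("utf-16", "utf-16le", "utf-16be"):
        return "utf-8"
    return label
```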

> This brings me to another point: The whole HTML5 spec seems to be written
> with implementers, and implementers only, in mind. This is great to help
> get browser behavior aligned, but it creates an enormous problem: The
> majority of potential users of the spec, namely creators of content, and
> of tools creating content, are completely left out. As an example,
> trying to reverse-engineer how to indicate the character encoding
> inside an HTML5 document from point 4 in 8.2.2.1 is completely impossible
> for content creators, webmasters, and the like.

How to specify a (non-transport-layer) character encoding is covered in 
3.7.5.4 [3]; the whole of section 8.2 is aimed at implementors. That 
said, more internal links could help people who land in the 
implementation instructions when they are after authoring requirements.

[3] http://www.whatwg.org/specs/web-apps/current-work/#charset

-- 
"Mixed up signals
Bullet train
People snuffed out in the brutal rain"
--Conor Oberst
Received on Thursday, 1 November 2007 00:17:12 UTC
