Re: some review of HTML 5 charset details w.r.t. W3C Character Model

> In, I have no idea what's the reason or purpose of point 1,
> which reads "If the new encoding is UTF-16, change it to UTF-8.".
> I suspect some misunderstanding.

AIUI, the section is only invoked if a <meta charset=""> or 
<meta content=""> attribute is found that specifies an encoding that is 
different from the encoding that has been used to parse the file to this 
point /and/ the encoding that has been used thus far has come from one 
of the sources which provide "tentative" character encodings (in order):

* a prescan of the file for <meta> elements
* UA knowledge of the page's encoding (e.g. cached from a previous visit)
* chardet-like frequency analysis or some similar method for determining 
the encoding
* the default encoding

In particular, if the encoding is specified at the transport layer or a 
BOM is found, the encoding is "certain" rather than "tentative" and the 
section never applies.
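
As a rough sketch of the gating described above (all names here are made 
up for illustration; they are not taken from the spec or from any 
implementation), the check for whether the section is invoked at all 
might look like this:

```python
from enum import Enum


class Confidence(Enum):
    TENTATIVE = "tentative"
    CERTAIN = "certain"


# Illustrative labels for the encoding sources discussed in this mail.
TENTATIVE_SOURCES = (
    "meta_prescan",        # prescan of the file for <meta> elements
    "ua_knowledge",        # e.g. cached from a previous visit
    "frequency_analysis",  # chardet-like detection
    "default",             # the default encoding
)
CERTAIN_SOURCES = ("transport_layer", "bom")


def confidence_for(source):
    """Map an encoding source to its confidence level."""
    if source in CERTAIN_SOURCES:
        return Confidence.CERTAIN
    if source in TENTATIVE_SOURCES:
        return Confidence.TENTATIVE
    raise ValueError("unknown encoding source: %r" % source)


def section_applies(current_encoding, source, declared_encoding):
    """Return True if a <meta> declaration found while parsing can
    trigger the change-the-encoding steps."""
    if confidence_for(source) is Confidence.CERTAIN:
        return False  # transport layer or BOM always wins
    # Only a declaration that differs from the encoding in use matters.
    return declared_encoding.lower() != current_encoding.lower()
```

For example, section_applies("windows-1252", "default", "utf-8") is true, 
whereas any call with a source of "bom" or "transport_layer" is false.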

Therefore, for the steps of that section to change the encoding of a 
file that is really UTF-16 to UTF-16, the sources of "tentative" 
character encodings above would have to report a non-UTF-16 encoding 
that is sufficiently close to UTF-16 that the UA could still correctly 
interpret the <meta> element declaring the UTF-16 encoding. AFAIK, this 
is not possible (but I am not a character encoding expert). This leaves 
two options on encountering a <meta> tag declaring the file to be 
UTF-16: ignore it, or assume a typo and replace it with UTF-8. The spec 
currently takes the second of these options.
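
To illustrate why I think this cannot happen, here is a small 
demonstration. It assumes a deliberately naive scanner that decodes the 
bytes with an ASCII-compatible encoding (latin-1 is only a stand-in) and 
searches for "<meta"; a real tokenizer is more involved, so treat this 
as a sketch rather than a proof:

```python
DECLARATION = '<meta charset="utf-16">'


def ascii_scan_finds_meta(raw):
    """Decode raw bytes with an ASCII-compatible encoding and report
    whether the text "<meta" is visible in the result."""
    return "<meta" in raw.decode("latin-1").lower()


for codec in ("utf-8", "utf-16-le", "utf-16-be"):
    raw = DECLARATION.encode(codec)
    # In UTF-16 every ASCII character is stored as two bytes, one of
    # which is NUL, so the characters of "<meta" are never adjacent.
    print(codec, raw[:8], ascii_scan_finds_meta(raw))
```

The scan succeeds for the UTF-8 bytes and fails for both UTF-16 byte 
orders, because the interleaved NUL bytes break up the tag name.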

Actually the situation is somewhat worse than above because the <meta> 
prescan can, in principle, return UTF-16 as the character encoding even 
though it is not possible for the prescan to work on a file that is 
actually UTF-16. This is not dealt with in the current spec, but it is a 
known issue [1]. The failure to deal with an inaccurate declaration of 
UTF-16 inside a <meta> element has been reported as a bug in html5lib 
[2], which we have fixed by taking UTF-16 to mean UTF-8 when determining 
the encoding from <meta> elements, consistent with the above case in the 
current spec and in line with the behavior of the parser 
(fwiw Gecko + Opera appear to ignore all <meta> elements declaring a 
charset of UTF-16).
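
The workaround amounts to something like the following. This is a 
minimal sketch, not the actual html5lib code: the function name is 
hypothetical, and treating the LE/BE labels the same way as plain 
"utf-16" is my assumption here.

```python
# Labels treated as "UTF-16" for this purpose (the LE/BE variants are
# an assumption of this sketch).
UTF16_LABELS = ("utf-16", "utf-16le", "utf-16be")


def encoding_from_meta(declared):
    """Normalise an encoding name taken from a <meta> element.

    A <meta> element that could be read at all cannot really be in a
    UTF-16 file, so a UTF-16 declaration is taken to mean UTF-8.
    """
    name = declared.strip().lower()
    if name in UTF16_LABELS:
        return "utf-8"
    return name
```

So encoding_from_meta("UTF-16") gives "utf-8", while other declared 
names are passed through lower-cased.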


> This brings me to another point: The whole HTML5 spec seems to be written
> with implementers, and implementers only, in mind. This is great to help
> get browser behavior aligned, but it creates an enormous problem: The
> majority of potential users of the spec, namely creators of content, and
> of tools creating content, are completely left out. As an example,
> trying to reverse-engineer how to indicate the character encoding
> inside an HTML5 document from point 4 in is completely impossible
> for content creators, webmasters, and the like.

How to specify a (non-transport layer) character encoding is described 
in [3]; the whole of section 8.2 is aimed at implementors. That said, 
more internal links could help people who land on the implementation 
instructions when they are after the authoring requirements.


"Mixed up signals
Bullet train
People snuffed out in the brutal rain"
--Conner Oberst

Received on Thursday, 1 November 2007 00:17:12 UTC