- From: James Graham <jg307@cam.ac.uk>
- Date: Thu, 01 Nov 2007 00:18:16 +0000
- To: duerst@it.aoyama.ac.jp, "public-html@w3.org WG" <public-html@w3.org>
> In 8.2.2.4, I have no idea what's the reason or purpose of point 1,
> which reads "If the new encoding is UTF-16, change it to UTF-8.".
> I suspect some misunderstanding.

AIUI, section 8.2.2.4 is only invoked if a <meta charset=""> or <meta content=""> attribute is found that specifies an encoding different from the one that has been used to parse the file to this point /and/ the encoding used thus far has come from one of the sources that provide "tentative" character encodings (in order):

* a prescan of the file for <meta> elements
* UA knowledge of the page's encoding (e.g. cached from a previous visit)
* chardet-like frequency analysis or some similar method for determining the encoding
* the default encoding

In particular, if the encoding is specified at the transport layer or a BOM is found, the encoding is "certain" rather than "tentative" and section 8.2.2.4 never applies.

Therefore, in order for a file that is really UTF-16 to have its encoding changed to UTF-16 by the steps of 8.2.2.4, the sources of "tentative" character encodings above would have to report a non-UTF-16 encoding that is sufficiently close to UTF-16 that the UA could still correctly interpret the <meta> element declaring the UTF-16 encoding. AFAIK, this is not possible (but I am not a character encoding expert). This leaves two options on encountering a <meta> tag declaring the file to be UTF-16: ignore it, or assume a typo and replace it with UTF-8. The spec currently takes the second of these options.

Actually the situation is somewhat worse than above because the <meta> prescan can, in principle, return UTF-16 as the character encoding even though it is not possible for the prescan to work on a file that is actually UTF-16. This is not dealt with in the current spec, but it is a known issue [1].
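
As a rough illustration of the logic described above, here is a minimal Python sketch. It is not taken from the spec or from html5lib: the names `Confidence` and `encoding_after_meta` are hypothetical, and the comparison of encoding labels is deliberately simplified to a lowercase string match. The last lines show why an ASCII-compatible tentative encoding cannot find a <meta> element in a file that really is UTF-16.

```python
from enum import Enum


class Confidence(Enum):
    """How the current encoding was obtained (hypothetical names)."""
    TENTATIVE = "tentative"  # prescan, cached knowledge, chardet, default
    CERTAIN = "certain"      # transport layer or BOM


def encoding_after_meta(current, confidence, declared):
    """Return (encoding, changed) after seeing a <meta> declaration.

    Simplified sketch: labels are compared as lowercase strings, and
    only the exact label "utf-16" is mapped to "utf-8".
    """
    current = current.strip().lower()
    if confidence is Confidence.CERTAIN:
        # A certain encoding is never overridden by <meta>.
        return current, False
    new = declared.strip().lower()
    if new == "utf-16":
        # The <meta> was readable, so the file cannot really be UTF-16;
        # assume a typo and use UTF-8 instead.
        new = "utf-8"
    return new, new != current


# In real UTF-16 every ASCII character is followed (or preceded) by a
# NUL byte, so a byte-level search for "<meta" finds nothing.
utf16_bytes = '<meta charset="utf-16">'.encode("utf-16-le")
print(b"<meta" in utf16_bytes)  # False
print(encoding_after_meta("windows-1252", Confidence.TENTATIVE, "UTF-16"))
```

The second option from the paragraph above (treat UTF-16 as a typo for UTF-8) is the one sketched here; a UA taking the first option would simply return the current encoding unchanged.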

The failure to deal with an inaccurate declaration of UTF-16 inside a <meta> element has been reported as a bug in html5lib [2], which we have fixed by taking UTF-16 to mean UTF-8 when determining the encoding from <meta> elements, consistent with the above case in the current spec and in line with the behavior of the validator.nu parser (fwiw Gecko + Opera appear to ignore all <meta> elements declaring a charset of UTF-16).

[1] http://canvex.lazyilluminati.com/misc/cgi/issues.cgi/message/<4B6BAB92-CC5A-4943-A92A-01F99C569761%40iki.fi>
[2] http://code.google.com/p/html5lib/issues/detail?id=55

> This brings me to another point: The whole HTML5 spec seems to be written
> with implementers, and implementers only, in mind. This is great to help
> get browser behavior aligned, but it creates an enormous problem: The
> majority of potential users of the spec, namely creators of content, and
> of tools creating content, are completely left out. As an example,
> trying to reverse-engineer how to indicate the character encoding
> inside an HTML5 document from point 4 in 8.2.2.1 is completely impossible
> for content creators, webmasters, and the like.

How to specify a (non-transport layer) character encoding is specified in 3.7.5.4 [3]; the whole of section 8.2 is aimed at implementors. That said, more internal links could help people who are looking at implementation instructions when they are after authoring requirements.

[3] http://www.whatwg.org/specs/web-apps/current-work/#charset

--
"Mixed up signals
Bullet train
People snuffed out in the brutal rain"
--Conner Oberst
Received on Thursday, 1 November 2007 00:17:12 UTC