- From: Anne van Kesteren <annevk@opera.com>
- Date: Thu, 15 Jul 2010 19:53:50 +0200
- To: "Richard Ishida" <ishida@w3.org>
- Cc: public-html@w3.org, "'Leif Halvard Silli'" <xn--mlform-iua@xn--mlform-iua.no>
On Thu, 15 Jul 2010 18:18:02 +0200, Richard Ishida <ishida@w3.org> wrote:
> My understanding from reading section 4.2.5.5 of the HTML5 spec[1] is
> the following:
>
> 1. An HTML5 document can be served as UTF-16.

Right.

> The following points assume that no encoding info is supplied in the
> HTTP header.
>
> 2. If an html doc starts with a UTF-16 shaped BOM, any subsequent meta
> encoding declarations are ignored. The browser sees the BOM and stops
> looking for encoding information. I don't, however, see anything
> prohibiting the use of the UTF-16 encoded *character* sequence <meta
> charset="utf-16"> (in UTF-16 characters) in the document. It will just
> be ignored by the browser when the detection algorithm is run. (see
> below).

http://www.whatwg.org/specs/web-apps/current-work/complete/semantics.html#attr-meta-charset
links to "character encoding declaration"
http://www.whatwg.org/specs/web-apps/current-work/complete/semantics.html#character-encoding-declaration
which says among a lot of other things "the character encoding used must be
an ASCII-compatible character encoding" which links to
http://www.whatwg.org/specs/web-apps/current-work/complete/infrastructure.html#ascii-compatible-character-encoding
which makes it clear UTF-16 is invalid. The browser will ignore it, yes.

> 3. If there is no BOM and the browser encounters the ASCII *byte*
> sequence <meta charset="utf-16">, following the detection algorithm,
> there is something wrong, because you wouldn't see that sequence of
> bytes in UTF-16. Therefore the browser assumes that the encoding is
> actually UTF-8 and stops looking for other encoding information.

Right.

> 4. If a UTF-16 encoded doc does not start with a BOM, the meta
> declaration will not be recognized as anything by the detection
> algorithm, because the bytes don't match the pattern being looked for in
> the algorithm. A browser could, however, use heuristics after all else
> fails to detect that the encoding is UTF-16, though this has nothing to
> do with any meta element.

Right.

> So I think it is fine to have a meta element in a utf-16 encoded
> document - it just won't be used by the browser for detection. It is
> also best that utf-16 encoded documents start with a bom, to avoid
> reliance on browser heuristics.

It is not fine. See above.

> UTF-16 encoded XML documents, on the other hand, must start with a BOM,
> see http://www.w3.org/TR/REC-xml/#charencoding When the doc is treated
> as XML, however, the meta element is ignored.
>
> Do you agree?

Not entirely, but mostly.

> [1] http://dev.w3.org/html5/spec/semantics.html#charset

-- 
Anne van Kesteren
http://annevankesteren.nl/
Received on Thursday, 15 July 2010 17:54:44 UTC
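
The detection order discussed in points 2 to 4 above can be illustrated with a rough Python sketch. This is not the spec's prescan algorithm, only a simplified approximation of it: the function name `sniff_encoding`, the regular expression, and the 1024-byte prescan limit are all assumptions made for illustration, and the heuristic fallback mentioned in point 4 is left out.

```python
import re

# Hypothetical, simplified pattern: matches the ASCII *bytes* of a meta
# element carrying a charset value. The real prescan is more involved.
META_CHARSET = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)

def sniff_encoding(data: bytes, prescan_limit: int = 1024):
    """Return a guessed encoding label, or None if nothing was found."""
    # Point 2: a BOM wins, and any later meta declaration is ignored.
    if data.startswith(b'\xfe\xff'):
        return 'utf-16be'
    if data.startswith(b'\xff\xfe'):
        return 'utf-16le'
    if data.startswith(b'\xef\xbb\xbf'):
        return 'utf-8'

    # Points 3 and 4: look for the ASCII byte sequence of a meta charset
    # declaration. A BOM-less UTF-16 document never matches here, because
    # every ASCII character is interleaved with a zero byte.
    match = META_CHARSET.search(data[:prescan_limit])
    if match:
        label = match.group(1).decode('ascii').lower()
        # Point 3: ASCII bytes claiming UTF-16 cannot be right, so the
        # declaration is treated as UTF-8 instead.
        if label.startswith('utf-16'):
            return 'utf-8'
        return label

    # Nothing found; a browser might fall back to heuristics here.
    return None
```

For example, `sniff_encoding(b'<meta charset="utf-16">')` takes the point 3 branch, while the same markup encoded as BOM-less UTF-16 falls through to `None`, matching point 4.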