- From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- Date: Thu, 15 Jul 2010 21:47:53 +0400
- To: Richard Ishida <ishida@w3.org>
- Cc: 'Anne van Kesteren' <annevk@opera.com>, public-html@w3.org
Richard Ishida, Thu, 15 Jul 2010 17:18:02 +0100:

> 2. If an html doc starts with a UTF-16 shaped BOM, any subsequent
> meta encoding declarations are ignored. The browser sees the BOM and
> stops looking for encoding information. I don't, however, see
> anything prohibiting the use of the UTF-16 encoded *character*
> sequence <meta charset="utf-16"> (in UTF-16 characters) in the
> document. It will just be ignored by the browser when the detection
> algorithm is run. (see below).
>
> 3. If there is no BOM and the browser encounters the ASCII *byte*
> sequence <meta charset="utf-16">, following the detection algorithm,
> there is something wrong, because you wouldn't see that sequence of
> bytes in UTF-16. Therefore the browser assumes that the encoding is
> actually UTF-8 and stops looking for other encoding information.

Thanks for the helpful questions/explanation. My first comment is that
I think you need to raise a bug with HTML5 if you want “the UTF-16
encoded *character* sequence <meta charset="utf-16">” to be valid in an
HTML document.

Secondly, here are some observations of what happens in Validator.nu
when validating an HTML5 document with polyglot markup, first as XHTML
and then as HTML. The document contains '<meta charset="UTF-16"/>' and
is UTF-16 encoded with a BOM.

When using XHTML validation (I uploaded it as a document with an .xhtml
suffix), I got this message:

]] Error: Bad value UTF-16 for attribute charset on XHTML element meta.
]] From line 4, column 4; to line 4, column 27
]] <head>↩ <meta charset="UTF-16"/>

That error message does not make sense for XHTML, does it? Why would an
XML validator say what the value of an attribute should be, when XML
parsers do not look at that attribute? However, the error message could
- eventually - have made sense for HTML. (Seemingly, Validator.nu tries
to ensure polyglot markup ...)

When using HTML validation, I got two messages:

]] Error: Internal encoding declaration specified utf-16 which is
]] not an ASCII superset.
]] Continuing as if the encoding had been utf-8.
]] From line 4, column 4; to line 4, column 27
]] <head>↩ <meta charset="UTF-16"/>↩ <t
]]
]] Error: Internal encoding declaration utf-8 disagrees with the
]] actual encoding of the document (utf-16).
]] From line 4, column 4; to line 4, column 27
]] <head>↩ <meta charset="UTF-16"/>↩ <t

[...]

> So I think it is fine to have a meta element in a utf-16 encoded
> document - it just won't be used by the browser for detection.

Again, then a bug against HTML5 is needed - it can't be solved in
Polyglot Markup alone.

> It is
> also best that utf-16 encoded documents start with a bom, to avoid
> reliance on browser heuristics.

In Polyglot Markup it is not only best, it is, as you have explained,
_required_ to use the BOM for UTF-16 encoded documents, in order to be
valid XML. But maybe you meant that this follows from what you say
here:

> UTF-16 encoded XML documents, on the other hand,

If you consider Polyglot Markup to be XML markup, then it does follow
... However, I think it is useful to say "Polyglot" or "Polyglot
Markup" rather than to see such documents as (syntactically) XML
documents.

> must start with a
> BOM, see http://www.w3.org/TR/REC-xml/#charencoding When the doc is
> treated as XML, however, the meta element is ignored.

When there is a BOM, then, for the special case of UTF-16, there seems
to be no difference between HTML and XHTML parsing: the META
declaration should be ignored. Thus, its only purpose becomes meta
information for authors etc. It could make sense to allow
'<meta charset="UTF-16"/>' in HTML documents, provided that the
document has a valid UTF-16 BOM. If there is no BOM, then I do wonder
whether it would be logical to require validators to perform heuristics
in order to decide whether '<meta charset="UTF-16"/>' provides correct
metadata about the document?
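
To make the BOM-first behaviour discussed above more concrete, here is a
rough Python sketch. It is only a simplified illustration, not the
normative HTML5 detection algorithm: the function name sniff_encoding,
the 1024-byte prescan window and the UTF-8 fallback are all assumptions
made for the example.

```python
import codecs
import re
from typing import Tuple

# Byte order marks checked first; a BOM ends the search for encoding info.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Matches the ASCII *byte* sequence of a <meta charset=...> declaration.
_META_CHARSET = re.compile(
    rb'''<meta\s+charset\s*=\s*["']?([A-Za-z0-9_-]+)''', re.IGNORECASE
)


def sniff_encoding(data: bytes) -> Tuple[str, str]:
    """Return (encoding, source); source is 'bom', 'meta' or 'default'.

    Hypothetical helper for illustration only.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            # With a BOM present, any later meta declaration is not consulted.
            return name, "bom"
    match = _META_CHARSET.search(data[:1024])  # prescan window is an assumption
    if match:
        declared = match.group(1).decode("ascii").lower()
        if declared.startswith("utf-16"):
            # The declaration was readable as ASCII bytes, so the document
            # cannot really be UTF-16; continue as if it had said UTF-8.
            declared = "utf-8"
        return declared, "meta"
    return "utf-8", "default"  # fallback chosen for the example only
```

For a UTF-16 document with a BOM, the sketch reports the BOM as the
source and never reads '<meta charset="UTF-16"/>', which is why that
element can only serve as information for authors in that case.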

The difference between polyglot and HTML documents would be that, in a
polyglot document, the validity of '<meta charset="UTF-16"/>' would
depend on the presence of a BOM, whereas in a pure HTML document, the
presence of an HTTP header which announces the encoding to be UTF-16
could probably also count as a reason to accept
'<meta charset="UTF-16"/>' as valid.

> Do you agree?

>> From: Anne van Kesteren [mailto:annevk@opera.com]
>> FWIW, using <meta charset=utf-16> is an error in HTML. When used in XML
>> its value must be UTF-8. See
>> http://www.whatwg.org/specs/web-apps/current-work/complete.html#attr-meta-charset
>>
>> For HTML the requirements are more complex, but UTF-16 is not allowed and
>> when specified is treated as UTF-8 (though if the document is actually
>> encoded as UTF-16 it would be decoded as UTF-16).

--
leif halvard silli
Received on Thursday, 15 July 2010 17:48:55 UTC