
Re: i18n comments on Polyglot Markup

From: Anne van Kesteren <annevk@opera.com>
Date: Thu, 15 Jul 2010 19:53:50 +0200
To: "Richard Ishida" <ishida@w3.org>
Cc: public-html@w3.org, "'Leif Halvard Silli'" <xn--mlform-iua@xn--mlform-iua.no>
Message-ID: <op.vfwbzjje64w2qv@annevk-t60>

On Thu, 15 Jul 2010 18:18:02 +0200, Richard Ishida <ishida@w3.org> wrote:
> My understanding from reading section of the HTML5 spec[1]  is  
> the following:
> 1. An HTML5 document can be served as UTF-16.


> The following points assume that no encoding info is supplied in the  
> HTTP header.
> 2. If an html doc starts with a UTF-16 shaped BOM, any subsequent meta  
> encoding declarations are ignored. The browser sees the BOM and stops  
> looking for encoding information.  I don't, however, see anything  
> prohibiting the use of the UTF-16 encoded *character* sequence <meta  
> charset="utf-16"> (in UTF-16 characters) in the document. It will just  
> be ignored by the browser when the detection algorithm is run. (see  
> below).


The section you cite links to "character encoding declaration", which
says, among a lot of other things, "the character encoding used must be
an ASCII-compatible character encoding". That phrase in turn links to a
definition which makes it clear UTF-16 is invalid there. The browser
will ignore it, yes.
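
As a rough illustration of the two points above (a BOM wins over any
later declaration, and UTF-16 is not ASCII-compatible), here is a minimal
Python sketch. The names sniff_bom and is_ascii_compatible are
hypothetical, and the compatibility check is a simplification of the
spec's definition: it only tests whether one sample string encodes to the
same bytes as it does in ASCII.

```python
import codecs


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None


def is_ascii_compatible(encoding: str) -> bool:
    """Simplified check (an assumption, not the spec's wording): does ASCII
    markup encode to the same bytes in this encoding as in ASCII?"""
    sample = '<meta charset="x">'
    return sample.encode(encoding) == sample.encode("ascii")


# A UTF-16 (little-endian) BOM followed by "<" encoded as two bytes.
print(sniff_bom(b"\xff\xfe<\x00"))
print(is_ascii_compatible("utf-8"))
print(is_ascii_compatible("utf-16-le"))
```

Under these assumptions, a document that starts with FF FE or FE FF is
settled before any meta element is looked at, and UTF-16 fails the
compatibility check because every ASCII character takes two bytes.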

> 3. If there is no BOM and the browser encounters the ASCII *byte*  
> sequence <meta charset="utf-16">, following the detection algorithm,  
> there is something wrong, because you wouldn't see that sequence of  
> bytes in UTF-16.  Therefore the browser assumes that the encoding is  
> actually UTF-8 and stops looking for other encoding information.
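
The behaviour described in point 3 can be sketched roughly as follows.
This is a deliberately simplified stand-in for the detection algorithm,
not the algorithm itself: the regular expression, the function name
prescan, and the 1024-byte limit are all assumptions made for
illustration.

```python
import re

# Hypothetical, simplified pattern: an ASCII "<meta ... charset=LABEL".
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)

UTF16_LABELS = {"utf-16", "utf-16le", "utf-16be"}


def prescan(data: bytes, limit: int = 1024):
    """Look for an ASCII-byte meta charset declaration near the start."""
    match = META_CHARSET.search(data[:limit])
    if match is None:
        return None
    label = match.group(1).decode("ascii").lower()
    # If the declaration was readable as ASCII bytes, the document cannot
    # really be UTF-16, so a UTF-16 label is treated as UTF-8 instead.
    if label in UTF16_LABELS:
        return "utf-8"
    return label


print(prescan(b'<html><head><meta charset="utf-16">'))
print(prescan(b'<meta charset="ISO-8859-1">'))
```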


> 4. If a UTF-16 encoded doc does not start with a BOM, the meta  
> declaration will not be recognized as anything by the detection  
> algorithm, because the bytes don't match the pattern being looked for in  
> the algorithm.  A browser could, however, use heuristics after all else  
> fails to detect that the encoding is UTF-16, though this has nothing to  
> do with any meta element.
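
To make point 4 concrete, the following small Python sketch compares the
bytes of the same declaration in ASCII and in BOM-less UTF-16. The helper
name contains_ascii_meta is hypothetical, and the byte search is only a
stand-in for the pattern matching the detection algorithm performs.

```python
DECLARATION = '<meta charset="utf-16">'


def contains_ascii_meta(data: bytes) -> bool:
    """True if the ASCII byte sequence "<meta" appears in the data."""
    return b"<meta" in data.lower()


ascii_bytes = DECLARATION.encode("ascii")
utf16_bytes = DECLARATION.encode("utf-16-le")  # "utf-16-le" emits no BOM

# In UTF-16 each ASCII character is followed (or preceded) by a zero
# byte, so the ASCII pattern never appears as a contiguous run.
print(ascii_bytes[:5])
print(utf16_bytes[:10])
print(contains_ascii_meta(ascii_bytes))
print(contains_ascii_meta(utf16_bytes))
```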


> So I think it is fine to have a meta element in a UTF-16 encoded
> document - it just won't be used by the browser for detection.  It is
> also best that UTF-16 encoded documents start with a BOM, to avoid
> reliance on browser heuristics.

It is not fine. See above.

> UTF-16 encoded XML documents, on the other hand, must start with a BOM,  
> see http://www.w3.org/TR/REC-xml/#charencoding  When the doc is treated  
> as XML, however, the meta element is ignored.
> Do you agree?

Not entirely, but mostly.

> [1] http://dev.w3.org/html5/spec/semantics.html#charset

Anne van Kesteren
Received on Thursday, 15 July 2010 17:54:44 UTC
