Re: i18n comments on Polyglot Markup

On Thu, 15 Jul 2010 18:18:02 +0200, Richard Ishida <ishida@w3.org> wrote:
> My understanding from reading section 4.2.5.5 of the HTML5 spec[1]  is  
> the following:
>
> 1. An HTML5 document can be served as UTF-16.

Right.


> The following points assume that no encoding info is supplied in the  
> HTTP header.
>
> 2. If an html doc starts with a UTF-16 shaped BOM, any subsequent meta  
> encoding declarations are ignored. The browser sees the BOM and stops  
> looking for encoding information.  I don't, however, see anything  
> prohibiting the use of the UTF-16 encoded *character* sequence <meta  
> charset="utf-16"> (in UTF-16 characters) in the document. It will just  
> be ignored by the browser when the detection algorithm is run. (see  
> below).

http://www.whatwg.org/specs/web-apps/current-work/complete/semantics.html#attr-meta-charset

links to "character encoding declaration"

http://www.whatwg.org/specs/web-apps/current-work/complete/semantics.html#character-encoding-declaration

which says among a lot of other things "the character encoding used must  
be an ASCII-compatible character encoding" which links to

http://www.whatwg.org/specs/web-apps/current-work/complete/infrastructure.html#ascii-compatible-character-encoding

which makes it clear that UTF-16 is not ASCII-compatible, so a meta  
declaration naming it is invalid. The browser will ignore it, yes.
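
To illustrate the "ASCII-compatible" requirement, here is a rough sketch
in Python. The helper name looks_ascii_compatible is made up for this
mail, and the check is only an approximation of the spec's definition:
it tests whether a sample of ASCII characters encodes to the same single
bytes as in US-ASCII.

```python
import string

# Sample of characters that an ASCII-compatible encoding should map to
# the same single bytes as US-ASCII (an approximation, not the spec's
# exact byte list).
SAMPLE = string.ascii_letters + string.digits + ' <>="-/'


def looks_ascii_compatible(encoding: str) -> bool:
    """Return True if SAMPLE encodes to the same bytes as in US-ASCII."""
    return SAMPLE.encode(encoding) == SAMPLE.encode("ascii")


utf8_ok = looks_ascii_compatible("utf-8")
utf16_ok = looks_ascii_compatible("utf-16")
```

Under this approximation utf8_ok is true and utf16_ok is false, since
UTF-16 uses two bytes for each of these characters.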


> 3. If there is no BOM and the browser encounters the ASCII *byte*  
> sequence <meta charset="utf-16">, following the detection algorithm,  
> there is something wrong, because you wouldn't see that sequence of  
> bytes in UTF-16.  Therefore the browser assumes that the encoding is  
> actually UTF-8 and stops looking for other encoding information.

Right.
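
A minimal sketch of the order of checks discussed in points 2 and 3,
assuming no encoding information in the HTTP header. This is not the
spec's algorithm: the function name sniff_encoding, the regular
expression, and the 1024-byte window are simplifications chosen for
illustration.

```python
import re

# Very loose stand-in for the spec's byte-level prescan of meta elements.
META_CHARSET = re.compile(
    rb"""<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9_-]+)""", re.IGNORECASE
)


def sniff_encoding(data: bytes):
    """Return a guessed encoding label, or None if nothing was found."""
    # A BOM wins: any later meta declaration is never consulted.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"

    # No BOM: look for an ASCII byte sequence declaring a charset.
    match = META_CHARSET.search(data[:1024])
    if match:
        label = match.group(1).decode("ascii").lower()
        # ASCII bytes claiming UTF-16 cannot be right, so use UTF-8.
        if label.startswith("utf-16"):
            return "utf-8"
        return label

    # Nothing found; a browser might fall back to heuristics here.
    return None


with_bom = sniff_encoding(b"\xff\xfe" + '<meta charset="utf-16">'.encode("utf-16-le"))
ascii_meta = sniff_encoding(b'<meta charset="utf-16">')
```

Here with_bom is decided by the BOM alone, and ascii_meta comes out as
"utf-8" even though the declaration says utf-16.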


> 4. If a UTF-16 encoded doc does not start with a BOM, the meta  
> declaration will not be recognized as anything by the detection  
> algorithm, because the bytes don't match the pattern being looked for in  
> the algorithm.  A browser could, however, use heuristics after all else  
> fails to detect that the encoding is UTF-16, though this has nothing to  
> do with any meta element.

Right.
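
The byte-level reason is easy to see in a few lines of Python. The
variable names are illustrative only; the point is that the ASCII byte
pattern a prescan would look for never occurs in UTF-16 data.

```python
declaration = '<meta charset="utf-16">'
ascii_pattern = b"<meta"  # what a byte-level prescan would search for

as_ascii = declaration.encode("ascii")
as_utf16le = declaration.encode("utf-16-le")  # no BOM added
as_utf16be = declaration.encode("utf-16-be")  # no BOM added

# In UTF-16 every ASCII character is padded with a zero byte, so the
# contiguous ASCII byte sequence "<meta" is never present.
found_in_ascii = ascii_pattern in as_ascii
found_in_le = ascii_pattern in as_utf16le
found_in_be = ascii_pattern in as_utf16be

first_four_le = as_utf16le[:4]  # b'<\x00m\x00'
```

Only found_in_ascii is true, which is why a BOM-less UTF-16 document
leaves the browser with nothing but heuristics.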


> So I think it is fine to have a meta element in a utf-16 encoded  
> document - it just won't be used by the browser for detection.  It is  
> also best that utf-16 encoded documents start with a bom, to avoid  
> reliance on browser heuristics.

It is not fine. The declaration is non-conforming because UTF-16 is not  
an ASCII-compatible character encoding. See above.


> UTF-16 encoded XML documents, on the other hand, must start with a BOM,  
> see http://www.w3.org/TR/REC-xml/#charencoding  When the doc is treated  
> as XML, however, the meta element is ignored.
>
> Do you agree?

Not entirely (see the point about the meta element above), but mostly.


> [1] http://dev.w3.org/html5/spec/semantics.html#charset


-- 
Anne van Kesteren
http://annevankesteren.nl/

Received on Thursday, 15 July 2010 17:54:44 UTC