- From: Richard Ishida <ishida@w3.org>
- Date: Thu, 15 Jul 2010 17:18:02 +0100
- To: "'Anne van Kesteren'" <annevk@opera.com>
- Cc: <public-html@w3.org>, "'Leif Halvard Silli'" <xn--mlform-iua@xn--mlform-iua.no>
Hi Anne,

My understanding from reading section 4.2.5.5 of the HTML5 spec[1] is the following:

1. An HTML5 document can be served as UTF-16.

The following points assume that no encoding info is supplied in the HTTP header.

2. If an HTML doc starts with a UTF-16 shaped BOM, any subsequent meta encoding declarations are ignored. The browser sees the BOM and stops looking for encoding information. I don't, however, see anything prohibiting the use of the UTF-16 encoded *character* sequence <meta charset="utf-16"> (in UTF-16 characters) in the document. It will just be ignored by the browser when the detection algorithm is run (see below).

3. If there is no BOM and the browser encounters the ASCII *byte* sequence <meta charset="utf-16">, following the detection algorithm, there is something wrong, because you wouldn't see that sequence of bytes in UTF-16. Therefore the browser assumes that the encoding is actually UTF-8 and stops looking for other encoding information.

4. If a UTF-16 encoded doc does not start with a BOM, the meta declaration will not be recognized as anything by the detection algorithm, because the bytes don't match the pattern being looked for in the algorithm. A browser could, however, use heuristics after all else fails to detect that the encoding is UTF-16, though this has nothing to do with any meta element.

So I think it is fine to have a meta element in a UTF-16 encoded document - it just won't be used by the browser for detection. It is also best that UTF-16 encoded documents start with a BOM, to avoid reliance on browser heuristics.

UTF-16 encoded XML documents, on the other hand, must start with a BOM, see http://www.w3.org/TR/REC-xml/#charencoding
When the doc is treated as XML, however, the meta element is ignored.

Do you agree?
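To make points 2 to 4 concrete, here is a rough Python sketch of the behaviour as I understand it. It is not the spec's algorithm: the names `sniff_encoding` and `META_CHARSET`, the simplified byte pattern, and the 1024-byte `prescan_limit` are all assumptions made for illustration, and the real prescan handles many more cases.

```python
import codecs
import re

# Simplified ASCII *byte* pattern for a meta charset declaration.
# Assumption: the real prescan is considerably more involved than this regex.
# \x27 is the single-quote character, so both quote styles are accepted.
META_CHARSET = re.compile(rb'<meta[^>]*?charset\s*=\s*["\x27]?([\w-]+)', re.IGNORECASE)


def sniff_encoding(data: bytes, prescan_limit: int = 1024):
    """Return (encoding, source), assuming no charset came from the HTTP header."""
    # Point 2: a UTF-16 shaped BOM wins, and any meta declaration is ignored.
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le", "bom"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be", "bom"

    # Points 3 and 4: look for the declaration as ASCII bytes only.
    match = META_CHARSET.search(data[:prescan_limit])
    if match:
        label = match.group(1).decode("ascii").lower()
        # Point 3: these bytes could not appear in real UTF-16,
        # so a declared UTF-16 label is treated as UTF-8.
        if label in ("utf-16", "utf-16le", "utf-16be"):
            label = "utf-8"
        return label, "meta"

    # Point 4: nothing matched; the browser is left with its heuristics.
    return None, "heuristics"
```

Feeding it a BOM-prefixed UTF-16 document, the plain ASCII bytes of the declaration, and a BOM-less UTF-16 document should exercise points 2, 3 and 4 respectively.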
RI

[1] http://dev.w3.org/html5/spec/semantics.html#charset

============
Richard Ishida
Internationalization Lead
W3C (World Wide Web Consortium)

http://www.w3.org/International/
http://rishida.net/

> -----Original Message-----
> From: Anne van Kesteren [mailto:annevk@opera.com]
> Sent: 13 July 2010 21:20
> To: public-html@w3.org; Richard Ishida
> Subject: Re: i18n comments on Polyglot Markup
>
> On Tue, 13 Jul 2010 21:40:24 +0200, Richard Ishida <ishida@w3.org> wrote:
> > Thank you for beginning work on the polyglot document. I think it will
> > be very useful. FWIW, I would welcome an approach to the text that made
> > it more like an author-friendly "how-to" guide, rather than spec text.
> >
> > I am about to raise 8 bugs in bugzilla. These comments have been
> > discussed by the i18n WG. I hope you find them helpful.
> >
> > FWIW, the i18n group keeps track of comments on your doc at
> > http://www.w3.org/International/reviews/1007-polyglot/
>
> FWIW, using <meta charset=utf-16> is an error in HTML. When used in XML
> its value must be UTF-8. See
> http://www.whatwg.org/specs/web-apps/current-work/complete.html#attr-meta-charset
>
> For HTML the requirements are more complex, but UTF-16 is not allowed and
> when specified is treated as UTF-8 (though if the document is actually
> encoded as UTF-16 it would be decoded as UTF-16).
>
>
> --
> Anne van Kesteren
> http://annevankesteren.nl/
Received on Thursday, 15 July 2010 16:18:41 UTC