- From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- Date: Thu, 15 Jul 2010 18:55:15 +0400
- To: Anne van Kesteren <annevk@opera.com>
- Cc: public-html@w3.org, Richard Ishida <ishida@w3.org>
Anne van Kesteren, Tue, 13 Jul 2010 22:19:55 +0200:
> On Tue, 13 Jul 2010 21:40:24 +0200, Richard Ishida <ishida@w3.org> wrote:
>> Thank you for beginning work on the polyglot document. I think it
>> will be very useful. FWIW, I would welcome an approach to the text
>> that made it more like an author-friendly "how-to" guide, rather
>> than spec text.
>>
>> I am about to raise 8 bugs in bugzilla. These comments have been
>> discussed by the i18n WG. I hope you find them helpful.
>>
>> FWIW, the i18n group keeps track of comments on your doc at
>> http://www.w3.org/International/reviews/1007-polyglot/
>
> FWIW, using <meta charset=utf-16> is an error in HTML.

First, when Sam said a _similar_ (but not the same) thing, Lachlan replied that UTF-16 is: [1] "perfectly acceptable for HTML". Based on my reading of what Lachlan said, I think I may have advised Elliot to say that <meta charset="UTF-16"/> should be legal in polyglot markup.

After seeing your comment, Anne, I now realize that <meta charset=utf-16> causes an HTML5 parser (or at least Validator.nu) to treat the document as UTF-8 (or, at any rate, it is illegal). Hence, yes, <meta charset="utf-16" /> is an error in HTML.

> When used in XML its value must be UTF-8.

Since XML parsers do not decide the encoding based on <meta charset="utf-16" />, I cannot see that you are correct here. But of course, if you are talking about _polyglot markup_, then <meta charset="utf-16" /> cannot be used, due to the problems in _HTML_ parsers.

> See
> http://www.whatwg.org/specs/web-apps/current-work/complete.html#attr-meta-charset

If you had pointed to the location in the multi-page version instead of the 10MB complete spec, then I could actually afford to look it up in the spec ... Alternatively, you could have said which section it is in. I found this in section '4.2.5.5. Specifying the document's character encoding' (in a PDF copy of 18th of February 2010), which I now understand much better:

]]
If an HTML document does not start with a BOM, and if its encoding is not explicitly given by Content-Type metadata, and the document is not an iframe srcdoc document, then the character encoding used must be an ASCII-compatible character encoding
[[

> For HTML the requirements are more complex, but UTF-16 is not allowed

I guess "more complex" translates to the following: in HTML5, UTF-16 _is_ permitted, provided that the encoding is specified via a BOM. Whereas, as noted above, specifying UTF-16 via <meta charset="utf-16" /> is not permitted. Pretty simple, actually.

> and when specified is treated as UTF-8 (though if the document is
> actually encoded as UTF-16 it would be decoded as UTF-16).

In conclusion, I believe the polyglot markup spec should continue to allow UTF-16, as long as the encoding is specified via a BOM.

[1] http://www.w3.org/mid/4BD00FC4.8010403@lachy.id.au
--
leif halvard silli
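
The rule argued for above (UTF-16 is acceptable when signalled by a BOM, while <meta charset="utf-16" /> is not) could be checked with something like the following rough sketch. The function name `polyglot_utf16_problems` and the regular expression are hypothetical helpers made up for illustration; they are not taken from any spec or validator, and the check ignores Content-Type metadata and the UTF-32 BOMs entirely.

```python
import codecs
import re
from typing import List

# The two UTF-16 byte order marks (little-endian and big-endian).
UTF16_BOMS = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Simplified, assumed pattern for a <meta charset=utf-16> declaration.
META_UTF16 = re.compile(r"""<meta\s+charset\s*=\s*["']?utf-16""", re.IGNORECASE)


def polyglot_utf16_problems(data: bytes) -> List[str]:
    """Return the problems found in a document that is meant to be UTF-16."""
    problems = []
    if not data.startswith(UTF16_BOMS):
        # Without a BOM (and without Content-Type metadata), the quoted
        # spec text requires an ASCII-compatible encoding, which UTF-16 is not.
        problems.append("no UTF-16 BOM")
        return problems
    # The "utf-16" codec reads the BOM to choose the byte order and strips it.
    text = data.decode("utf-16", errors="replace")
    if META_UTF16.search(text):
        problems.append("meta charset=utf-16 present")
    return problems


if __name__ == "__main__":
    doc = codecs.BOM_UTF16_LE + "<!DOCTYPE html><title>t</title>".encode("utf-16-le")
    print(polyglot_utf16_problems(doc))
```

A document that starts with a BOM and carries no such meta element yields an empty list; either of the two conditions discussed above adds one entry.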
Received on Thursday, 15 July 2010 14:55:55 UTC