- From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- Date: Fri, 23 Jul 2010 01:32:07 +0300
- To: Henri Sivonen <hsivonen@iki.fi>
- Cc: public-html <public-html@w3.org>, Eliot Graff <eliotgra@microsoft.com>, public-i18n-core@w3.org
(Changed subject line to separate the encoding issue from NCRs.)

Henri Sivonen, Mon, 19 Jul 2010 06:35:02 -0700 (PDT):
> Leif wrote:
[ snip ]
>> A possible answer to your question is found in Sam's messages [1][2].
>> He suggest only to allow UTF-8 as encoding of polyglot markup.
>
> That steps outside logical inferences from specs to determine what's
> polyglot.

To be fair, Sam's idea was perhaps more that polyglots SHOULD use
UTF-8. And even if both UTF-8 and UTF-16 are polyglot encodings, it
seems justified - based on inference from HTML5 - to say that
polyglots SHOULD use UTF-8. Full story below.

> The logical inferences lead to a conclusion that polyglot
> documents can be constructed using UTF-8 and using UTF-16.

Hm. According to section F.1 "Detection Without External Encoding
Information" of XML 1.0, fifth edition:

]] […] each XML entity not accompanied by external encoding
information and not in UTF-8 or UTF-16 encoding must begin with an
XML encoding declaration […] [[

And in the same spec, section 4.3.3 "Character Encoding in Entities":

]] In the absence of external character encoding information (such as
MIME headers), parsed entities which are stored in an encoding other
than UTF-8 or UTF-16 must begin with a text declaration (see 4.3.1
The Text Declaration) containing an encoding declaration: [[

Thus, inferring from the above quotations, it seems that any encoding
is possible, provided one avoids the XML (encoding) declaration and
instead relies on external encoding information, typically HTTP
headers. Do you see any fallacy in this conclusion?

> There are other reasons to prefer UTF-8 over UTF-16, but polyglotness
> isn't one of them, so the WG shouldn't pretend that it is.

Actually, I believe HTML5 justifies a preference for UTF-8:

* HTML5 parsers MUST support UTF-8 (and Win-1252), but MAY support
  other encodings, including UTF-16 and UTF-32 [1].
* HTML5 says that: [2]
  a) authoring tools SHOULD default to UTF-8 for newly-created
     documents. (Roughly all polyglot markup is newly created!)
  b) authors are encouraged to use UTF-8,
  c) conformance checkers may warn against using "legacy encodings"
     (by the way, are UTF-16 and UTF-32 "legacy encodings"? From the
     context it seems that non-UTF-8 = legacy!)
  d) not using UTF-8 may lead to "unexpected results on form
     submission and URL encodings".

Thus I think we can infer from HTML5 that polyglot markup SHOULD use
UTF-8. (But HTML5 does not warn against the BOM - and so Polyglot
Markup can't warn against the BOM in UTF-8 either.)

References (taken from the February 18th snapshot of the spec):

[1] Section 10.2.2.2 Character encodings:
]] User agents must at a minimum support the UTF-8 and Windows-1252
encodings, but may support more. [[

[2] Section 4.2.5.5 Specifying the document's character encoding:
]] Authors are encouraged to use UTF-8. Conformance checkers may
advise authors against using legacy encodings. Authoring tools should
default to using UTF-8 for newly-created documents. […] Note: Using
non-UTF-8 encodings can have unexpected results on form submission
and URL encodings, which use the document's character encoding by
default. [[

-- 
leif halvard silli
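[Editor's illustration: the first stage of the Appendix F detection
quoted above - looking for a Byte Order Mark before any declaration is
consulted - can be sketched roughly as follows. The function name is
hypothetical; only the UTF-8 and UTF-16 BOMs discussed in this thread
are handled, and this is a sketch, not a full Appendix F
implementation.]

```python
def sniff_bom(first_bytes: bytes):
    """Return the encoding family indicated by a BOM, or None.

    Hypothetical sketch of the BOM stage of XML 1.0 Appendix F
    "Detection Without External Encoding Information".
    """
    if first_bytes.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    if first_bytes.startswith(b"\xfe\xff"):
        return "UTF-16BE"
    if first_bytes.startswith(b"\xff\xfe"):
        return "UTF-16LE"
    # No BOM: a real parser would next inspect the byte pattern of
    # "<?xml" to guess the encoding family and then read the encoding
    # declaration; absent both BOM and declaration (and any external
    # information such as an HTTP charset parameter), UTF-8 applies.
    return None
```

This is why UTF-8 and UTF-16 are the only encodings that need neither
an encoding declaration nor external information: both are detectable
from the raw bytes alone.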
Received on Friday, 23 July 2010 14:38:33 UTC