i18n Polyglot Markup/Encodings

(Changed subject line to separate the encoding issue from NCRs.)

Henri Sivonen, Mon, 19 Jul 2010 06:35:02 -0700 (PDT):
> Leif wrote:
   [ snip ]
>> A possible answer to your question is found in Sam's messages [1][2].
>> He suggests allowing only UTF-8 as the encoding of polyglot markup.
> 
> That steps outside logical inferences from specs to determine what's 
> polyglot.

To be fair, Sam's idea was perhaps more that polyglots SHOULD use 
UTF-8. And even if both UTF-8 and UTF-16 are polyglot encodings, it 
seems justified - based on inference from HTML5 - to say that polyglots 
SHOULD use UTF-8. Full story below.

> The logical inferences lead to a conclusion that polyglot 
> documents can be constructed using UTF-8 and using UTF-16.

Hm. According to section F.1 "Detection Without External Encoding 
Information" of XML 1.0, fifth edition:

	]] […] each XML entity not accompanied by external encoding 
information and not in UTF-8 or UTF-16 encoding must begin with an XML 
encoding declaration […] [[

And in the same spec, section 4.3.3 "Character Encoding in Entities":

	]] In the absence of external character encoding information (such as 
MIME headers), parsed entities which are stored in an encoding other 
than UTF-8 or UTF-16 must begin with a text declaration (see 4.3.1 The 
Text Declaration) containing an encoding declaration: [[

Thus, inferring from the above quotations, any encoding seems to be 
possible in polyglot markup, provided one avoids the XML (encoding) 
declaration and instead relies on external encoding information, 
typically the charset parameter of the HTTP Content-Type header.
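
As a rough sketch of the rule in the two quotations (Python, purely 
illustrative; the function name and the simplified matching of encoding 
names are my own assumptions, not anything defined by the XML spec):

```python
# Encodings that the quoted XML 1.0 text exempts from the declaration rule.
DECLARATION_FREE = {"utf-8", "utf-16"}

def xml_needs_encoding_declaration(encoding, external_charset=None):
    """Return True if the entity must begin with an encoding declaration.

    `external_charset` stands for external encoding information such as
    the charset parameter of an HTTP Content-Type header (None if absent).
    """
    if external_charset is not None:
        # External encoding information is present, so the quoted
        # requirement does not apply.
        return False
    return encoding.strip().lower() not in DECLARATION_FREE
```

With no external information, ISO-8859-1 needs the declaration while 
UTF-8 and UTF-16 do not; once a charset is supplied externally, none of 
them does.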

Do you see any fallacy in this conclusion?

> There are other reasons to prefer UTF-8 over UTF-16, but polyglotness 
> isn't one of them, so the WG shouldn't pretend that it is.

Actually, I believe HTML5 justifies a preference for UTF-8:

  * HTML5 parsers MUST support UTF-8 (and Windows-1252), but MAY
    support other encodings, including UTF-16 and UTF-32 [1].
  * HTML5 says that [2]:
    a) authoring tools SHOULD default to UTF-8 for newly-created
       documents (roughly all polyglot markup is newly-created!);
    b) authors are encouraged to use UTF-8;
    c) conformance checkers MAY advise against using "legacy
       encodings" (btw, are UTF-16 and UTF-32 "legacy encodings"?
       From the context it seems like non-UTF-8 = legacy!);
    d) not using UTF-8 can have "unexpected results on form
       submission and URL encodings".

Thus I think we can infer from HTML5 that polyglot markup SHOULD use 
UTF-8. (But HTML5 does not warn against the BOM - and so Polyglot 
Markup can't warn against the BOM in UTF-8 either.)
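
To make the SHOULD-level reading concrete, here is a minimal sketch of 
what a checker could do (Python; the helper name, its return shape and 
the message text are hypothetical, not taken from any existing 
conformance checker):

```python
import codecs

def check_polyglot_encoding(data, declared_encoding):
    """Hypothetical helper: return (advice, has_utf8_bom) for a document."""
    advice = []
    if declared_encoding.strip().lower() != "utf-8":
        # SHOULD-level advice only: other encodings are not declared invalid.
        advice.append(
            "polyglot markup SHOULD use UTF-8, found %s" % declared_encoding
        )
    # The UTF-8 BOM is detected but deliberately not reported as a problem.
    has_utf8_bom = data.startswith(codecs.BOM_UTF8)
    return advice, has_utf8_bom
```

A UTF-8 document with a BOM yields no advice at all, while a UTF-16 
document yields one advisory message rather than an error.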

References (taken from the February 18th snapshot of the spec):

[1] Section 10.2.2.2 Character encodings: 
    ]] User agents must at a minimum support the UTF-8 and
       Windows-1252 encodings, but may support more.[[

[2] Section 4.2.5.5 Specifying the document's character encoding:
    ]] Authors are encouraged to use UTF-8. Conformance checkers
       may advise authors against using legacy encodings.
       Authoring tools should default to using UTF-8 for 
       newly-created documents. […] 
       Note: Using non-UTF-8 encodings can have unexpected results
       on form submission and URL encodings, which use the 
       document's character encoding by default.[[
-- 
leif halvard silli

Received on Friday, 23 July 2010 14:38:33 UTC