- From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- Date: Fri, 23 Jul 2010 01:32:07 +0300
- To: Henri Sivonen <hsivonen@iki.fi>
- Cc: public-html <public-html@w3.org>, Eliot Graff <eliotgra@microsoft.com>, public-i18n-core@w3.org
(Changed subject line to separate the encoding issue from NCRs.)
Henri Sivonen, Mon, 19 Jul 2010 06:35:02 -0700 (PDT):
> Leif wrote:
[ snip ]
>> A possible answer to your question is found in Sam's messages [1][2].
>> He suggested allowing only UTF-8 as the encoding of polyglot markup.
>
> That steps outside logical inferences from specs to determine what's
> polyglot.
To be fair, Sam's idea was perhaps more that polyglots SHOULD use
UTF-8. And even if both UTF-8 and UTF-16 are polyglot encodings, it
seems justified - based on inference from HTML5 - to say that polyglots
SHOULD use UTF-8. Full story below.
> The logical inferences lead to a conclusion that polyglot
> documents can be constructed using UTF-8 and using UTF-16.
Hm. According to section F.1 "Detection Without External Encoding
Information" of XML 1.0, fifth edition:
]] […] each XML entity not accompanied by external encoding
information and not in UTF-8 or UTF-16 encoding must begin with an XML
encoding declaration […] [[
And in the same spec, section 4.3.3 "Character Encoding in Entities":
]] In the absence of external character encoding information (such as
MIME headers), parsed entities which are stored in an encoding other
than UTF-8 or UTF-16 must begin with a text declaration (see 4.3.1 The
Text Declaration) containing an encoding declaration: [[
Thus, inferring from the above quotations, it seems like any encoding
is possible, provided one avoids the XML (encoding) declaration and
instead relies on external encoding information, typically HTTP headers.
Do you see any fallacy in this conclusion?
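For what it's worth, the mechanism is easy to sketch in Python (the
header value and document bytes below are invented for illustration):
the consumer takes the charset from the Content-Type header, so no XML
encoding declaration is needed inside the document itself.

```python
from email.message import Message

# Hypothetical HTTP response headers: the charset travels as
# external encoding information, not as an XML encoding declaration.
headers = Message()
headers['Content-Type'] = 'application/xhtml+xml; charset=iso-8859-1'

# Document bytes on the wire, stored in ISO-8859-1 (no <?xml ... ?>).
body = '<p>caf\xe9</p>'.encode('iso-8859-1')

# External encoding information, per XML 1.0 section 4.3.3:
charset = headers.get_content_charset()   # 'iso-8859-1'
document = body.decode(charset)
print(document)                           # <p>café</p>
```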
> There are other reasons to prefer UTF-8 over UTF-16, but polyglotness
> isn't one of them, so the WG shouldn't pretend that it is.
Actually, I believe HTML5 justifies a preference for UTF-8:
* HTML5 parsers MUST support UTF-8 (and Win-1252), but MAY
support other encodings, including UTF-16 and UTF-32 [1].
* HTML5 says that: [2]
a) authoring tools SHOULD default to UTF-8 for newly-created
docs. (Roughly all polyglot markup is newly-created!)
b) authors are encouraged to use UTF-8,
c) conformance checkers may warn against using "legacy
encodings" (btw, are UTF-16 and UTF-32 "legacy encodings"?
- from the context it seems like non-UTF-8 = legacy!)
d) not using UTF-8 may lead to "unexpected results on form
submission and URL encodings"
Thus I think we can infer from HTML5 that polyglot markup SHOULD use
UTF-8. (But HTML5 does not warn against the BOM - and so Polyglot
Markup can't warn against the BOM in UTF-8 either.)
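As an aside on the BOM point: the UTF-8 BOM is the byte sequence
EF BB BF, so checking for it is a one-liner (a minimal sketch; the
sample documents are made up):

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    # The UTF-8 BOM is the three bytes EF BB BF.
    return data.startswith(codecs.BOM_UTF8)

with_bom = codecs.BOM_UTF8 + b'<!DOCTYPE html><html>...</html>'
without_bom = b'<!DOCTYPE html><html>...</html>'

print(has_utf8_bom(with_bom))     # True
print(has_utf8_bom(without_bom))  # False
```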
References (taken from the February 18th snapshot of the spec):
[1] Section 10.2.2.2 Character encodings:
]] User agents must at a minimum support the UTF-8 and
Windows-1252 encodings, but may support more.[[
[2] Section 4.2.5.5 Specifying the document's character encoding:
]] Authors are encouraged to use UTF-8. Conformance checkers
may advise authors against using legacy encodings.
Authoring tools should default to using UTF-8 for
newly-created documents. […]
Note: Using non-UTF-8 encodings can have unexpected results
on form submission and URL encodings, which use the
document's character encoding by default.[[
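The note's URL-encoding point is easy to demonstrate (a sketch, with
quote()'s encoding parameter standing in for "the document's
character encoding"): the same character percent-encodes differently
depending on the encoding of the submitting document.

```python
from urllib.parse import quote

# The same character, percent-encoded under two document encodings:
print(quote('é', encoding='utf-8'))         # %C3%A9
print(quote('é', encoding='windows-1252'))  # %E9
```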
--
leif halvard silli
Received on Friday, 23 July 2010 14:38:33 UTC