- From: <bugzilla@jessica.w3.org>
- Date: Sun, 20 Jun 2010 21:35:39 +0000
- To: public-html-bugzilla@w3.org
http://www.w3.org/Bugs/Public/show_bug.cgi?id=9962

Summary: Character Encoding
Product: HTML WG
Version: unspecified
Platform: All
URL: http://dev.w3.org/html5/html-xhtml-author-guide/html-xhtml-authoring-guide.html#character-encoding
OS/Version: All
Status: NEW
Severity: normal
Priority: P2
Component: HTML/XHTML Compatibility Authoring Guide (ed: Eliot Graff)
AssignedTo: eliotgra@microsoft.com
ReportedBy: xn--mlform-iua@xn--mlform-iua.no
QAContact: public-html-bugzilla@w3.org
CC: mike@w3.org, public-html@w3.org, xn--mlform-iua@xn--mlform-iua.no, eliotgra@microsoft.com

Replace the current section about encodings with something like this (the justification is given below this proposal):

]]
3. Character Encodings

For HTML compatibility, declaring the encoding via the XML declaration is forbidden: it has no effect in HTML and can trigger quirks mode in some HTML parsers. Only the default encodings of XML, UTF-8 and UTF-16, are thus permitted in polyglots, and of these only UTF-8 is a RECOMMENDED encoding.

Most HTML parsers, however, default to Windows-1252 or another 8-bit encoding. Thus, for HTML compatibility, the choice between UTF-8 and UTF-16 MUST be declared.

There are two ways to declare the choice of encoding. Either via the meta charset element, which only has effect in HTML parsers:

    <meta charset="utf-8"/>

Or by using the BOM. The BOM has effect in both HTML and XML parsers. But note that using the BOM is reported to have some legacy issues in very old HTML parsers. It is not forbidden to use <meta charset="*"/> in combination with the BOM, as long as it specifies the same encoding as the BOM.
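As an illustration only, the rule above (declare UTF-8 or UTF-16 via meta charset or BOM, and keep the two consistent when both are present) could be checked roughly as follows. This is a minimal sketch, not part of the proposed spec text: the function names, the result labels, and the simplified regular expression are all assumptions made for the example, and the 1024-character window only approximates how HTML parsers prescan the start of a document.

```python
import codecs
import re

# BOMs for the only encodings the proposal permits (codec names are Python's).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Simplified pattern for illustration; a real checker would use an HTML parser.
META_CHARSET = re.compile(
    r'<meta\s+charset\s*=\s*["\']?([A-Za-z0-9_-]+)', re.IGNORECASE
)


def bom_encoding(data: bytes):
    """Return the codec name implied by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def family(name: str) -> str:
    """Collapse utf-16, utf-16le and utf-16-be into one family name."""
    return "utf-16" if name.startswith("utf-16") else name


def check_declaration(data: bytes) -> str:
    """Classify a document as 'ok', 'undeclared', 'conflict' or 'forbidden'.

    These labels are hypothetical; they are not defined by any spec.
    """
    bom = bom_encoding(data)
    # With a BOM, decode accordingly; without one, latin-1 keeps ASCII
    # markup readable regardless of the real encoding.
    text = data.decode(bom or "latin-1", errors="replace")
    match = META_CHARSET.search(text[:1024])
    meta = match.group(1).lower() if match else None

    if bom is None and meta is None:
        return "undeclared"
    if meta is not None and family(meta) not in ("utf-8", "utf-16"):
        return "forbidden"
    if bom is not None and meta is not None and family(bom) != family(meta):
        return "conflict"
    return "ok"
```

For example, `check_declaration(b'<meta charset="utf-8"/>')` classifies the input as declared, while a UTF-8 BOM followed by `<meta charset="utf-16"/>` is reported as a conflict.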
Specifying the encoding via the <code>meta</code> <code>http-equiv="Content-Type"</code> element is confusing and NOT RECOMMENDED, and SHOULD trigger a warning in polyglot validators, as this element declares the Content-Type to be <code class="MIME">text/html</code>. In rare cases (for example, if a file read via the file URL protocol lacks an xhtml extension), this could affect whether the document is processed as <code>text/html</code> or <code>application/xhtml+xml</code>.

<span class="taken_from_HTML5">Note: Using non-UTF-8 encodings can have unexpected results on form submission and URL encodings, which use the document's character encoding by default.</span>

But the reason why the polyglot spec forbids encodings other than UTF-8 and UTF-16 is that, with the exception of the BOM (which has some legacy issues and which can only be used to declare UTF-8 and UTF-16 encodings), there does not exist any polyglot way to declare the encoding of a document.

When UTF-16 is used, the document should include the BOM indicating UTF-16LE or UTF-16BE.
[[

JUSTIFICATION:

The above proposal aims to solve the following problems with the current text:

---------------------------------------------------

<q> 3. Character Encoding<ins>s</ins> </q>

JUSTIFICATION: HTML5 uses the plural in its corresponding heading. *And* you do discuss more than a single encoding.

FOR CONSIDERATION: HTML5 has one section ("Character encodings") where it talks about encodings, and another section where it speaks about "Specifying the document's character encoding". This section is about the latter. It might make sense to reflect this in the title, but I don't have any proposal for now.

<q> A polyglot document uses either UTF-8 or UTF-16, although generally UTF-8 is preferred. </q>

COMMENT: At the bottom of this section, you say <q>If a polyglot document uses an encoding other than UTF-8 or UTF-16 […]</q>.
If other encodings are an option, then saying that polyglot documents use either UTF-8 or UTF-16 isn't accurate.

<q>If a polyglot document uses UTF-16, it should include the BOM indicating UTF-16LE or UTF-16BE. In addition, a polyglot document need not include the meta charset declaration, because the parser would have to read UTF-16 in order to parse it by definition.</q>

COMMENT: I get the impression that these two sentences speak only about UTF-16. However, it is not very clear that this is the case. Also, in the midst of this, you talk about the meta element, which is part of why it is unclear whether you are talking only about UTF-16 or more generally.

<q> In short, for correct character encoding, a polyglot document must either: </q>

COMMENT: I wonder about the use of "MUST", at least when I look at what follows.

<q> Use UTF-8 or UTF-16 with the appropriate BOM. </q>

COMMENT: It is unclear whether the advice about "appropriate BOM" also relates to UTF-8. Note that the I18N WG claims that there are compatibility issues with regard to the BOM for some legacy user agents, though I must recheck how legacy those user agents are ...

<blockquote> OR Use both the XML Declaration and meta tag to specify the appropriate character encoding. </blockquote>

COMMENT: Using the XML declaration triggers quirks mode in legacy IE. In fact, it may trigger quirks mode even in IE8! (If you do it right or, if you wish, if you do it "wrong".) [I can document it if you wish.] Therefore perhaps the need to use the XML declaration should be deleted (= only allow UTF-8/UTF-16). The more I think about it, the more I think we should forbid the XML declaration and only allow UTF-8.

<q>If a polyglot document uses an encoding other than UTF-8 or UTF-16, it must include the XML declaration; however, in this case the document must also include the HTML meta tag specifying the character set.
When a polyglot document uses both the XML declaration and the HTML meta tag, these must specify the same character and coding.</q>

COMMENT 1: See previous comment. Encodings other than UTF-8/UTF-16 should be forbidden. However, that does not mean that we do not need to specify the use of the meta charset element. Remember that HTML documents default to an 8-bit encoding, most often Windows-1252.

COMMENT 2: You do not mention the better option: to send the encoding info as an HTTP header. If one does that, then one may in fact skip the XML declaration also for non-UTF-8 encodings.

--
Configure bugmail: http://www.w3.org/Bugs/Public/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the QA contact for the bug.
Received on Sunday, 20 June 2010 21:35:41 UTC