[Bug 9962] New: Character Encoding

http://www.w3.org/Bugs/Public/show_bug.cgi?id=9962

           Summary: Character Encoding
           Product: HTML WG
           Version: unspecified
          Platform: All
               URL: http://dev.w3.org/html5/html-xhtml-author-guide/html-xhtml-authoring-guide.html#character-encoding
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HTML/XHTML Compatibility Authoring Guide (ed: Eliot
                    Graff)
        AssignedTo: eliotgra@microsoft.com
        ReportedBy: xn--mlform-iua@xn--mlform-iua.no
         QAContact: public-html-bugzilla@w3.org
                CC: mike@w3.org, public-html@w3.org,
                    xn--mlform-iua@xn--mlform-iua.no,
                    eliotgra@microsoft.com


Replace the current section about encodings with something like this:

(The justification is given below this proposal)

]]
3. Character Encodings

For HTML-compatibility, declaring the encoding via the XML declaration is
forbidden: it has no effect in HTML and can trigger Quirks-Mode in some HTML
parsers. Thus only the default encodings of XML, UTF-8 and UTF-16, are
permitted in polyglots, and of these only UTF-8 is a RECOMMENDED encoding. Most
HTML parsers, however, default to Windows-1252 or another 8-bit encoding. Thus,
for HTML-compatibility, the choice between UTF-8 and UTF-16 MUST be declared.

There are two ways to declare the choice of encoding. Either via the meta
charset element — this only has effect in HTML parsers:

<meta charset="utf-8"/>

Or by using the BOM. The BOM has effect in both HTML and XML parsers. But note
that using the BOM is reported to have some legacy issues in very old HTML
parsers.  

It is not forbidden to use <meta charset="*"/> in combination with BOM, as long
as it specifies the same as the BOM.
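
A polyglot validator could check this agreement roughly as follows. This is only
a sketch in Python: the function name, the regular expression and the table of
labels accepted for each BOM are assumptions made for illustration, not
something taken from HTML5 or from the guide.

```python
import codecs
import re

# Assumed mapping: BOM bytes, codec used to decode, and the meta charset
# labels treated here as consistent with that BOM.
BOM_LABELS = [
    (codecs.BOM_UTF8, "utf-8-sig", {"utf-8"}),
    (codecs.BOM_UTF16_LE, "utf-16", {"utf-16", "utf-16le"}),
    (codecs.BOM_UTF16_BE, "utf-16", {"utf-16", "utf-16be"}),
]

# Deliberately simple pattern; a real checker would use an HTML parser.
META_CHARSET = re.compile(
    r'<meta\s+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def bom_and_meta_agree(data: bytes) -> bool:
    """Return False only when a BOM is present and meta charset contradicts it."""
    for bom, codec, accepted in BOM_LABELS:
        if data.startswith(bom):
            text = data.decode(codec, errors="replace")
            match = META_CHARSET.search(text)
            return match is None or match.group(1).lower() in accepted
    return True  # no BOM, so there is nothing for meta charset to contradict
```

The sketch ignores UTF-32, whose little-endian BOM begins with the same two
bytes as the UTF-16LE BOM.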

Specifying the encoding via the <code>meta</code>
<code>http-equiv="Content-Type"</code> element is confusing and NOT
RECOMMENDED, and SHOULD trigger a warning in polyglot validators, as this
element declares the Content-Type to be <code class="MIME">text/html</code>. In
rare cases (for example, if a file read via the file URL protocol lacks an
xhtml extension), this could affect whether the document is processed as
<code>text/html</code> or <code>application/xhtml+xml</code>.

<span class="taken_from_HTML5">Note: Using non-UTF-8 can have unexpected
results on form submission and URL encodings, which use the document's
character encoding by default.</span> But the reason why the polyglot spec
forbids encodings other than UTF-8 and UTF-16 is that, with the exception of
using the BOM (which has some legacy issues and which can only be used to
declare UTF-8 and UTF-16 encodings), there does not exist any polyglot way to
declare the encoding of a document.

When UTF-16 is used, the document should include the BOM indicating UTF-16LE or
UTF-16BE. 
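
As an illustration of what including the BOM means at the byte level, here is a
small Python sketch. The helper name is hypothetical; it relies on the fact
that Python's "utf-16-le" and "utf-16-be" codecs do not write a BOM themselves,
so the BOM is prepended explicitly.

```python
import codecs


def encode_utf16_with_bom(text: str, big_endian: bool = False) -> bytes:
    """Encode text as UTF-16 and prepend the BOM matching the byte order."""
    if big_endian:
        return codecs.BOM_UTF16_BE + text.encode("utf-16-be")
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
```

Decoding the result with Python's generic "utf-16" codec reads the BOM, picks
the byte order from it and strips it again.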
[[



JUSTIFICATION: The above proposal aims to solve the following problems with the
current text:
---------------------------------------------------

<q>
3. Character Encoding<ins>s</ins>
</q>

JUSTIFICATION: HTML5 uses the plural in its corresponding heading. *And* you do
discuss more than a single encoding.

FOR CONSIDERATION: HTML5 has one section ("Character encodings") where it talks
about encodings, and another section where it speaks about "Specifying the
document's character encoding". This section is about the latter. It would be
conceivable to reflect this in the title, but I don't have any proposal for now.

<q>
A polyglot document uses either UTF-8 or UTF-16, although generally UTF-8 is
preferred.
</q>

COMMENT: At the bottom of this section, you say <q>If a polyglot document uses
an encoding other than UTF-8 or UTF-16 […]</q>. If other encodings are an
option, then saying that polyglot documents use either UTF-8 or UTF-16 isn't
accurate.

<q>If a polyglot document uses UTF-16, it should include the BOM indicating
UTF-16LE or UTF-16BE. In addition, a polyglot document need not include the
meta charset declaration, because the parser would have to read UTF-16 in order
to parse it by definition.</q>

COMMENT: I get the impression that these 2 sentences speak only about UTF-16.
However, it is not very clear that this is the case. Also, in the midst of
this, you talk about the meta element, which is part of why it is unclear
whether you are talking only about UTF-16 or more generally.

<q>
In short, for correct character encoding, a polyglot document must either:
</q>

COMMENT: I wonder about the use of "MUST", at least when I look at what
follows.

<q>
Use UTF-8 or UTF-16 with the appropriate BOM.
</q>

COMMENT: It is unclear whether the advice about "appropriate BOM" also relates
to UTF-8. Note that the I18N WG claims that there are compatibility issues with
regard to the BOM for some legacy user agents, though I must recheck how legacy
those user agents are ...

<blockquote>
OR
Use both the XML Declaration and meta tag to specify the appropriate character
encoding.
</blockquote>

COMMENT: Using the XML Declaration triggers quirks-mode in legacy IE. In fact,
it may trigger quirks even in IE8! (If you do it right, or, if you wish, if
you do it "wrong". I can document it if you wish.) Therefore perhaps the need
to use the XML declaration should be deleted (= only allow UTF-8/UTF-16). The
more I think about it, the more I think we should forbid the XML declaration
and only allow UTF-8.

<q>If a polyglot document uses an encoding other than UTF-8 or UTF-16, it must
include the XML declaration; however, in this case the document must also
include the HTML meta tag specifying the character set. When a polyglot
document uses both the XML declaration and the HTML meta tag, these must
specify the same character and coding.</q>

COMMENT 1: See previous comment. Encodings other than UTF-8/UTF-16 should be
forbidden. However, that does not mean that we do not need to specify the use
of the meta charset element. Remember that HTML documents default to an 8-bit
encoding, most often Windows-1252.

COMMENT 2: You do not mention the better option: to send the encoding info as an
HTTP header. If one does that, then one may in fact skip the XML declaration
also for non-UTF-8 encodings.
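
For illustration, a consumer could read the encoding from such a header along
these lines. This is only a Python sketch using the standard library's
email.message module to parse the header value; the function name is
hypothetical, and the header values in the usage note are examples of my own.

```python
from email.message import Message


def charset_from_content_type(header_value: str):
    """Return the lower-cased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()
```

With the value "application/xhtml+xml; charset=ISO-8859-1" this returns
"iso-8859-1", and with a bare "text/html" it returns None, in which case the
document itself has to carry the declaration.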


Received on Sunday, 20 June 2010 21:35:41 UTC