[Bug 15359] Make BOM trump HTTP from bugzilla@jessica.w3.org on 2012-07-05 (public-html-bugzilla@w3.org from July 2012)

From: <bugzilla@jessica.w3.org>
Date: Thu, 05 Jul 2012 07:44:28 +0000
To: public-html-bugzilla@w3.org
Message-Id: <E1Smgjk-0003sJ-Gc@jessica.w3.org>
https://www.w3.org/Bugs/Public/show_bug.cgi?id=15359

--- Comment #6 from theimp@iinet.net.au 2012-07-05 07:44:28 UTC ---
The charset determination rules for XML are non-normative, except for the case
you mention, where there is no BOM and no (XML) declaration and no higher-level
specifier (such as an HTTP header). This bug does not discuss this scenario
directly.

Even so, it is perfectly acceptable for a valid XML processor to detect a BOM,
ignore it, and pick any encoding it likes, because technically it is only
required to use the BOM for the specific case of picking between UTF-8 and
UTF-16, not between one of those and anything else. I could detect a UTF-16
BOM and nevertheless decide to render the document in any encoding I want
*except* UTF-8, and likewise the reverse, and it would be fully compliant:

> XML processors MUST be able to use this character [U+FEFF] to differentiate between UTF-8 and UTF-16 encoded documents.

Also:

> In the absence of external character encoding information (such as MIME headers), parsed entities which are stored in an encoding other than UTF-8 or UTF-16 MUST begin with a text declaration [...] containing an encoding declaration

But there is no elaboration as to what "external character encoding
information" actually means, and it is not clear that "specific instruction
from the user" could not qualify. Think of command-line parameters for batch
parsers, etc.

So, a document in, really, any encoding, would not automatically be invalid XML
in the specific case of a user who said "use this encoding" (again, any
encoding).
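
As a rough illustration of how a user instruction could act as "external
character encoding information", here is a minimal Python sketch. The function
name `choose_encoding`, the source labels and the default order are all
hypothetical, chosen only for illustration; the XML spec does not define such a
precedence list, and the order is deliberately a parameter because the relative
rank of the BOM and HTTP is exactly what this bug disputes.

```python
# Hypothetical precedence order, for illustration only; not taken from any spec.
DEFAULT_ORDER = ("user", "http", "bom", "declaration")


def choose_encoding(sources, order=DEFAULT_ORDER, fallback="utf-8"):
    """Return the first encoding label found when walking `order`.

    `sources` maps a source name (e.g. "user", "http") to an encoding
    label, or omits the name when that source said nothing.
    """
    for name in order:
        value = sources.get(name)
        if value:
            return value.lower()
    # No source supplied an encoding; fall back to the assumed default.
    return fallback


# A user override (say, a command-line flag) wins under the default order.
print(choose_encoding({"user": "Shift_JIS", "bom": "utf-8"}))
# A different policy can be expressed by passing another order.
print(choose_encoding({"http": "iso-8859-1", "bom": "utf-8"}, order=("bom", "http")))
```

Passing a different `order` is the whole point of the sketch: the same inputs
give a different answer depending on the policy, and none of these policies is
ruled out by the text quoted above.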

Even if that is not the case, that does not change the fact that a processor
processing a document with, say, a UTF-8 "BOM" *and* a text declaration
specifying, say, ISO-8859-1 encoding, has no requirement to obey the BOM over
the text declaration. A BOM is required if the document is UTF-16; but other
encodings are not forbidden from having that (or any other) BOM, nor does XML
require that the BOM be considered authoritative if present (in fact, it
explicitly only recommends it).
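
To make the conflicting case concrete, here is a minimal Python sketch that
reports what the BOM and the text declaration each claim for the same byte
stream. The helpers `sniff_bom` and `sniff_declaration` are hypothetical names,
not part of any XML library, and the sketch assumes an ASCII-compatible
declaration; it only shows that the two signals can disagree, not which one a
processor should prefer.

```python
import codecs
import re

# Matches an encoding pseudo-attribute inside a leading XML/text declaration.
_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_bom(data):
    """Return the encoding family implied by a leading BOM, or None."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return None


def sniff_declaration(data):
    """Return the declared encoding label (lowercased), or None."""
    # Skip a UTF-8 BOM so the declaration can be matched at the start.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    match = _DECL.match(data)
    return match.group(1).decode("ascii").lower() if match else None


# A UTF-8 "BOM" followed by a declaration that names ISO-8859-1.
doc = codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="ISO-8859-1"?><r/>'
print(sniff_bom(doc), sniff_declaration(doc))
```

For the sample document the two helpers return different answers, which is the
situation described above: both signals are present, and the processor has to
pick one.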

I believe that this will typically mean that character data will appear at the
start of the document, but this is only an error, not a fatal error (and for
XML 1.0, maybe not necessarily even that; I don't really remember and will have
to check). In fact, I think that there is wiggle room even on this
point.

If this character data is then interpreted as such and emitted into the HTML,
then this would of course then be an error in HTML, but that is for the HTML
spec. to deal with, which it does: the spec. currently says that what would be
an initial BOM should be ignored even if it is unrelated to the encoding.

Furthermore, not all representations of HTML5 will be XML-compatible anyway. I
very much agree with the goal of aligning HTML5 with XML; but vendors should be
left to interpret this however they want if they want more robust XML
processing over legacy support; it should not be specified here.

Received on Thursday, 5 July 2012 07:44:32 UTC