- From: <bugzilla@jessica.w3.org>
- Date: Thu, 05 Jul 2012 07:44:28 +0000
- To: public-html-bugzilla@w3.org
https://www.w3.org/Bugs/Public/show_bug.cgi?id=15359

--- Comment #6 from theimp@iinet.net.au 2012-07-05 07:44:28 UTC ---

The charset determination rules for XML are non-normative, except for the case you mention, where there is no BOM, no (XML) declaration, and no higher-level specifier (such as an HTTP header). This bug does not discuss this scenario directly.

Even so, it is perfectly acceptable for a valid XML processor to detect a BOM, ignore it, and pick any encoding it likes, because technically it is only required to use the BOM for the specific case of choosing between UTF-8 and UTF-16, not between one of those and anything else. I could detect a UTF-16 BOM and nevertheless decide to render the document in any encoding I want *except* UTF-8, and likewise the reverse, and it would be fully compliant:

> XML processors MUST be able to use this character [U+FEFF] to differentiate between UTF-8 and UTF-16 encoded documents.

Also:

> In the absence of external character encoding information (such as MIME headers), parsed entities which are stored in an encoding other than UTF-8 or UTF-16 MUST begin with a text declaration [...] containing an encoding declaration

But there is no special elaboration as to what "external character encoding information" actually means, and it is not clear that "specific instruction from the user" could not qualify. Think of command-line parameters for batch parsers, etc. So a document in really any encoding would not automatically be invalid XML in the specific case of a user who said "use this encoding" (again, any encoding).

Even if that is not the case, that does not change the fact that a processor processing a document with, say, a UTF-8 "BOM" *and* a text declaration specifying, say, ISO-8859-1 encoding, has no requirement to obey the BOM over the text declaration.
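
To make the conflict concrete, here is a rough Python sketch of a processor that collects the three possible sources of encoding information (external instruction, BOM, text declaration) without ranking them. The helper names `sniff_bom`, `declared_encoding` and `candidate_encodings` are hypothetical, made up for this illustration; nothing here is taken from the XML or HTML specs, and it only handles declarations written in an ASCII-compatible encoding.

```python
import codecs
import re

# Hypothetical helpers for illustration only; not an API from any spec or library.
# UTF-32 BOMs are deliberately left out to keep the sketch short.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Matches the encoding pseudo-attribute of an ASCII-compatible declaration.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def declared_encoding(data: bytes):
    """Return the encoding named in the XML/text declaration, if readable."""
    _, skip = sniff_bom(data)
    match = _DECL.match(data[skip:])
    return match.group(1).decode("ascii").lower() if match else None


def candidate_encodings(data: bytes, external=None):
    """Collect every hint; which one wins is left to the caller (the vendor)."""
    return {
        "external": external,  # e.g. an HTTP header or a command-line option
        "bom": sniff_bom(data)[0],
        "declaration": declared_encoding(data),
    }


# A UTF-8 "BOM" followed by a declaration that says ISO-8859-1.
doc = codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'
hints = candidate_encodings(doc)
```

For the sample document, `hints` carries both "utf-8" (from the BOM) and "iso-8859-1" (from the declaration), and an `external` value passed by the user would simply be a third entry. The sketch stops there on purpose: choosing between them is exactly the decision under discussion.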
A BOM is required if the document is UTF-16; but other encodings are not forbidden from having that (or any other) BOM, nor does XML require that the BOM be considered authoritative if present (in fact, it explicitly only recommends it).

I believe that ignoring the BOM will typically mean that its bytes appear as character data at the start of the document, but this is only an error, not a fatal error (and for XML 1.0, maybe not necessarily even that; I don't really remember and will have to check). In fact, I think there is wiggle room even on this point.

If this character data is then interpreted as such and emitted into the HTML, this would of course be an error in HTML, but that is for the HTML spec. to deal with, which it does: the spec. currently says that what would be an initial BOM should be ignored even if it is unrelated to the encoding.

Furthermore, not all representations of HTML5 will be XML-compatible anyway.

I very much agree with the goal of aligning HTML5 with XML; but vendors should be left to interpret this however they want if they prefer more robust XML processing over legacy support; it should not be specified here.

--
Configure bugmail: https://www.w3.org/Bugs/Public/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the QA contact for the bug.
Received on Thursday, 5 July 2012 07:44:32 UTC