[Bug 15359] Make BOM trump HTTP from bugzilla@jessica.w3.org on 2012-07-05 (public-html-bugzilla@w3.org from July 2012)

From: <bugzilla@jessica.w3.org>
Date: Thu, 05 Jul 2012 16:05:01 +0000
To: public-html-bugzilla@w3.org
Message-Id: <E1SmoY9-0005eW-1g@jessica.w3.org>
https://www.w3.org/Bugs/Public/show_bug.cgi?id=15359

--- Comment #9 from theimp@iinet.net.au 2012-07-05 16:05:00 UTC ---
Firstly, as I said, this bug covers only where there is a BOM, not where there
is neither a BOM nor an encoding declaration nor a header.

Secondly, the bug suggests ignoring headers when there is a BOM present, but
the XML spec. *specifically* says that "external character encoding
information" can be used to determine the encoding.

> 4.3.3 In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration

So if I have:

0xFE 0xFF <?xml encoding="ISO-8859-1"?>

It is a fatal error to decode it as UTF-16. Sure, this causes other problems,
but not necessarily fatal errors (at least in XML 1.0).
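
As a rough illustration of the two readings, here is a minimal Python sketch. The sample bytes are an assumption made up for this example (a version attribute is added so the declaration is well-formed); it only shows how the same bytes look under each decoding, not how any particular browser behaves.

```python
# Hypothetical sample: a UTF-16BE byte order mark followed by an ASCII
# XML declaration that names ISO-8859-1.
data = b"\xfe\xff" + b'<?xml version="1.0" encoding="ISO-8859-1"?><r/>'

# Read as ISO-8859-1: the two BOM bytes become ordinary characters
# (U+00FE and U+00FF) and the declaration stays readable.
as_latin1 = data.decode("iso-8859-1")

# Read as UTF-16 (the BOM selects big-endian): each pair of ASCII bytes
# is combined into one unrelated code unit, so the declaration is lost.
# errors="replace" is used because the sample has an odd trailing byte.
as_utf16 = data.decode("utf-16", errors="replace")

print(repr(as_latin1[:7]))
print("<?xml" in as_utf16)
```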

Your arguments over using the UI to configure the charset are not within the
original scope of this bug.

Thirdly, as I said, not all HTML will be XML.

Those alone are reason enough not to make the proposed behavior mandatory for
all user agents in all cases.

> Firefox is probably one of the XML parsers that _best_ reflects XML's encoding rules. So if you are in doubt, I suggest that you do some experimentes for yourself.

Make it a recommendation, I do not care. I am not saying that browsers must
allow changes; I'm saying they should not be constrained from allowing them. If
they think that there is no value in allowing users to change encodings at
will, I don't see that as being a problem. But it should not be part of the
spec. If you were going to advocate requiring total compliance with XML in all
circumstances, I would be sympathetic; but apparently that is not what is
desired for HTML5.

Furthermore:

> By the way: In that case it is an illegal character per HTML5 as well: A UTF-8 document with a BOM would bring the browser into Quirks-Mode if the browser reads the document as - for example - ISO-8859-1.

Yes, it would typically trigger Quirks mode (except in some, perhaps only
theoretical, encodings). That's not a fatal error though.

> Section '4.3.3 Character Encoding in Entities' is NORMATIVE.

Yes.

> Whereas the section you talk appear to be talking about, is Appendix F. [2] Appendix F only contains helpful instructions/tips for how to fulfill the requirements of section 4.3.3.

No, I only mention Appendix F once, to say that it is non-normative. Every part
of the spec. that I actually quoted is from Section 4.3.3.

The following:

> XML processors MUST be able to use this character [U+FEFF] to differentiate between UTF-8 and UTF-16 encoded documents.

and:

> In the absence of external character encoding information (such as MIME headers), parsed entities which are stored in an encoding other than UTF-8 or UTF-16 MUST begin with a text declaration [...] containing an encoding declaration

are both in Section 4.3.3.
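
For reference, the BOM requirement in the first quote amounts to something like the following minimal Python sketch. The function name `sniff_bom` is hypothetical, and the sketch only tells UTF-8 from UTF-16; it says nothing about whether the result should override a header or an encoding declaration.

```python
import codecs
from typing import Optional


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding family implied by a leading BOM, or None.

    Simplification: UTF-32 is not considered, although its little-endian
    BOM begins with the same two bytes as the UTF-16 one.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if data.startswith(codecs.BOM_UTF16_BE) or data.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16"
    return None


print(sniff_bom(b"\xef\xbb\xbf<a/>"))
print(sniff_bom(b"\xfe\xff\x00<\x00a"))
print(sniff_bom(b"<a/>"))
```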

It's really great that you are passionate about XML, but you must be careful
only to read what the spec. actually says.

Let's take the example of an ISO-8859-1 document with a UTF-16 BOM. In section 4.3.3:

> It is a fatal error when an XML processor encounters an entity with an encoding that it is unable to process.

Not a problem, the browser supports ISO-8859-1.

> It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains byte sequences that are not legal in that encoding.

Not a problem, both 0xFE and 0xFF are legal byte values in ISO-8859-1.
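
This is easy to check with a strict decode. A minimal Python sketch follows; the helper `is_legal` and the sample bytes are hypothetical, and Python's codecs stand in here for whatever decoder an XML processor would actually use.

```python
def is_legal(data: bytes, encoding: str) -> bool:
    """Report whether every byte sequence in data is valid in encoding."""
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


sample = b"\xfe\xff<r/>"  # hypothetical entity starting with 0xFE 0xFF

# Every byte value is assigned in ISO-8859-1, so this cannot fail.
print(is_legal(sample, "iso-8859-1"))

# 0xFE and 0xFF never occur in well-formed UTF-8.
print(is_legal(sample, "utf-8"))
```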

> Specifically, it is a fatal error if an entity encoded in UTF-8 contains any ill-formed code unit sequences, as defined in section 3.9 of Unicode.

Not relevant.

> Unless an encoding is determined by a higher-level protocol, it is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16.

Not a problem, "higher-level protocol" could include "user configuration",
since this term is not defined anywhere. Even if this is not the case, there is
still the scenario where there is both a BOM and an encoding declaration - it
does not say that the BOM should trump the encoding declaration. Doing so 

Anything else is at worst an error, not a fatal error.

Received on Thursday, 5 July 2012 16:05:09 UTC