[Bug 12897] UTF-8 BOM should trump users and/or HTTP (Encoding sniffing algorithm)

http://www.w3.org/Bugs/Public/show_bug.cgi?id=12897

--- Comment #10 from Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no> 2011-06-09 11:27:00 UTC ---
More data collected - after discussion on www-international@ and implementation
tests:

NOTE: Data is still needed for IE9's XML parser. Assumption: it behaves like
Webkit (because that is how it acts for HTML).

Spec data - XML:
---

* XML 1.0 only says that Content-Type *can* have priority (depending on what
the higher-level protocol says) over '<?xml version="1.0" encoding="value"?>'.
Quote: ]] In the absence of information provided by an external transport
protocol (e.g. HTTP or MIME), it is a fatal error[[
<http://www.w3.org/TR/xml/#charencoding> Thus it depends on the rules of the
higher-level protocol.
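
To illustrate what the encoding declaration contributes, here is a minimal
Python sketch that pulls the encoding name out of an XML declaration. The
helper name declared_encoding and the regular expression are assumptions made
for this example, not code from any of the parsers tested, and the sketch only
handles ASCII-compatible byte streams:

```python
import re

# Simplified pattern for '<?xml version="..." encoding="..."?>'.
# Assumption: the bytes are ASCII-compatible (so not UTF-16 or UTF-32).
_XML_DECL = re.compile(
    rb"^<\?xml\s+version\s*=\s*[\"'][^\"']*[\"']"
    rb"(?:\s+encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"'])?"
)

UTF8_BOM = b"\xef\xbb\xbf"


def declared_encoding(data: bytes):
    """Return the encoding named in the XML declaration, or None."""
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]  # look past a UTF-8 BOM, if present
    match = _XML_DECL.match(data)
    if match and match.group(1):
        return match.group(1).decode("ascii")
    return None
```

Whether the returned name is then honoured or overridden is exactly what the
higher-level protocol gets to decide.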

Spec data - RFC3023:
---

1) RFC3023 'XML Media Types' specifies that the HTTP charset parameter does
have priority. (Meaning that the XML parser must - legally - ignore the XML
encoding declaration.)

2) But RFC3023 actually only justifies it for 'text/xml', where *transcoding*
(leaving the doc in another encoding than the one specified inside the
document) and *compatibility with text/plain* are the justifications:
<http://tools.ietf.org/html/rfc3023#section-3.1>

3) For 'application/xml', RFC3023 has no real justification. The only things
it offers are: "it is possible for users to configure web servers" and "the
HTTP spec says so". <http://tools.ietf.org/html/rfc3023#section-3.2>

4) Notably, RFC3023 seriously discusses Appendix F of XML 1.0, "Autodetection
of Character Encodings (Non-Normative)"
(http://www.w3.org/TR/xml/#sec-guessing), which (once again), under the heading
"Priorities in the Presence of External Encoding Information", states:
]] In the interests of interoperability, however, the following rule is
recommended. If an XML entity is in a file, the Byte-Order Mark and encoding
declaration are used (if present) to determine the character encoding. [[
<http://www.w3.org/TR/xml/#sec-guessing-with-ext-info>
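
The BOM half of that rule can be sketched in a few lines of Python. This is
only an illustration of BOM detection, not the full Appendix F procedure (which
also inspects the first bytes of documents without a BOM). The function name
sniff_bom and the encoding labels it returns are assumptions of this sketch:

```python
import codecs

# UTF-32 BOMs are tested first because the UTF-32LE BOM (FF FE 00 00)
# begins with the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```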

Implementation data - RFC3023:
---

* Parsers implementing RFC3023 (HTTP has priority over document data): Opera,
Firefox, Amaya

** Parsers implementing RFC3023 and which *also* emit 'fatal error' if HTTP
charset and UTF-8 BOM disagree: Opera, Firefox. (Thus: not Amaya.) Note: per
XML 1.0 it is required, *if HTTP and RFC3023 require it (and they do!)*, to
ignore the XML encoding declaration in favour of the HTTP charset parameter.
But note that it is not permitted, per XML 1.0, to act as if the BOM does not
exist, even if the doc is served via HTTP!

* Parsers *not* implementing RFC3023 (thus giving priority to document data
instead), and which do not emit fatal errors: Webkit, Xerces C++, XMLmind XML
Editor on Mac (based on Xerces Java), RXP, oXygen

** Parsers *not* implementing RFC3023 and which, in case of conflict and
without emitting fatal error, adhere to the BOM and ignore the XML encoding
declaration: Webkit, (IE9 must be checked)

** Parsers *not* implementing RFC3023 and which, in case of conflict and
without emitting fatal error, adhere to the XML encoding declaration and ignore
the BOM: XMLmind XML Editor on Mac, Xerces C++, oXygen, RXP
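
The behaviours listed above can be modelled as precedence policies. The
following Python sketch is only a model of the observations, not code from any
of the parsers. The policy names, the FatalEncodingError class and the
choose_encoding function are all hypothetical, and the plain "http" policy
(HTTP wins, no fatal error) is an assumption about the Amaya-style case:

```python
class FatalEncodingError(Exception):
    """Stand-in for an XML 'fatal error'."""


def choose_encoding(policy, http_charset, has_utf8_bom, declared):
    """Pick an encoding under one of four hypothetical policies.

    policy       -- "http-strict", "http", "bom" or "declaration"
    http_charset -- charset parameter from Content-Type, or None
    has_utf8_bom -- True if the document starts with a UTF-8 BOM
    declared     -- encoding from the XML declaration, or None
    """
    if policy in ("http-strict", "http") and http_charset:
        conflict = has_utf8_bom and http_charset.lower() != "utf-8"
        if policy == "http-strict" and conflict:
            # HTTP has priority, but a disagreeing BOM is not ignored.
            raise FatalEncodingError("HTTP charset and UTF-8 BOM disagree")
        return http_charset
    if policy == "bom" and has_utf8_bom:
        return "UTF-8"
    if policy == "declaration" and declared:
        return declared
    # Fallbacks when the preferred source is absent (assumption of the sketch).
    if has_utf8_bom:
        return "UTF-8"
    return declared or "UTF-8"
```

For a document with a UTF-8 BOM, a conflicting declaration and a conflicting
HTTP charset, each policy gives a different answer, which is the interop
problem in a nutshell.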


Implementation data - non-RFC3023 (file protocol):
---

* Parsers emitting fatal error if UTF-8 BOM conflicts with the XML encoding
declaration: Opera.

* Parsers *not* emitting fatal error if UTF-8 BOM conflicts with the XML
encoding declaration: Webkit, Firefox, oXygen, XMLmind XML Editor on Mac
(based on Xerces Java), Amaya

** Parsers *not* emitting fatal error if UTF-8 BOM conflicts with the XML
encoding declaration and which give priority to the UTF-8 BOM: Webkit, Firefox,
oXygen

** Parsers *not* emitting fatal error if UTF-8 BOM conflicts with the XML
encoding declaration and which give priority to the XML encoding declaration
(and/or to the UTF-8 encoding default, if they completely skip over the UTF-8
BOM): XMLmind XML Editor, RXP and (probably) Xerces C++
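
Why the priority matters is easy to show with a small test document. The Python
sketch below builds a file that has a UTF-8 BOM but declares ISO-8859-1, then
decodes it both ways. The document content and variable names are made up for
this illustration and are not the files used in the tests above:

```python
import codecs

# A document that is really UTF-8 (with BOM) but claims to be ISO-8859-1.
# "\u00e9" is the character e-acute.
text = '<?xml version="1.0" encoding="ISO-8859-1"?><p>\u00e9</p>'
doc = codecs.BOM_UTF8 + text.encode("utf-8")

body = doc[len(codecs.BOM_UTF8):]

# A parser that trusts the BOM recovers the intended character.
as_bom = body.decode("utf-8")

# A parser that trusts the declaration reads the two UTF-8 bytes
# C3 A9 as two separate ISO-8859-1 characters.
as_declared = body.decode("iso-8859-1")
```

Here as_bom contains the intended character, while as_declared contains two
characters (U+00C3 followed by U+00A9) in its place.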


Implementation data - charset names:
---

* Webkit and some of the editors emit 'fatal error' if the charset *name* in
the XML encoding declaration is *unknown*. This happens even if they (for
example Webkit) *otherwise* do not emit a fatal error whenever the UTF-8 BOM
conflicts with the XML encoding declaration.
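
The "unknown name" check is separate from the conflict check, as this Python
sketch shows. Python's codec registry stands in for a parser's list of
supported encodings, which is an assumption of the sketch: each parser has its
own list. The function name is_known_charset is hypothetical:

```python
import codecs


def is_known_charset(name: str) -> bool:
    """Return True if Python's codec registry recognises the charset name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True
```

A parser can thus reject an unrecognised name outright while still tolerating a
recognised name that merely conflicts with the BOM.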

Received on Thursday, 9 June 2011 11:27:03 UTC