
[Bug 12897] UTF-8 BOM should trump users and/or HTTP (Encoding sniffing algorithm)

From: <bugzilla@jessica.w3.org>
Date: Tue, 07 Jun 2011 11:53:07 +0000
To: public-html-bugzilla@w3.org
Message-Id: <E1QTuqJ-0001tv-VK@jessica.w3.org>
http://www.w3.org/Bugs/Public/show_bug.cgi?id=12897

--- Comment #8 from Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no> 2011-06-07 11:53:05 UTC ---
(In reply to comment #7)

> I believe you are misreading the XML 1.0 spec. It says that in the HTTP case,
> RFC 3023 applies but for anyone specifying a new case, they recommend giving
> XML itself precedence. However, since the RFC applies in the HTTP case, in the
> HTTP case, the charset parameter on the HTTP level is authoritative.

(1) It is already great if we agree about the interpretation whenever HTTP
is *not* used!

(2) In that regard, HTML5 tends to talk about "the higher protocol" and not
specifically about HTTP.

(3) It is in the power of the HTML5 spec to specify how XHTML5 and HTML5
documents should be interpreted. Because:

  a) the HTML5 effort (including "sister projects") appears to be
redefining/refining the HTTP specs as well as HTML itself.

  b) XML 1.0 defers it: "the preferred method of handling conflict should be
specified as part of the higher-level protocol used to deliver XML"

  c) XML 1.0 defines a recommended rule (which it probably would like to see in
HTTP as well): "If an XML entity is in a file, the Byte-Order Mark and encoding
declaration are used (if present) to determine the character encoding."
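
To make the rule in c) concrete when it is extended to the HTTP case, here is
a minimal sketch in Python. It is only an illustration of "BOM trumps HTTP",
not text from any spec: the function name sniff_encoding, its parameters, and
the windows-1252 fallback are all assumptions, and UTF-32 BOMs and in-document
encoding declarations are deliberately left out.

```python
import codecs

# Hypothetical table for this sketch: BOM byte sequences and the encoding
# each one signals. UTF-32 is intentionally not handled here.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_encoding(data, http_charset=None, fallback="windows-1252"):
    """Return an encoding name, letting a BOM win over the HTTP charset.

    The fallback value is an assumption made for this sketch.
    """
    for bom, name in BOMS:
        if data.startswith(bom):
            return name          # the BOM is present, so it decides
    # No BOM: use what the higher protocol said, else the fallback.
    return http_charset or fallback


# A UTF-8 BOM overrides a conflicting HTTP charset parameter.
print(sniff_encoding(codecs.BOM_UTF8 + b"<!DOCTYPE html>", "iso-8859-1"))
# Without a BOM, the HTTP charset parameter is used.
print(sniff_encoding(b"<!DOCTYPE html>", "iso-8859-1"))
```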

But apart from what XML says, we must also look at interoperability - and the
effects of Opera and Mozilla's reading of the specifications.

  I) In Mozilla's bugzilla there are several reports about how to handle the
BOM gibberish letters whenever the BOM is ignored in favor of an external
protocol.

  II) Opera has implemented a very strange behaviour where it sometimes eats
the BOM gibberish, so that the page does not go into quirks mode, whereas
sometimes it does not eat the BOM gibberish, leading to quirks mode. See my
tests: http://malform.no/testing/html5/bom/

   Et cetera: Yellow Screen of Death, IE/Webkit, wrong resulting encoding. 
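
The "BOM gibberish" mentioned in I) and II) can be reproduced with a few lines
of Python. This is only a sketch: the sample page bytes are made up, and
windows-1252 stands in for whatever encoding the external protocol declared.

```python
import codecs

# Made-up sample page: a UTF-8 BOM followed by a doctype.
page = codecs.BOM_UTF8 + b"<!DOCTYPE html><title>t</title>"

# BOM ignored in favour of an external windows-1252 label: the three BOM
# bytes become three visible characters in front of the doctype.
as_cp1252 = page.decode("windows-1252")

# BOM honoured: Python's "utf-8-sig" codec consumes the BOM, so the
# doctype is the first thing in the decoded text.
as_utf8 = page.decode("utf-8-sig")

print(repr(as_cp1252[:3]))
print(as_utf8.startswith("<!DOCTYPE"))
```

In the first case the doctype is no longer at the start of the decoded text,
which is how a page can end up in quirks mode.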

   I don't know if I misread Julian, but I'll also quote a message to Adam in
2009: [*]

]]
   > The algorithm tolerates leading white space, but not leading BOMs.

   Is there a particular reason why the BOM is not tolerated, given 
   <http://www.w3.org/TR/REC-xml/#sec-guessing>?
               [ snipping in Julian's message ]
   Let's ignore "correctly" for a second -- [ snipping ]
]]

[*] http://lists.w3.org/Archives/Public/public-html/2009Nov/0579

Received on Tuesday, 7 June 2011 11:53:09 UTC
