[Bug 12897] New: UTF-8 BOM should trump users and/or HTTP (Encoding sniffing algorithm) from bugzilla@jessica.w3.org on 2011-06-06 (public-html-bugzilla@w3.org from June 2011)

From: <bugzilla@jessica.w3.org>
Date: Mon, 06 Jun 2011 20:01:59 +0000
To: public-html-bugzilla@w3.org
Message-ID: <bug-12897-2486@http.www.w3.org/Bugs/Public/>
http://www.w3.org/Bugs/Public/show_bug.cgi?id=12897

           Summary: UTF-8 BOM should trump users and/or HTTP (Encoding
                    sniffing algorithm)
           Product: HTML WG
           Version: unspecified
          Platform: PC
               URL: http://dev.w3.org/html5/spec/parsing#encoding-sniffing-algorithm
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P3
         Component: HTML5 spec (editor: Ian Hickson)
        AssignedTo: ian@hixie.ch
        ReportedBy: xn--mlform-iua@xn--mlform-iua.no
         QAContact: public-html-bugzilla@w3.org
                CC: mike@w3.org, public-html-wg-issue-tracking@w3.org,
                    public-html@w3.org


PROPOSAL: 
   Specify IE's and WebKit's handling of the Byte Order Mark for the UTF-8
encoding as REQUIRED: whenever the document begins with the UTF-8 Byte Order
Mark, ignore the encoding information in the HTTP "Content-Type: text/html;
charset=[encodingname]" header, and likewise ignore any user action to override
the document's encoding.
   Consequently:
      * when there is a UTF-8 BOM, the encoding information provided by HTTP and
by the user should be treated as irrelevant
      * the first two steps of the encoding sniffing algorithm must be changed

CURRENT STATUS: 
   The first two steps of the encoding sniffing algorithm give users and the
transport layer (HTTP/MIME) the power to override a document's character
encoding:

   ]] 1. If the user has explicitly instructed the user agent to override the
document's character encoding with a specific encoding, optionally return that
encoding with the confidence certain and abort these steps.
      2. If the transport layer specifies an encoding, and it is supported,
return that encoding with the confidence certain, and abort these steps.[[
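
   As a rough illustration, the two quoted steps could be sketched in Python as
below. The names `is_supported` and `sniff_current`, their parameters, and the
use of `codecs.lookup` as the "is supported" test are assumptions made for this
sketch, not spec text; the later steps of the algorithm are not modelled.

```python
import codecs
from typing import Optional, Tuple


def is_supported(label: Optional[str]) -> bool:
    """Assumption: an encoding counts as 'supported' if Python has a codec for it."""
    if not label:
        return False
    try:
        codecs.lookup(label)
        return True
    except LookupError:
        return False


def sniff_current(user_override: Optional[str],
                  http_charset: Optional[str]) -> Optional[Tuple[str, str]]:
    """Steps 1 and 2 as quoted above; the remaining steps are omitted."""
    # Step 1: an explicit user override wins (the spec makes this optional).
    if user_override:
        return user_override, "certain"
    # Step 2: otherwise a supported transport-layer (HTTP) charset wins.
    if is_supported(http_charset):
        return http_charset, "certain"
    return None  # fall through to the steps not modelled here
```

   Note that neither step looks at the document's bytes, so in this order a
leading UTF-8 BOM cannot influence the result.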

   HOWEVER, in reality two major user agents operate with an exception to
the above rules: whenever the document includes the UTF-8 Byte Order Mark,
Internet Explorer and WebKit
    - do *not* allow users to override the encoding
    - do *not* respect the encoding information in the HTTP server's
Content-Type header
    - do *not* permit their heuristic character detection features to guess any
encoding other than UTF-8
   Consequently, in IE and WebKit it is impossible for the user, as well as
for an HTTP server, to cause a document with the UTF-8 Byte Order Mark to be
interpreted as e.g. KOI8-R encoded or Windows-1252 encoded.
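
   The IE/WebKit behavior described above amounts to checking for the BOM
before anything else. A minimal sketch, assuming a hypothetical function
`sniff_bom_first` (the name, parameters and return shape are illustrative only,
and the "is supported" check on the HTTP charset is left out for brevity):

```python
import codecs
from typing import Optional, Tuple

UTF8_BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def sniff_bom_first(data: bytes,
                    user_override: Optional[str] = None,
                    http_charset: Optional[str] = None) -> Optional[Tuple[str, str]]:
    # A leading UTF-8 BOM settles the encoding before anything else is consulted.
    if data.startswith(UTF8_BOM):
        return "utf-8", "certain"
    # Without a BOM, fall back to user override first, then the HTTP charset.
    if user_override:
        return user_override, "certain"
    if http_charset:
        return http_charset, "certain"
    return None  # remaining sniffing steps are not modelled here
```

   With this ordering, `sniff_bom_first(b"\xef\xbb\xbf<!DOCTYPE html>",
"koi8-r", "windows-1252")` yields UTF-8 regardless of the other two inputs.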

   In contrast, Firefox and Opera
    - *do* obey the HTTP server's Content-Type header even when there is a UTF-8
BOM
    - *do* allow users to override the encoding even when there is a UTF-8 BOM
    - *do* permit their heuristic character detection features to guess an
encoding other than UTF-8 (Opera and Firefox allow their users to tune/fiddle
with how their heuristic encoding sniffing works.)
   Consequently, in Firefox and Opera it is *possible* for the user, as well
as for an HTTP server, to cause a document with the UTF-8 Byte Order Mark to
be interpreted as e.g. KOI8-R encoded or Windows-1252 encoded.
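
   To make the effect concrete, the following sketch decodes the same
BOM-prefixed bytes in three ways. The sample text "blåbær" is an arbitrary
choice for illustration, and Python's codecs stand in for a browser's decoders,
so this only approximates what a user would see.

```python
import codecs

# A small UTF-8 document with a BOM; the sample text is an arbitrary example.
text = "blåbær"
data = codecs.BOM_UTF8 + text.encode("utf-8")

as_utf8 = data.decode("utf-8-sig")        # BOM recognised and stripped
as_cp1252 = data.decode("windows-1252")   # BOM bytes surface as "ï»¿"
as_koi8r = data.decode("koi8-r")          # bytes reinterpreted as KOI8-R characters

print(as_utf8)
print(as_cp1252)
print(as_koi8r)
```

   Only the first decoding round-trips to the original text; the other two
produce the kind of output a forced legacy encoding would give.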

BENEFITS:
    A. Harmonization with XML 1.0 Appendix F.2, "Priorities in the Presence of
External Encoding Information", which recommends that the BOM have higher
priority than external encoding information:
http://www.w3.org/TR/xml/#sec-guessing-with-ext-info  (Opera/Firefox do not yet
implement this XML 1.0 recommendation)
    B. Simplicity: a simple, reliable way to specify the UTF-8 encoding
    C. Interoperability: Firefox/Opera converge with IE/WebKit, so browsers
become more interoperable
    D. Security: chameleon documents (where the document gets another, risky
interpretation when read as a legacy encoding) become more difficult to create
    E. User experience: less "gibberish" and less "mojibake" for users
[*] http://en.wikipedia.org/wiki/Mojibake
    F. Same as A: promotes a polyglot way to specify the encoding, since the BOM
works in both HTML and XML. (The Polyglot spec already says that the UTF-8 BOM
is the most polyglot encoding method.)

Other justifications:
   - Opera Software: "We have introduced the BOM as requirement for each source
file when we have written the build tools as a simple way to verify that all
files are utf8 encoded".
(http://stackoverflow.com/questions/4658985/how-to-keep-the-bom-when-editing-files-in-espresso)

NOTES: 

   (1) Browsers tested as part of this bug report: IE8, Safari, Chrome (which
show the behavior described above) as well as Opera and Firefox (which do not
show this behavior). Other browsers, e.g. KHTML, have not been tested.
   (2) BOM in UTF-16: I have not looked into how the BOM in UTF-16 is handled by
parsers.
   (3) For the record: all browsers, including Firefox and Opera, *do* already
ignore the META charset *element* whenever there is a UTF-8 BOM. This bug report
says that they should *also* ignore HTTP.

Received on Monday, 6 June 2011 20:02:01 UTC