[Bug 12897] New: UTF-8 BOM should trump users and/or HTTP (Encoding sniffing algorithm)

http://www.w3.org/Bugs/Public/show_bug.cgi?id=12897

           Summary: UTF-8 BOM should trump users and/or HTTP (Encoding
                    sniffing algorithm)
           Product: HTML WG
           Version: unspecified
          Platform: PC
               URL: http://dev.w3.org/html5/spec/parsing#encoding-sniffing
                    -algorithm
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P3
         Component: HTML5 spec (editor: Ian Hickson)
        AssignedTo: ian@hixie.ch
        ReportedBy: xn--mlform-iua@xn--mlform-iua.no
         QAContact: public-html-bugzilla@w3.org
                CC: mike@w3.org, public-html-wg-issue-tracking@w3.org,
                    public-html@w3.org


PROPOSAL: 
   Specify IE's and Webkit's handling of the Byte Order Mark for the UTF-8
encoding as REQUIRED: whenever the document begins with the UTF-8 Byte Order
Mark, ignore the encoding information in the HTTP "Content-Type: text/html;
charset=[encodingname]" header, and likewise ignore any user action to
override the document's encoding.
   Consequently,
      * when there is a UTF-8 BOM, the encoding information provided by HTTP
and by the user should be treated as irrelevant
      * the first two steps of the encoding sniffing algorithm must be changed
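
   As an illustration only, the proposed ordering can be sketched in Python.
The function and parameter names (sniff_encoding, user_override, http_charset,
is_supported) are hypothetical and do not come from the spec; the later steps
of the algorithm (meta prescan, heuristics, defaults) are deliberately left
out.

```python
import codecs
from typing import Optional, Tuple

UTF8_BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def is_supported(label: str) -> bool:
    """Assumption: 'supported' means Python knows a codec by this name."""
    try:
        codecs.lookup(label)
        return True
    except LookupError:
        return False


def sniff_encoding(
    head: bytes,
    user_override: Optional[str] = None,
    http_charset: Optional[str] = None,
) -> Tuple[Optional[str], str]:
    """Return (encoding, confidence) for the first bytes of a document."""
    # Proposed new first step: a UTF-8 BOM trumps both the user and HTTP.
    if head.startswith(UTF8_BOM):
        return "utf-8", "certain"
    # Current step 1: explicit user override.
    if user_override is not None:
        return user_override, "certain"
    # Current step 2: encoding given by the transport layer, if supported.
    if http_charset is not None and is_supported(http_charset):
        return http_charset, "certain"
    # Remaining steps of the algorithm are not modelled in this sketch.
    return None, "tentative"
```

   With this ordering, sniff_encoding(b"\xef\xbb\xbf<html>",
user_override="koi8-r", http_charset="windows-1252") yields UTF-8, while the
same call without the BOM bytes falls through to the user override.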

CURRENT STATUS: 
   The first two steps of the encoding sniffing algorithm give users and the
transport layer (HTTP/MIME) the power to override a document's character
encoding:

   ]] 1. If the user has explicitly instructed the user agent to override the
document's character encoding with a specific encoding, optionally return that
encoding with the confidence certain and abort these steps.
      2. If the transport layer specifies an encoding, and it is supported,
return that encoding with the confidence certain, and abort these steps.[[

   HOWEVER, in reality two major user agents operate with an exception to
the above rules: whenever the document includes the UTF-8 Byte Order Mark,
Internet Explorer and Webkit
    - do *not* allow users to override the encoding
    - do *not* respect the encoding information in the HTTP server's
Content-Type header
    - do *not* permit their heuristic character detection features to guess
any encoding other than UTF-8
   Consequently, in IE and Webkit it is impossible for the user, as well as
for an HTTP server, to cause a document with the UTF-8 Byte Order Mark to be
interpreted as e.g. KOI8-R encoded or Windows-1252 encoded.

   In contrast, Firefox and Opera
    - *do* obey the HTTP server's Content-Type header even when there is a
UTF-8 BOM
    - *do* allow users to override the encoding even when there is a UTF-8 BOM
    - *do* permit their heuristic character detection features to guess an
encoding other than UTF-8 (Opera and Firefox allow their users to tune/fiddle
with how their heuristic encoding sniffing works)
   Consequently, in Firefox and Opera it is *possible* for the user, as well
as for an HTTP server, to cause a document with the UTF-8 Byte Order Mark to
be interpreted as e.g. KOI8-R encoded or Windows-1252 encoded.
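
   To make the effect concrete, here is a small Python sketch of what such a
misinterpretation does to the text. The sample word "blåbær" is an arbitrary
choice for illustration and is not taken from any tested page.

```python
import codecs

# A UTF-8 document that starts with the UTF-8 BOM (bytes EF BB BF).
doc = codecs.BOM_UTF8 + "blåbær".encode("utf-8")

# Honouring the BOM: decode as UTF-8 and strip the BOM.
as_utf8 = doc.decode("utf-8-sig")

# Letting HTTP or the user override the BOM with a legacy encoding.
as_cp1252 = doc.decode("cp1252")   # Windows-1252
as_koi8r = doc.decode("koi8_r")    # KOI8-R

print(as_utf8)    # blåbær
print(as_cp1252)  # ï»¿blÃ¥bÃ¦r
print(as_koi8r)   # a different, equally unreadable string
```

   In both legacy interpretations the BOM itself turns into three visible
junk characters, and every non-ASCII letter becomes two wrong characters.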

BENEFITS:
    A. Harmonization with XML 1.0  Appendix F.2, "Priorities in the Presence of
External Encoding Information", which recommends BOM to have higher priority
than external encoding information:
http://www.w3.org/TR/xml/#sec-guessing-with-ext-info  (Opera/Firefox do not yet
implement this XML 1.0 recommendation)
    B. A simple, reliable way to specify the UTF-8 encoding
    C. Firefox/Opera converge with IE/Webkit = browsers become more
interoperable
    D. Security: chameleon documents (where the document gets another, risky
interpretation when read as a legacy encoding) become more difficult to create
    E. User experience: less "gibberish" and less "mojibake" for users
[*] http://en.wikipedia.org/wiki/Mojibake
    F. Related to A: promotes a polyglot way to specify the encoding, since
the BOM works in both HTML and XML. (The Polyglot spec already says that the
UTF-8 BOM is the most polyglot encoding method.)

Other justifications:
   - Opera Software: "We have introduced the BOM as requirement for each
source file when we have written the build tools as a simple way to verify
that all files are utf8 encoded".
(http://stackoverflow.com/questions/4658985/how-to-keep-the-bom-when-editing-files-in-espresso)

NOTES: 

   (1) Browsers tested as part of this bug report: IE8, Safari and Chrome
(which show the behavior described above) as well as Opera and Firefox (which
do not show this behavior). Other browsers, e.g. KHTML, have not been tested.
   (2) BOM in UTF-16: I have not looked into how BOM in UTF-16 is handled by
parsers.
   (3) For the record: all browsers, including Firefox and Opera, *do* already
ignore the META charset *element* whenever there is a UTF-8 BOM. This bug
report says that they should *also* ignore HTTP.


Received on Monday, 6 June 2011 20:02:01 UTC