- From: <bugzilla@jessica.w3.org>
- Date: Mon, 06 Jun 2011 20:01:59 +0000
- To: public-html-bugzilla@w3.org
http://www.w3.org/Bugs/Public/show_bug.cgi?id=12897

           Summary: UTF-8 BOM should trump users and/or HTTP (Encoding
                    sniffing algorithm)
           Product: HTML WG
           Version: unspecified
          Platform: PC
               URL: http://dev.w3.org/html5/spec/parsing#encoding-sniffing-algorithm
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P3
         Component: HTML5 spec (editor: Ian Hickson)
        AssignedTo: ian@hixie.ch
        ReportedBy: xn--mlform-iua@xn--mlform-iua.no
         QAContact: public-html-bugzilla@w3.org
                CC: mike@w3.org, public-html-wg-issue-tracking@w3.org,
                    public-html@w3.org

PROPOSAL:

 Specify IE's and WebKit's handling of the Byte Order Mark for the UTF-8
 encoding as REQUIRED: Whenever the document begins with the UTF-8 Byte Order
 Mark, ignore the encoding info of the HTTP "Content-Type: text/html;
 charset=[encodingname]" header, and also ignore any user action to override
 the document's encoding.

 Consequently,
   * when there is a UTF-8 BOM, the encoding info provided by HTTP and by the
     user should be treated as irrelevant
   * the first two steps of the encoding sniffing algorithm must be changed

CURRENT STATUS:

 The first two steps of the encoding sniffing algorithm give users and the
 transport layer (HTTP/MIME) the power to override a document's character
 encoding:

 ]] 1. If the user has explicitly instructed the user agent to override the
       document's character encoding with a specific encoding, optionally
       return that encoding with the confidence certain and abort these steps.
    2. If the transport layer specifies an encoding, and it is supported,
       return that encoding with the confidence certain, and abort these
       steps.[[

 HOWEVER, the reality is that two major user agents operate with an exception
 to the above rules: Whenever the document includes the UTF-8 Byte Order Mark,
 Internet Explorer and WebKit
   - do *not* allow users to override the encoding
   - do *not* respect the encoding information in the HTTP server's
     Content-Type header
   - do *not* permit their heuristic character detection features to guess any
     encoding other than UTF-8

 Consequently, in IE and WebKit it is impossible for the user, as well as for
 an HTTP server, to cause a document with the UTF-8 Byte Order Mark to be
 interpreted as e.g. KOI8-R encoded or Windows-1252 encoded.

 In contrast, Firefox and Opera
   - *do* obey the HTTP server's Content-Type header even when there is a
     UTF-8 BOM
   - *do* allow users to override the encoding even when there is a UTF-8 BOM
   - *do* permit their heuristic character detection features to guess an
     encoding other than UTF-8 (Opera and Firefox allow their users to
     tune/fiddle with how their heuristic encoding sniffing works.)

 Consequently, in Firefox and Opera it is *possible* for the user, as well as
 for an HTTP server, to cause a document with the UTF-8 Byte Order Mark to be
 interpreted as e.g. KOI8-R encoded or Windows-1252 encoded.

BENEFITS:

 A. Harmonization with XML 1.0 Appendix F.2, "Priorities in the Presence of
    External Encoding Information", which recommends that the BOM have higher
    priority than external encoding information:
    http://www.w3.org/TR/xml/#sec-guessing-with-ext-info
    (Opera/Firefox do not yet implement this XML 1.0 recommendation)
 B. A simple, reliable way to specify the UTF-8 encoding.
 C. Firefox/Opera converge with IE/WebKit = browsers become more
    interoperable.
 D. Security: chameleon documents (where the document gets another, risky
    interpretation when read as a legacy encoding) become more difficult to
    create.
 E. User experience: less "gibberish" and less "mojibake" for users [*]
    [*] http://en.wikipedia.org/wiki/Mojibake
 F. Same as A): Promotes a polyglot way to specify the encoding: the BOM works
    in both HTML and XML. (The Polyglot spec already says that the UTF-8 BOM
    is the most polyglot encoding method.)

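To make the difference between the two groups of browsers concrete, here is a minimal Python sketch of the first steps of a sniffing algorithm, with and without the proposed BOM precedence. It is an illustration only, not any browser's implementation and not spec text: the names sniff_encoding, is_supported and bom_trumps are hypothetical, the fallback default is a placeholder, and later sniffing steps (META prescan, heuristics) are omitted.

```python
import codecs

UTF8_BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def is_supported(label):
    """Hypothetical helper: treat an encoding as supported if Python knows it."""
    try:
        codecs.lookup(label)
        return True
    except LookupError:
        return False


def sniff_encoding(data, user_override=None, http_charset=None, bom_trumps=True):
    """Return (encoding, confidence) for a simplified sniffing algorithm.

    bom_trumps=True models the proposal (the IE/WebKit behavior described
    above); bom_trumps=False models the current first two steps, as
    described for Firefox/Opera.
    """
    has_bom = data.startswith(UTF8_BOM)
    # Proposed step 0: a UTF-8 BOM wins over the user and over HTTP.
    if bom_trumps and has_bom:
        return ("utf-8", "certain")
    # Current step 1: explicit user override.
    if user_override and is_supported(user_override):
        return (user_override, "certain")
    # Current step 2: encoding given by the transport layer.
    if http_charset and is_supported(http_charset):
        return (http_charset, "certain")
    # Only reached when nothing above applied.
    if has_bom:
        return ("utf-8", "certain")
    return ("windows-1252", "tentative")  # placeholder default (assumption)


# A document that starts with the UTF-8 BOM, served with a conflicting header.
doc = UTF8_BOM + "é".encode("utf-8")

with_bom_rule = sniff_encoding(doc, http_charset="windows-1252")
without_bom_rule = sniff_encoding(doc, http_charset="windows-1252", bom_trumps=False)

print(with_bom_rule[0], repr(doc.decode("utf-8-sig")))
print(without_bom_rule[0], repr(doc.decode(without_bom_rule[0])))
```

With the BOM rule the bytes decode as the intended "é"; without it, the conflicting header wins and the same bytes are read as Windows-1252, which is the mojibake case from benefit E.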
Other justifications:

 - Opera Software: "We have introduced the BOM as requirement for each source
   file when we have written the build tools as a simple way to verify that
   all files are utf8 encoded".
   (http://stackoverflow.com/questions/4658985/how-to-keep-the-bom-when-editing-files-in-espresso)

NOTES:

 (1) Browsers tested as part of this bug report: IE8, Safari, Chrome (which
     show the behavior described above) as well as Opera and Firefox (which do
     not support this behavior). Other browsers, e.g. KHTML, have not been
     tested.
 (2) BOM in UTF-16: I have not looked into how the BOM in UTF-16 is handled by
     parsers.
 (3) For the record: All browsers, including Firefox and Opera, *do* already
     ignore the META charset *element* whenever there is a UTF-8 BOM. This bug
     report says that they should *also* ignore HTTP.

-- 
Configure bugmail: http://www.w3.org/Bugs/Public/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the QA contact for the bug.
Received on Monday, 6 June 2011 20:02:01 UTC