- From: <bugzilla@jessica.w3.org>
- Date: Mon, 06 Jun 2011 20:01:59 +0000
- To: public-html-bugzilla@w3.org
http://www.w3.org/Bugs/Public/show_bug.cgi?id=12897

Summary: UTF-8 BOM should trump users and/or HTTP (Encoding sniffing algorithm)
Product: HTML WG
Version: unspecified
Platform: PC
URL: http://dev.w3.org/html5/spec/parsing#encoding-sniffing-algorithm
OS/Version: All
Status: NEW
Severity: normal
Priority: P3
Component: HTML5 spec (editor: Ian Hickson)
AssignedTo: ian@hixie.ch
ReportedBy: xn--mlform-iua@xn--mlform-iua.no
QAContact: public-html-bugzilla@w3.org
CC: mike@w3.org, public-html-wg-issue-tracking@w3.org, public-html@w3.org

PROPOSAL:

Spec IE and Webkit's handling of the Byte Order Mark for the UTF-8 encoding as REQUIRED: Whenever the document begins with the UTF-8 Byte Order Mark, ignore the encoding info of the HTTP "Content-Type: text/html; charset=[encodingname]" header, and also ignore any user action to override the document's encoding.

Consequently,
* when there is a UTF-8 BOM, the encoding info provided by HTTP and by the user should be treated as irrelevant
* the first two steps of the encoding sniffing algorithm must be changed

CURRENT STATUS:

The first two steps of the encoding sniffing algorithm give users and the transport layer (HTTP/MIME) the power to override a document's character encoding:

]] 1. If the user has explicitly instructed the user agent to override the document's character encoding with a specific encoding, optionally return that encoding with the confidence certain and abort these steps.
2. If the transport layer specifies an encoding, and it is supported, return that encoding with the confidence certain, and abort these steps.[[

HOWEVER, in reality two major user agents operate with an exception to the above rules: Whenever the document includes the UTF-8 Byte Order Mark, Internet Explorer and Webkit
- do *not* allow users to override the encoding
- do *not* respect the encoding information in the HTTP server's Content-Type header
- do *not* permit their heuristic character detection features to guess any encoding other than UTF-8

Consequently, in IE and Webkit it is impossible for the user - as well as for an HTTP server - to cause a document with the UTF-8 Byte Order Mark to be interpreted as e.g. KOI8-R encoded or Windows-1252 encoded.

In contrast, Firefox and Opera
- *do* obey the HTTP server's Content-Type header even when there is a UTF-8 BOM
- *do* allow users to override the encoding even when there is a UTF-8 BOM
- *do* permit their heuristic character detection features to guess an encoding other than UTF-8
  (Opera and Firefox allow their users to tune how their heuristic encoding sniffing works.)

Consequently, in Firefox and Opera it is *possible* for the user - as well as for an HTTP server - to cause a document with the UTF-8 Byte Order Mark to be interpreted as e.g. KOI8-R encoded or Windows-1252 encoded.

BENEFITS:

A. Harmonization with XML 1.0 Appendix F.2, "Priorities in the Presence of External Encoding Information", which recommends that the BOM have higher priority than external encoding information: http://www.w3.org/TR/xml/#sec-guessing-with-ext-info (Opera/Firefox do not yet implement this XML 1.0 recommendation)
B. A simple, reliable way to specify the UTF-8 encoding.
C. Firefox/Opera converge with IE/Webkit, which makes browsers more interoperable.
D. Security: chameleon documents (where the document gets another, risky interpretation when read as a legacy encoding) become more difficult to create.
E. User experience: less "gibberish" and less "mojibake" for users.
   [*] http://en.wikipedia.org/wiki/Mojibake
F. As in A: promotes a polyglot way to specify the encoding, since the BOM works in both HTML and XML. (The Polyglot spec already says that the UTF-8 BOM is the most polyglot encoding method.)

Other justifications:
- Opera Software: "We have introduced the BOM as requirement for each source file when we have written the build tools as a simple way to verify that all files are utf8 encoded".
(http://stackoverflow.com/questions/4658985/how-to-keep-the-bom-when-editing-files-in-espresso)

NOTES:

(1) Browsers tested as part of this bug report: IE8, Safari, Chrome (which show the behavior described above) as well as Opera and Firefox (which do not show this behavior). Other browsers, e.g. KHTML, have not been tested.
(2) BOM in UTF-16: I have not looked into how the BOM in UTF-16 is handled by parsers.
(3) For the record: All browsers, including Firefox and Opera, *do* already ignore the META charset *element* whenever there is a UTF-8 BOM. This bug report says that they should *also* ignore HTTP.
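
For illustration, the proposed ordering of the first steps could be sketched roughly as below, in Python. This is not spec text and not any browser's code. The names sniff_encoding, is_supported, user_override and transport_charset are hypothetical and invented for this sketch. Python's codec registry stands in for "supported encodings", and the user override (which the spec makes optional) is always honoured here. The later steps of the algorithm (META prescan, heuristics) are left out.

```python
import codecs
from typing import Optional, Tuple

UTF8_BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def is_supported(label: str) -> bool:
    """Hypothetical helper: treat any codec Python knows as 'supported'."""
    try:
        codecs.lookup(label)
        return True
    except LookupError:
        return False


def sniff_encoding(
    head: bytes,
    user_override: Optional[str] = None,
    transport_charset: Optional[str] = None,
) -> Tuple[Optional[str], str]:
    """Return (encoding, confidence) for the first bytes of a document.

    Sketch of the PROPOSED order: the UTF-8 BOM is checked before the
    user override and before the transport layer (HTTP/MIME).
    """
    # Proposed new first step: a UTF-8 BOM trumps everything else.
    if head.startswith(UTF8_BOM):
        return "utf-8", "certain"

    # Current step 1: explicit user override (assumed always honoured here).
    if user_override is not None:
        return user_override, "certain"

    # Current step 2: transport layer charset, if it is supported.
    if transport_charset is not None and is_supported(transport_charset):
        return transport_charset, "certain"

    # Remaining steps of the algorithm are omitted from this sketch.
    return None, "tentative"


if __name__ == "__main__":
    doc = UTF8_BOM + b"<!DOCTYPE html>"
    # HTTP says windows-1252 and the user picks KOI8-R, but the BOM is present.
    print(sniff_encoding(doc, user_override="koi8-r",
                         transport_charset="windows-1252"))
```

With the BOM check moved to the front, neither the Content-Type header nor a user override can cause a BOM-prefixed document to be read as KOI8-R or Windows-1252, which is the IE/Webkit behavior described above. Without the BOM, the two existing steps apply unchanged.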
Received on Monday, 6 June 2011 20:02:01 UTC