- From: <bugzilla@jessica.w3.org>
- Date: Mon, 13 Jun 2011 23:40:58 +0000
- To: public-html@w3.org
http://www.w3.org/Bugs/Public/show_bug.cgi?id=12950

           Summary: Require Byte-Order Mark (BOM) in UTF-8 encoded pages
           Product: HTML WG
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P3
         Component: HTML5 spec (editor: Ian Hickson)
        AssignedTo: ian@hixie.ch
        ReportedBy: xn--mlform-iua@xn--mlform-iua.no
         QAContact: public-html-bugzilla@w3.org
                CC: mike@w3.org, public-html-wg-issue-tracking@w3.org,
                    public-html@w3.org

PROBLEMS:

 * Unnoticed: If a UTF-8 encoded page lacks both the BOM and the META charset element, and the page is served via HTTP without the charset parameter of the HTTP Content-Type: header, then UAs are likely to default to the legacy encoding of the user's locale - such as Windows-1252. This is a problem in itself, and it can go rather unnoticed by the user/author, e.g. if the author's language is expressible in ASCII letters.

 * Failing validator: As long as the page carries neither a BOM nor a META charset element, validators are likely to stamp the page as valid, with little or no warning, even though the Content-Type: charset parameter declares an incorrect encoding. Example: http://validator.nu/?doc=http%3A%2F%2Fmalform.no%2Ftesting%2Fhtml5%2Fbom%2Fhtm_BOM-less

 * To a certain degree, User Agents treat UTF-8 encoded pages loaded via the file:// protocol differently from pages served via the http:// protocol. They may autodetect the encoding over file://, but be more reluctant to autodetect - or override the charset of - HTTP. Example: Chrome. Some of these UAs tend to obey the BOM both via HTTP and via file://.

PROPOSAL:

 * Spec should say that authoring tools MUST - or at least SHOULD - insert the BOM in UTF-8 encoded pages.

 * Spec should encourage conformance checkers to recommend the UTF-8 BOM whenever the checker (via the HTTP Content-Type's charset parameter, the META@charset element or the validator user's encoding override) determines the encoding to be UTF-8.

JUSTIFICATION:

1. HTTP's priority: The UTF-8 BOM would enable conformance checkers to detect whether the HTTP charset parameter is used incorrectly. Because:

 * According to HTML5, the optional charset parameter of the HTTP Content-Type: header overrides both the locale default and the META charset element (if there is one).

 * But, unless there is a BOM, conformance checkers cannot programmatically determine whether the page is served with a correct Content-Type: charset parameter or not. (Because a UTF-8 encoded page that lacks the BOM may - technically - also be parsed in a legacy 8-bit encoding, although it then becomes entirely illegible to a human being.) [Of course the BOM might also be incorrect, but typically it is correct.]

 * In contrast, for a UTF-8 encoded page that does have a BOM: unless the page is actually served and parsed as UTF-8, the BOM will count as an illegal character before the DOCTYPE which, in turn, will trigger quirks mode. This is identical to how the UTF-8 BOM character(s), for XML documents, is/are illegal before the <?xml version="1.0" ?> declaration unless the page is actually served as UTF-8. If the page is determined to be in an encoding other than UTF-8, a BOM before the XML declaration should make XML parsers report a fatal error.

2. SIMPLICITY

 * Using an editor which inserts the BOM would be a simplification for the author: he or she would not need to specify the META charset element, and could also drop the charset parameter.

 * Pages with the BOM could be securely determined to be UTF-8 encoded when stored or moved. In contrast, if the page has no META charset element, which is fully legal, and also no BOM, then one is left to guess.

 * Pages with the BOM would not need the HTTP charset parameter.

3. POLYGLOTNESS

 * The BOM works in both XML and HTML, meaning that the author does not need to use other means that differ with the markup language (XML encoding declaration, META charset element etc.).

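To illustrate justification 1, here is a rough sketch in Python of the check a conformance checker could make once a BOM is present. It is only an illustration under stated assumptions, not how validator.nu or any other checker works: the names has_utf8_bom and check_http_label are made up for this example, and the checker is assumed to have the raw response bytes and the charset label from the Content-Type: header at hand.

```python
import codecs

UTF8_BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def has_utf8_bom(raw: bytes) -> bool:
    """True if the byte stream starts with the UTF-8 BOM."""
    return raw.startswith(UTF8_BOM)


def check_http_label(raw: bytes, http_charset):
    """Hypothetical checker step: return a warning string, or None.

    http_charset is the charset label from Content-Type:, or None
    if the header carries no charset parameter.
    """
    if not has_utf8_bom(raw):
        # Without a BOM the bytes alone do not prove the label wrong.
        return None
    if http_charset is None:
        return None
    try:
        # Normalise aliases such as "UTF-8", "utf8", "windows-1252".
        name = codecs.lookup(http_charset).name
    except LookupError:
        return f"unknown charset label {http_charset!r}"
    if name != "utf-8":
        return f"page starts with a UTF-8 BOM but is served as {http_charset}"
    return None


page = "<!DOCTYPE html><title>Blåbærsyltetøy</title>".encode("utf-8")

# BOM-less UTF-8 also decodes as a legacy 8-bit encoding without any
# error (it merely turns illegible), so the mislabelling goes undetected.
garbled = page.decode("windows-1252")
print(check_http_label(page, "windows-1252"))             # None
# With the BOM in place the mismatch becomes detectable.
print(check_http_label(UTF8_BOM + page, "windows-1252"))  # a warning
print(check_http_label(UTF8_BOM + page, "UTF-8"))         # None
```

Note that when the BOM-carrying page is decoded as Windows-1252 anyway, the three BOM bytes turn into stray characters in front of the DOCTYPE, which is the quirks-mode trigger described above.
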
POSITIVE EFFECTS:

 * Many editors that default to UTF-8 do not default to using the BOM - this change would encourage them to change.

 * Programmatic detection of HTTP-mislabelled UTF-8 pages.

 * Would encourage moving to UTF-8.

 * Markup-language-independent encoding declaration.

 * We shake off all the myths about the BOM: that it is incompatible with Web browsers etc. That said, this proposal would also bring attention to the issue, and make XML parsers handle the UTF-8 BOM, to the extent that they do not already handle it.

NOTES:

 * This bug describes an alternative to Bug 12897. That is: instead of making the BOM override the HTTP header, as requested by Bug 12987, this bug suggests making the UTF-8 BOM recommended. This would have some of the same effects as bug 12987, without introducing changes to RFC 3023.

-- 
Configure bugmail: http://www.w3.org/Bugs/Public/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
Received on Monday, 13 June 2011 23:41:00 UTC