[Bug 12950] New: Require Byte-Order Mark (BOM) in UTF-8 encoded pages from bugzilla@jessica.w3.org on 2011-06-13 (public-html@w3.org from June 2011)

From: <bugzilla@jessica.w3.org>
Date: Mon, 13 Jun 2011 23:40:58 +0000
To: public-html@w3.org
Message-ID: <bug-12950-2495@http.www.w3.org/Bugs/Public/>
http://www.w3.org/Bugs/Public/show_bug.cgi?id=12950

           Summary: Require Byte-Order Mark (BOM) in UTF-8 encoded pages
           Product: HTML WG
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P3
         Component: HTML5 spec (editor: Ian Hickson)
        AssignedTo: ian@hixie.ch
        ReportedBy: xn--mlform-iua@xn--mlform-iua.no
         QAContact: public-html-bugzilla@w3.org
                CC: mike@w3.org, public-html-wg-issue-tracking@w3.org,
                    public-html@w3.org


PROBLEMS:

 * Unnoticed: If a UTF-8 encoded page lacks both the BOM and the META
charset element, and if that page is served via HTTP without the HTTP
Content-Type: charset parameter, then UAs are likely to default to the legacy
encoding of the user's locale, such as Windows-1252. This is a problem in
itself, and can go rather unnoticed by the user/author, e.g. if the language
of the author is expressible in ASCII letters.
 * Failing validator: As long as the page carries neither a BOM nor a META
charset element, validators are likely to stamp the page as valid, with little
or no warning, even though the Content-Type: charset parameter carries an
incorrect encoding.
    Example:
http://validator.nu/?doc=http%3A%2F%2Fmalform.no%2Ftesting%2Fhtml5%2Fbom%2Fhtm_BOM-less
 * User agents to a certain degree treat UTF-8 encoded pages loaded via the
file:// protocol differently from pages served via the http:// protocol. They
may autodetect the encoding over the file protocol, but be more reluctant to
autodetect, or to override the charset of, HTTP. Example: Chrome. Some of these
UAs tend to obey the BOM both via HTTP and via file.
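
The "unnoticed" case above can be illustrated with a minimal Python sketch.
The sample strings and the helper name decode_as_legacy are made up for
illustration; the sketch only assumes that the UA falls back to Windows-1252.

```python
# Sketch: what a BOM-less, unlabeled UTF-8 page looks like when a UA falls
# back to Windows-1252. Sample pages and helper name are hypothetical.

ascii_page = "<!DOCTYPE html><title>Hello</title>"
norwegian_page = "<!DOCTYPE html><title>Blåbærsyltetøy</title>"


def decode_as_legacy(text: str) -> str:
    """Encode as UTF-8 (what the author saved), then decode as
    Windows-1252 (what the UA guessed)."""
    return text.encode("utf-8").decode("cp1252")


# ASCII-only content survives the wrong guess, so nobody notices.
print(decode_as_legacy(ascii_page) == ascii_page)

# Non-ASCII letters turn into two-character mojibake sequences.
print(decode_as_legacy(norwegian_page))
```

An author who writes only ASCII letters gets identical output either way,
which is why the mislabeling can go undetected for a long time.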

PROPOSAL: 

 * Spec should say that authoring tools MUST - or at least SHOULD - insert the
BOM in UTF-8 encoded pages.
 * Spec should encourage conformance checkers to recommend the UTF-8 BOM
whenever the checker (via the charset parameter of the HTTP Content-Type:
header, the META charset element or the validator user's encoding override)
determines the encoding to be UTF-8.
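
As a rough sketch of both halves of the proposal, the following Python shows
how an authoring tool could emit the BOM and how a checker could recommend it.
The function names and the message text are hypothetical; only the standard
codecs module and the "utf-8-sig" codec are assumed.

```python
import codecs
from typing import Optional


def encode_with_bom(markup: str) -> bytes:
    """Authoring-tool side: serialize as UTF-8 with a leading BOM."""
    # The "utf-8-sig" codec prepends the bytes EF BB BF when encoding.
    return markup.encode("utf-8-sig")


def bom_recommendation(raw: bytes, determined_encoding: str) -> Optional[str]:
    """Checker side: return a (hypothetical) advisory message when the
    encoding was determined to be UTF-8 but the page has no BOM."""
    try:
        is_utf8 = codecs.lookup(determined_encoding).name == "utf-8"
    except LookupError:
        return None  # unknown label: out of scope for this sketch
    if is_utf8 and not raw.startswith(codecs.BOM_UTF8):
        return "Encoding determined to be UTF-8: consider adding a UTF-8 BOM."
    return None


print(encode_with_bom("<!DOCTYPE html>")[:3] == codecs.BOM_UTF8)
print(bom_recommendation(b"<!DOCTYPE html>", "UTF-8"))
```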

JUSTIFICATION:

 1. HTTP's priority:
 The UTF-8 BOM would enable conformance checkers to detect whether the HTTP
charset property is used incorrectly. Because:
 * According to HTML5, the optional @charset property of the HTTP Content-Type:
overrides both the locale default and the META charset element (if there is
one). 
 * But, unless there is a BOM, conformance checkers cannot programmatically
determine whether the page is served with a correct Content-Type: charset
parameter or not. (Because, although it becomes entirely illegible to a human
being, a UTF-8 encoded page that is lacking the BOM, may - technically - also
be parsed in a legacy 8-bit encoding.) [Of course the BOM might also be
incorrect, but typically it is correct.]
 * In contrast, for a UTF-8 encoded page that does have a BOM, unless the
page is actually served and parsed as UTF-8, the BOM will count as an illegal
character before the DOCTYPE which, in turn, will trigger quirks mode. This is
identical to how the UTF-8 BOM character(s), for XML documents, is/are illegal
before the <?xml version="1.0" ?> declaration, unless the page is actually
served as UTF-8. A BOM before the XML declaration, if the page is determined
not to be UTF-8, should make XML parsers display a fatal error.
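
The detection argument above can be sketched in Python as follows. The
function name and its return labels are invented for illustration, and the
logic is deliberately simplified: it only compares the BOM with the HTTP
label and does not model a real conformance checker.

```python
import codecs


def check_http_label(raw: bytes, http_charset: str) -> str:
    """Compare the HTTP charset label with the evidence of the BOM.
    Return values are hypothetical labels for this sketch."""
    try:
        labeled_utf8 = codecs.lookup(http_charset).name == "utf-8"
    except LookupError:
        return "unknown-label"
    if raw.startswith(codecs.BOM_UTF8):
        # The BOM gives the checker something to test the label against.
        return "consistent" if labeled_utf8 else "mislabeled"
    # Without a BOM the bytes may also parse under a legacy 8-bit
    # encoding, so the label cannot be confirmed or refuted this way.
    return "undetermined"


page = codecs.BOM_UTF8 + "<!DOCTYPE html><title>Blåbær</title>".encode("utf-8")

print(check_http_label(page, "windows-1252"))      # BOM contradicts label
print(check_http_label(page[3:], "windows-1252"))  # same page, BOM removed

# Parsed as Windows-1252, the BOM bytes become three visible characters
# in front of the DOCTYPE.
print(page.decode("cp1252")[:12])
```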

 2. SIMPLICITY
 * Using an editor which uses the BOM would be a simplification for the
author: he or she would not need to specify the META charset element, and
could also drop the charset parameter.
 * Pages with the BOM could be securely determined to be UTF-8 encoded when
stored or moved. In contrast, if the page has no META charset element, which is
fully legal, and also no BOM, then one is left to guess the encoding.
 * Pages with the BOM would not need the HTTP charset parameter.
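
To illustrate the stored-file case, here is a small Python sketch contrasting
BOM-based detection with guessing. The helper names and the candidate list
are assumptions made for this example only.

```python
import codecs
from typing import List, Optional, Sequence


def encoding_from_bom(raw: bytes) -> Optional[str]:
    """Return "utf-8" when the file starts with the UTF-8 BOM, else None."""
    return "utf-8" if raw.startswith(codecs.BOM_UTF8) else None


def candidate_encodings(
    raw: bytes,
    candidates: Sequence[str] = ("utf-8", "cp1252", "iso-8859-1"),
) -> List[str]:
    """List the candidate encodings under which raw decodes without error."""
    usable = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        usable.append(name)
    return usable


stored = "<title>Blåbær</title>".encode("utf-8")

print(encoding_from_bom(codecs.BOM_UTF8 + stored))  # BOM settles it
print(encoding_from_bom(stored))                    # no BOM, no answer
print(candidate_encodings(stored))                  # several remain possible
```

Without the BOM, every candidate in this sample decodes the bytes without
error, so a tool has to fall back on heuristics to pick one.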

 3. POLYGLOTNESS
 * The BOM works in both XML and HTML, meaning that the author does not need to
use other means that differ with the markup language (XML encoding
declaration, META charset element, etc.).

POSITIVE EFFECTS:
 * Many editors that default to UTF-8 do not default to using the BOM; this
change would encourage them to change.
 * Programmatic detection of HTTP-mislabeled UTF-8 pages.
 * Would encourage moving to UTF-8.
 * Markup language independent encoding.
 * We shake off all the myths about the BOM: that it is incompatible with Web
browsers etc. That said, this proposal would also bring attention to the issue,
and make XML parsers handle the UTF-8 BOM, to the extent that they do not
already handle it.


NOTES:
  * This bug describes an alternative to Bug 12897. That is: instead of making
the BOM override the HTTP header, as requested by Bug 12897, this bug
suggests making the UTF-8 BOM recommended. This would have some of the same
effects as Bug 12897, without introducing changes to RFC 3023.

Received on Monday, 13 June 2011 23:41:00 UTC