[Bug 15195] New: apparently incorrect note about violation of Unicode wrt stripping leading BOM

From: <bugzilla@jessica.w3.org>
Date: Thu, 15 Dec 2011 01:15:20 +0000
To: public-html-bugzilla@w3.org
Message-ID: <bug-15195-2486@http.www.w3.org/Bugs/Public/>

           Summary: apparently incorrect note about violation of Unicode
                    wrt stripping leading BOM
           Product: HTML WG
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HTML5 spec (editor: Ian Hickson)
        AssignedTo: ian@hixie.ch
        ReportedBy: glenn@skynav.com
         QAContact: public-html-bugzilla@w3.org
                CC: mike@w3.org, public-html-wg-issue-tracking@w3.org,

Section [1] includes a Note, quoted below, stating that stripping the BOM is a
violation of Unicode.

[1] http://dev.w3.org/html5/spec/Overview.html#preprocessing-the-input-stream

"Note: The requirement to strip a U+FEFF BYTE ORDER MARK character regardless
of whether that character was used to determine the byte order is a willful
violation of Unicode, motivated by a desire to increase the resilience of user
agents in the face of naïve transcoders."
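
For illustration, the behavior the Note describes can be sketched as follows.
This is a minimal Python sketch, not the spec's algorithm: the function name
strip_leading_bom is hypothetical, and the input is assumed to be text that has
already been decoded by whatever means the encoding was determined.

```python
BOM = "\ufeff"  # U+FEFF BYTE ORDER MARK / ZERO WIDTH NO-BREAK SPACE


def strip_leading_bom(text: str) -> str:
    """Drop one leading U+FEFF, whether or not it was used to pick the byte order.

    Hypothetical helper for illustration only. Only the first character is
    examined; any later U+FEFF is left in place.
    """
    return text[1:] if text.startswith(BOM) else text


# Python's plain "utf-8" codec keeps the BOM as a U+FEFF character,
# so the decoded text still starts with it.
decoded = b"\xef\xbb\xbf<!DOCTYPE html>".decode("utf-8")
cleaned = strip_leading_bom(decoded)
```

The strip happens on decoded characters, so it applies equally when the
encoding came from a transport-level label rather than from the BOM itself.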

Firstly, I don't believe stripping the BOM in the fashion described here is a
violation of any conformance requirement of Unicode. However, if the editor
believes this to be the case, then the specific conformance clause of the
Unicode Standard Section 3.2 [2] believed to be violated should be cited in the
Note.

[2] http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf

Note that Unicode Section 16.8 [3], under "Byte Order Mark (BOM): U+FEFF",
recommends the removal of a leading BOM:

"Systems that use the byte order mark must recognize when an initial U+FEFF
signals the byte order. In those cases, it is not part of the textual content
and should be removed before processing, because otherwise it may be mistaken
for a legitimate zero width no-break space."

[3] http://www.unicode.org/versions/Unicode6.0.0/ch16.pdf

This language applies to HTML5 (as a system "that use[s] the byte order mark")
whether or not the BOM is actually used to determine the byte order on any
specific occasion. That is, the language of 16.8 cited above does not say "if a
system does not recognize that an initial U+FEFF signals the byte order (in
some particular case), then the BOM must not be removed".


Received on Thursday, 15 December 2011 01:15:21 UTC
