
[Bug 15195] New: apparently incorrect note about violation of Unicode wrt stripping leading BOM

From: <bugzilla@jessica.w3.org>
Date: Thu, 15 Dec 2011 01:15:26 +0000
To: public-html@w3.org
Message-ID: <bug-15195-2495@http.www.w3.org/Bugs/Public/>
https://www.w3.org/Bugs/Public/show_bug.cgi?id=15195

           Summary: apparently incorrect note about violation of Unicode
                    wrt stripping leading BOM
           Product: HTML WG
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HTML5 spec (editor: Ian Hickson)
        AssignedTo: ian@hixie.ch
        ReportedBy: glenn@skynav.com
         QAContact: public-html-bugzilla@w3.org
                CC: mike@w3.org, public-html-wg-issue-tracking@w3.org,
                    public-html@w3.org


Section 8.2.2.3 [1] includes a Note, cited below, stating that stripping a
leading BOM is a violation of Unicode.

[1] http://dev.w3.org/html5/spec/Overview.html#preprocessing-the-input-stream

"Note: The requirement to strip a U+FEFF BYTE ORDER MARK character regardless
of whether that character was used to determine the byte order is a willful
violation of Unicode, motivated by a desire to increase the resilience of user
agents in the face of naïve transcoders."

I don't believe that stripping a BOM in the fashion described here violates
any conformance requirement of Unicode. However, if the editor believes that
it does, then the note should cite the specific conformance clause of the
Unicode Standard, Section 3.2 [2], that is believed to be violated.

[2] http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf

Note that Unicode Section 16.8 [3], under "Byte Order Mark (BOM): U+FEFF",
recommends the removal of a leading BOM:

"Systems that use the byte order mark must recognize when an initial U+FEFF
signals the byte order. In those cases, it is not part of the textual content
and should be removed before processing, because otherwise it may be mistaken
for a legitimate zero width no-break space."

[3] http://www.unicode.org/versions/Unicode6.0.0/ch16.pdf

This language applies to HTML5 (as a system "that use[s] the byte order mark")
whether or not the BOM is actually used to determine the byte order on some
specific occasion. That is, the language of 16.8 cited above does not say "if a
system does not recognize that an initial U+FEFF signals the byte order (in
some particular case), then it (the BOM) must not be removed".
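
For illustration, the behavior under discussion (strip one leading U+FEFF
whether or not it was used to determine the byte order) can be sketched as
follows. This is only a minimal sketch, not the HTML5 encoding sniffing
algorithm: the function name decode_and_strip_bom and its declared_encoding
parameter are hypothetical, and only the UTF-8 and UTF-16 BOMs are handled.

```python
import codecs

BOM = "\ufeff"  # U+FEFF BYTE ORDER MARK / ZERO WIDTH NO-BREAK SPACE


def decode_and_strip_bom(data: bytes, declared_encoding: str = "utf-8") -> str:
    """Decode data, then drop a single leading U+FEFF if one is present.

    Hypothetical helper: declared_encoding stands in for whatever encoding
    the transport layer or document declared (an assumption for this sketch).
    """
    # A leading BOM, if present, selects the encoding (and byte order).
    if data.startswith(codecs.BOM_UTF8):
        encoding = "utf-8"
    elif data.startswith(codecs.BOM_UTF16_BE):
        encoding = "utf-16-be"
    elif data.startswith(codecs.BOM_UTF16_LE):
        encoding = "utf-16-le"
    else:
        encoding = declared_encoding

    # These codecs do not remove the BOM themselves, so a leading BOM
    # survives decoding as the character U+FEFF.
    text = data.decode(encoding)

    # Strip exactly one leading U+FEFF, regardless of whether it was what
    # determined the byte order (in UTF-8 it never does).
    if text.startswith(BOM):
        text = text[1:]
    return text
```

In the UTF-8 case the BOM carries no byte-order information, so this is an
instance of stripping the character "regardless of whether that character was
used to determine the byte order". A U+FEFF anywhere other than the very start
of the text is left untouched.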

Regards,
Glenn

Received on Thursday, 15 December 2011 01:20:22 GMT