- From: <bugzilla@jessica.w3.org>
- Date: Thu, 15 Dec 2011 01:15:26 +0000
- To: public-html@w3.org
https://www.w3.org/Bugs/Public/show_bug.cgi?id=15195
Summary: apparently incorrect note about violation of Unicode
wrt stripping leading BOM
Product: HTML WG
Version: unspecified
Platform: All
OS/Version: All
Status: NEW
Severity: normal
Priority: P2
Component: HTML5 spec (editor: Ian Hickson)
AssignedTo: ian@hixie.ch
ReportedBy: glenn@skynav.com
QAContact: public-html-bugzilla@w3.org
CC: mike@w3.org, public-html-wg-issue-tracking@w3.org,
public-html@w3.org
Section 8.2.2.3 [1] includes a Note, quoted below, stating that stripping a
leading BOM is a violation of Unicode.
[1] http://dev.w3.org/html5/spec/Overview.html#preprocessing-the-input-stream
"Note: The requirement to strip a U+FEFF BYTE ORDER MARK character regardless
of whether that character was used to determine the byte order is a willful
violation of Unicode, motivated by a desire to increase the resilience of user
agents in the face of naïve transcoders."
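For illustration only, here is a minimal Python sketch of the behavior the Note
describes. The helper name strip_leading_bom is hypothetical and does not come
from the spec. It drops a single leading U+FEFF from the decoded stream,
however the encoding was determined. The sample input assumes a transcoder that
carried a BOM over into UTF-8 output.

```python
BOM = "\ufeff"  # U+FEFF BYTE ORDER MARK / ZERO WIDTH NO-BREAK SPACE


def strip_leading_bom(decoded: str) -> str:
    """Drop one leading U+FEFF, however the encoding was determined.

    Hypothetical helper, not the spec's algorithm verbatim.
    """
    if decoded.startswith(BOM):
        return decoded[1:]
    return decoded


# Assumed scenario: a transcoder left a BOM at the start of UTF-8 output.
raw = "\ufeff<!DOCTYPE html>".encode("utf-8")

# Python's plain "utf-8" codec keeps the BOM as a U+FEFF character.
text = raw.decode("utf-8")

cleaned = strip_leading_bom(text)
```

Only the first U+FEFF is removed; one appearing later in the stream is left
alone.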
I don't believe that stripping a BOM in the fashion described here violates
any conformance requirement of Unicode. If the editor nevertheless believes
that it does, then the note should cite the specific conformance clause of
Unicode Standard Section 3.2 [2] that is believed to be violated.
[2] http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf
Note that Unicode Section 16.8 [3], under "Byte Order Mark (BOM): U+FEFF",
recommends the removal of a leading BOM:
"Systems that use the byte order mark must recognize when an initial U+FEFF
signals the byte order. In those cases, it is not part of the textual content
and should be removed before processing, because otherwise it may be mistaken
for a legitimate zero width no-break space."
[3] http://www.unicode.org/versions/Unicode6.0.0/ch16.pdf
This language applies to HTML5 (as a system "that use[s] the byte order mark")
whether or not the BOM is actually used to determine the byte order on some
specific occasion. That is, the language of 16.8 cited above does not say "if
a system does not recognize that an initial U+FEFF signals the byte order (in
some particular case), then the BOM must not be removed".
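The two cases at issue can be sketched in Python, purely as an illustration
under assumed inputs. The variable names are hypothetical. In the first case
the decoder uses the initial U+FEFF to pick the byte order and consumes it. In
the second the encoding is supplied externally, so the U+FEFF survives decoding
and is stripped afterwards.

```python
# The bytes FF FE 68 00 69 00: a little-endian BOM followed by "hi".
data = "\ufeffhi".encode("utf-16-le")

# Case 1: the BOM is used to determine the byte order.
# Python's "utf-16" codec reads the BOM, picks little-endian, and drops it.
detected = data.decode("utf-16")

# Case 2: the byte order is given explicitly, so the BOM is not consulted
# and remains in the decoded text as a U+FEFF character.
explicit = data.decode("utf-16-le")

# Stripping the leading U+FEFF regardless of how the encoding was chosen,
# as the HTML5 Note requires, makes both cases yield the same text.
stripped = explicit[1:] if explicit.startswith("\ufeff") else explicit
```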
Regards,
Glenn
Received on Thursday, 15 December 2011 01:20:22 UTC