- From: <bugzilla@jessica.w3.org>
- Date: Thu, 15 Dec 2011 01:15:20 +0000
- To: public-html-bugzilla@w3.org
https://www.w3.org/Bugs/Public/show_bug.cgi?id=15195

Summary: apparently incorrect note about violation of Unicode wrt stripping leading BOM
Product: HTML WG
Version: unspecified
Platform: All
OS/Version: All
Status: NEW
Severity: normal
Priority: P2
Component: HTML5 spec (editor: Ian Hickson)
AssignedTo: ian@hixie.ch
ReportedBy: glenn@skynav.com
QAContact: public-html-bugzilla@w3.org
CC: mike@w3.org, public-html-wg-issue-tracking@w3.org, public-html@w3.org

Section 8.2.2.3 [1] includes a Note, cited below, that stripping BOM is a violation of Unicode.

[1] http://dev.w3.org/html5/spec/Overview.html#preprocessing-the-input-stream

"Note: The requirement to strip a U+FEFF BYTE ORDER MARK character regardless of whether that character was used to determine the byte order is a willful violation of Unicode, motivated by a desire to increase the resilience of user agents in the face of naïve transcoders."

Firstly, I don't believe stripping BOM in the fashion described here is a violation of any conformance requirement of Unicode. However, if the editor believes this to be the case, then the specific compliance clause of the Unicode Standard Section 3.2 [2] believed to be violated should be cited in the note.

[2] http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf

Note that Unicode Section 16.8 [3], under "Byte Order Mark (BOM): U+FEFF" recommends the removal of a leading BOM:

"Systems that use the byte order mark must recognize when an initial U+FEFF signals the byte order. In those cases, it is not part of the textual content and should be removed before processing, because otherwise it may be mistaken for a legitimate zero width no-break space."

[3] http://www.unicode.org/versions/Unicode6.0.0/ch16.pdf

This language applies to HTML5 (as a system "that use[s] the byte order mark") whether it is in fact used or not used (on some specific occasion).
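
For illustration only, here is a minimal Python sketch of the behavior the Note describes: a leading U+FEFF is dropped from the decoded input whether or not it was needed to determine the byte order. The helper name `strip_leading_bom` and the sample markup are hypothetical and come from neither the HTML5 draft nor the Unicode Standard. Removing at most one leading U+FEFF is an assumption about the intended behavior, and the sketch does not model the spec's encoding detection.

```python
# Illustrative sketch only; the names below are hypothetical.
BOM = "\ufeff"  # U+FEFF: BYTE ORDER MARK / ZERO WIDTH NO-BREAK SPACE


def strip_leading_bom(text: str) -> str:
    """Remove at most one leading U+FEFF from already-decoded text."""
    return text[1:] if text.startswith(BOM) else text


# Case 1: UTF-16 input, where an initial U+FEFF does signal byte order.
# Python's "utf-16-le" codec leaves the BOM in the decoded string.
utf16_input = (BOM + "<p>hi</p>").encode("utf-16-le")
case1 = strip_leading_bom(utf16_input.decode("utf-16-le"))

# Case 2: UTF-8 input that still begins with U+FEFF (bytes EF BB BF),
# as a naive transcoder might leave it. Here the character says nothing
# about byte order, yet it is stripped in exactly the same way.
utf8_input = (BOM + "<p>hi</p>").encode("utf-8")
case2 = strip_leading_bom(utf8_input.decode("utf-8"))

# Case 3: a U+FEFF that is not at the start of the input is left alone.
case3 = strip_leading_bom("<p>\ufeffhi</p>")

print(case1 == case2 == "<p>hi</p>")
print(case3 == "<p>\ufeffhi</p>")
```

Case 2 is the one the Note calls a "willful violation"; cases 1 and 3 are not in dispute.
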
That is, the language of 16.8 cited above does not say "if a system does not recognize an initial U+FEFF (in some particular case) signals the byte order, then it (the BOM) must not be removed".

Regards,
Glenn

--
Configure bugmail: https://www.w3.org/Bugs/Public/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the QA contact for the bug.
Received on Thursday, 15 December 2011 01:15:21 UTC