- From: Brian Smith <brian@briansmith.org>
- Date: Fri, 29 Feb 2008 05:38:05 -0800
- To: "'HTML WG'" <public-html@w3.org>

Ian Hickson wrote:
> > However, when the encoding is UTF-16LE or UTF-16BE (i.e.
> > supposed to be signatureless), do we really want to drop
> > the BOM silently? Shouldn't it count as a character that
> > is in error?
>
> Do the UTF-16LE and UTF-16BE specs make a leading BOM an error?
>
> If yes, then we don't have to say anything, it's already an error.
>
> If not, what's the advantage of complaining about the BOM in
> this case?

See http://unicode.org/faq/utf_bom.html#28: "In particular, whenever a
data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE
a BOM must not be used." If somebody wants to include a zero-width
non-breaking space (ZWNBSP) at the beginning of a stream, they have to
use U+2060 WORD JOINER instead.

> > Likewise, if an encoding signature BOM has been discarded
> > and the first
> > logical character of the stream is another BOM, shouldn't that also
> > count as a character that is in error?
>
> The spec says: "Given an encoding, the bytes in the input
> stream must be converted to Unicode characters for the tokeniser, as
> described by the rules for that encoding, except that leading
> U+FEFF BYTE ORDER MARK characters must not be stripped by
> the encoding layer."

That is wrong. See http://unicode.org/faq/utf_bom.html#38. Only the
first character in a stream may be a byte order mark. Otherwise, they
are to be treated as a ZWNBSP for backwards compatibility.

- Brian

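For illustration, the distinction drawn above can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the helper name `decode_utf16` and its warning list are hypothetical, and it leans on the behaviour of Python's built-in codecs, where `utf-16` consumes a single leading BOM as the encoding signature while `utf-16-le` and `utf-16-be` leave U+FEFF in the decoded text. It reports a leading U+FEFF in an endian-declared stream but does not settle whether a parser should treat it as an error.

```python
BOM = "\ufeff"  # U+FEFF: BYTE ORDER MARK / ZERO WIDTH NO-BREAK SPACE

# Map declared labels to explicit Python codec names.
_CODECS = {"utf-16": "utf-16", "utf-16le": "utf-16-le", "utf-16be": "utf-16-be"}


def decode_utf16(data, declared):
    """Hypothetical helper: decode UTF-16 bytes, return (text, warnings)."""
    label = declared.lower()
    if label not in _CODECS:
        raise ValueError("not a UTF-16 label: " + declared)

    # "utf-16" strips one leading BOM (the signature); the endian-specific
    # codecs never strip U+FEFF, so it survives as an ordinary character.
    text = data.decode(_CODECS[label])

    warnings = []
    if label != "utf-16" and text.startswith(BOM):
        # The stream was declared signatureless, yet it starts with U+FEFF.
        warnings.append("leading U+FEFF in a stream declared " + declared)
    return text, warnings
```

With the input bytes `FF FE 41 00`, a declared encoding of UTF-16 yields `"A"` with no warnings, while a declared encoding of UTF-16LE yields U+FEFF followed by `"A"` plus one warning. With `FF FE FF FE 41 00` declared as UTF-16, only the first BOM is consumed and the second is kept in the text as a ZWNBSP.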
Received on Friday, 29 February 2008 13:38:21 UTC