RE: BOM (several messages about handling encodings in HTML) from Brian Smith on 2008-02-29 (public-html@w3.org from February 2008)

From: Brian Smith <brian@briansmith.org>
Date: Fri, 29 Feb 2008 05:38:05 -0800
To: "'HTML WG'" <public-html@w3.org>
Message-ID: <003601c87ad8$5279fcb0$6401a8c0@T60>

Ian Hickson wrote:
> > However, when the encoding is UTF-16LE or UTF-16BE (i.e. 
> > supposed to be signatureless), do we really want to drop
> > the BOM silently? Shouldn't it count as a character that
> > is in error?
> 
> Do the UTF-16LE and UTF-16BE specs make a leading BOM an error?
> 
> If yes, then we don't have to say anything, it's already an error.
> 
> If not, what's the advantage of complaining about the BOM in 
> this case?

See http://unicode.org/faq/utf_bom.html#28:

"In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used." 

If somebody wants to include a zero-width non-breaking space (ZWNBSP) at the beginning of a stream, they have to use U+2060 WORD JOINER instead. 
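
To make the signature versus signatureless distinction concrete, here is a minimal sketch. It assumes Python's built-in codecs as a stand-in for an encoding layer; it is an illustration, not what any browser or the HTML spec does.

```python
# Illustration only: Python's built-in codecs stand in for an encoding layer.
data = b"\xff\xfeA\x00"  # bytes FF FE, then "A" in little-endian UTF-16

# "utf-16" expects a signature: the BOM selects the byte order and is consumed.
with_signature = data.decode("utf-16")

# "utf-16-le" is signatureless: the same two bytes decode to a U+FEFF character.
signatureless = data.decode("utf-16-le")

print(ascii(with_signature))   # 'A'
print(ascii(signatureless))    # '\ufeffA'
```

In the second case the U+FEFF survives as an ordinary character, which is the case the FAQ entry above says must not occur in a stream declared as UTF-16LE.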

> > Likewise, if an encoding signature BOM has been discarded 
> > and the first 
> > logical character of the stream is another BOM, shouldn't that also 
> > count as a character that is in error?
>
> The spec says: "Given an encoding, the bytes in the input 
> stream must be converted to Unicode characters for the tokeniser, as 
> described by the rules for that encoding, except that leading
> U+FEFF BYTE ORDER MARK characters must not be stripped by
> the encoding layer."

That is wrong. See http://unicode.org/faq/utf_bom.html#38. Only the first character in a stream may be a byte order mark. A U+FEFF anywhere else is to be treated as a ZWNBSP for backwards compatibility.
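
A rough sketch of that rule, working on already-decoded text. The helper names `strip_signature` and `later_feff_positions` are hypothetical, made up for this example, and do not come from any spec or library.

```python
BOM = "\ufeff"

def strip_signature(text):
    """Drop at most one leading U+FEFF (hypothetical helper, not from any spec)."""
    return text[1:] if text.startswith(BOM) else text

def later_feff_positions(text):
    """Indexes of the U+FEFF characters that remain, i.e. ones to treat as ZWNBSP."""
    return [i for i, ch in enumerate(text) if ch == BOM]

decoded = "\ufeff\ufeffab\ufeffc"   # signature, then two more U+FEFF characters
body = strip_signature(decoded)

print(ascii(body))                  # '\ufeffab\ufeffc'
print(later_feff_positions(body))   # [0, 3]
```

Only the first U+FEFF is consumed as a signature. The two that remain are left in the text and reported, so a conformance checker could flag them or a consumer could treat them as ZWNBSP.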

- Brian

Received on Friday, 29 February 2008 13:38:21 UTC