- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Sat, 26 May 2007 00:51:03 +0300
- To: HTML WG <public-html@w3.org>
On May 26, 2007, at 00:32, Henri Sivonen wrote:

> The draft says:
> "A leading U+FEFF BYTE ORDER MARK (BOM) must be dropped if present."
>
> That's reasonable for UTF-8 when the encoding has been established
> by other means.
>
> However, when the encoding is UTF-16LE or UTF-16BE (i.e. supposed
> to be signatureless), do we really want to drop the BOM silently?
> Shouldn't it count as a character that is in error?
>
> Likewise, if an encoding signature BOM has been discarded and the
> first logical character of the stream is another BOM, shouldn't
> that also count as a character that is in error?

I think I should elaborate that when the encoding is UTF-16 (not UTF-16LE or UTF-16BE), the BOM gets swallowed by the character decoding layer (in reasonable decoder implementations) and is not returned from the character stream at all. Therefore, on the character level, a droppable BOM only occurs in UTF-8 when the encoding was established by other means.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
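
The distinction drawn above can be illustrated with a minimal sketch. It assumes Python's built-in codecs as one example of a decoder; the message does not name any particular implementation, and the helper `first_char_is_bom` is hypothetical, introduced only for this illustration. It decodes the letter "A" preceded by a BOM under each encoding label and reports whether U+FEFF survives into the character stream.

```python
BOM = "\ufeff"


def first_char_is_bom(data: bytes, encoding: str) -> bool:
    """Return True if the decoded text starts with U+FEFF."""
    return data.decode(encoding).startswith(BOM)


# The letter "A" preceded by a BOM, in each byte layout.
utf16_le_bytes = b"\xff\xfeA\x00"
utf16_be_bytes = b"\xfe\xff\x00A"
utf8_bytes = b"\xef\xbb\xbfA"

results = {
    # Plain "utf-16": the decoder uses the BOM as a signature and consumes it.
    "utf-16": first_char_is_bom(utf16_le_bytes, "utf-16"),
    # Signatureless labels: the BOM is passed through as an ordinary character.
    "utf-16-le": first_char_is_bom(utf16_le_bytes, "utf-16-le"),
    "utf-16-be": first_char_is_bom(utf16_be_bytes, "utf-16-be"),
    # UTF-8 established by other means: the BOM reaches the character level.
    "utf-8": first_char_is_bom(utf8_bytes, "utf-8"),
}

for label, kept in results.items():
    print(f"{label}: BOM reaches character stream = {kept}")
```

With these codecs, only the plain "utf-16" label hides the BOM from the character level. In the other three cases the consumer of the character stream sees U+FEFF and has to decide whether to drop it or treat it as an error.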
Received on Friday, 25 May 2007 21:51:18 UTC