Re: Leading BOM from Henri Sivonen on 2007-05-25 (public-html@w3.org from May 2007)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Sat, 26 May 2007 00:51:03 +0300
To: HTML WG <public-html@w3.org>
Message-Id: <B86A5C89-091E-4C14-99E0-33EFC6918333@iki.fi>

On May 26, 2007, at 00:32, Henri Sivonen wrote:

> The draft says:
> "A leading U+FEFF BYTE ORDER MARK (BOM) must be dropped if present."
>
> That's reasonable for UTF-8 when the encoding has been established  
> by other means.
>
> However, when the encoding is UTF-16LE or UTF-16BE (i.e. supposed  
> to be signatureless), do we really want to drop the BOM silently?  
> Shouldn't it count as a character that is in error?
>
> Likewise, if an encoding signature BOM has been discarded and the  
> first logical character of the stream is another BOM, shouldn't  
> that also count as a character that is in error?

I think I should elaborate that when the encoding is UTF-16 (not  
UTF-16LE or UTF-16BE), the BOM gets swallowed by the character  
decoding layer (in reasonable decoder implementations) and is not  
returned from the character stream at all. Therefore, on the  
character level, a droppable BOM only occurs in UTF-8 when the  
encoding was established by other means.
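
To make the distinction concrete, here is a minimal sketch using Python's
built-in codecs as one example of a decoder implementation. Python is only
an illustration here, not something the draft refers to, and the sample
text and variable names are made up for the example. Other decoders may
behave differently.

```python
import codecs

# Hypothetical sample content; any text would do.
text = "<html>"
payload_le = text.encode("utf-16-le")

# Encoding declared as UTF-16: the decoder uses the BOM as a signature
# and does not return it as a character.
as_utf16 = (codecs.BOM_UTF16_LE + payload_le).decode("utf-16")

# Encoding declared as UTF-16LE (signatureless): the same bytes decode
# to a leading U+FEFF character that reaches the next layer.
as_utf16le = (codecs.BOM_UTF16_LE + payload_le).decode("utf-16-le")

# Encoding established as UTF-8 by other means: the plain UTF-8 decoder
# also passes a leading U+FEFF through as a character.
as_utf8 = (codecs.BOM_UTF8 + text.encode("utf-8")).decode("utf-8")

# A second BOM after the signature: only the first one is consumed.
double_bom = (codecs.BOM_UTF16_LE * 2 + payload_le).decode("utf-16")

for label, decoded in [
    ("UTF-16", as_utf16),
    ("UTF-16LE", as_utf16le),
    ("UTF-8", as_utf8),
    ("UTF-16, two BOMs", double_bom),
]:
    print(label, "starts with U+FEFF:", decoded.startswith("\ufeff"))
```

With this particular decoder, only the first case yields a character
stream without a leading U+FEFF; in the other three the BOM is visible
on the character level, so the layer above has to decide whether to
drop it or treat it as an error.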

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Friday, 25 May 2007 21:51:18 UTC