BOM and UTF-16LE/BE (was: Re: several messages about handling encodings in HTML) from Martin Duerst on 2008-03-03 (public-html@w3.org from March 2008)

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Mon, 03 Mar 2008 17:54:17 +0900
To: Geoffrey Sneddon <foolistbar@googlemail.com>, Ian Hickson <ian@hixie.ch>
Cc: HTML WG <public-html@w3.org>, public-i18n-core@w3.org
Message-Id: <6.0.0.20.2.20080303174245.0aa489b0@localhost>

At 01:09 08/03/01, Geoffrey Sneddon wrote:
>
>
>On 29 Feb 2008, at 01:21, Ian Hickson wrote:

>> On Sat, 26 May 2007, Henri Sivonen wrote:
>>>
>>> The draft says:
>>> "A leading U+FEFF BYTE ORDER MARK (BOM) must be dropped if present."
>>>
>>> That's reasonable for UTF-8 when the encoding has been established by
>>> other means.
>>>
>>> However, when the encoding is UTF-16LE or UTF-16BE (i.e. supposed to be
>>> signatureless), do we really want to drop the BOM silently? Shouldn't it
>>> count as a character that is in error?
>>
>> Do the UTF-16LE and UTF-16BE specs make a leading BOM an error?

Yes. See below for details.

>> If yes, then we don't have to say anything, it's already an error.
>>
>> If not, what's the advantage of complaining about the BOM in this case?

The fact that it needs explanation on this list should probably be
taken as a hint that we had better say something, or implementers will
easily overlook this.

>I don't see anything making a BOM illegal in UTF-16LE/UTF-16BE, in  
>fact, the only mention I find of it with regards to either in Unicode  
>5.0 is "In UTF-16(BE|LE), an initial byte sequence <(FE FF|FF FE)> is  
>interpreted as U+FEFF zero width no-break space."

That's exactly it. To make it very explicit, there is one codepoint
(U+FEFF) and two functions: BOM and ZWNBSP. What the above says is
that U+FEFF at the start of files marked as UTF-16LE/UTF-16BE is
always ZWNBSP, and therefore is never a BOM. This means that a leading
BOM is forbidden.
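
To make the distinction concrete, here is a small sketch using Python's
built-in codecs (my choice for illustration only; nothing in this thread
depends on Python). The same four bytes are decoded once under the label
"UTF-16" and once under "UTF-16LE":

```python
# Bytes FF FE 41 00: U+FEFF followed by "A", serialized little-endian.
data = b"\xff\xfe\x41\x00"

# Labeled "UTF-16": the leading FF FE acts as a byte order mark.
# The decoder uses it to pick the byte order and then consumes it.
as_utf16 = data.decode("utf-16")

# Labeled "UTF-16LE": the byte order is already fixed by the label, so the
# same two bytes decode to an ordinary U+FEFF character (ZWNBSP) that
# remains part of the text.
as_utf16le = data.decode("utf-16-le")

print(repr(as_utf16))    # 'A'
print(repr(as_utf16le))  # '\ufeffA'
```

Under the "UTF-16LE" label the U+FEFF survives decoding as content, which
matches the reading of the Unicode text quoted above.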

If there are HTML files that can start with arbitrary characters, then
it might be okay to have a UTF-16LE or UTF-16BE file start with U+FEFF,
because this can then be interpreted as a ZWNBSP (although a ZWNBSP
at the start of a file doesn't make a lot of sense). If HTML files
have to start with markup, then a UTF-16LE or UTF-16BE HTML file
cannot start with U+FEFF, because a ZWNBSP isn't markup.
(Last time I knew HTML, it had to have at least a <title> element,
so it had to start with markup, but I don't know how that is working
out in HTML5.)
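
As a rough sketch of what a checker could do with this reasoning (the
function name and the return values are hypothetical, made up for this
example, and it only looks at the first two bytes), one might classify a
leading U+FEFF by the declared encoding label:

```python
def classify_leading_feff(data: bytes, label: str) -> str:
    """Classify the first two bytes of `data` under a UTF-16 family label.

    Hypothetical helper, not taken from any spec or library. Returns:
      "bom"    - the label is UTF-16 and the bytes are a byte order mark
      "zwnbsp" - the label is UTF-16LE/BE and the text starts with U+FEFF,
                 which is then a character (and is not markup)
      "none"   - the text does not start with U+FEFF
    """
    label = label.lower()
    head = data[:2]
    if label == "utf-16":
        # Either byte order is acceptable as a signature here.
        return "bom" if head in (b"\xfe\xff", b"\xff\xfe") else "none"
    if label == "utf-16le":
        return "zwnbsp" if head == b"\xff\xfe" else "none"
    if label == "utf-16be":
        return "zwnbsp" if head == b"\xfe\xff" else "none"
    raise ValueError("not a UTF-16 family label: " + label)


# "<" in little-endian UTF-16, preceded by FF FE.
sample = b"\xff\xfe<\x00"
print(classify_leading_feff(sample, "UTF-16"))    # bom
print(classify_leading_feff(sample, "UTF-16LE"))  # zwnbsp
```

A conformance checker could then report the "zwnbsp" case as an error if
HTML documents are required to start with markup, rather than silently
dropping the character.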

Regards,   Martin.

>I suppose the rationale given for removing it is the section that
>follows D101 (e.g., "When converting between different encoding
>schemes ... UTF-8 byte sequences is not recommended by the Unicode
>Standard.").
>
>
>--
>Geoffrey Sneddon
><http://gsnedders.com/>
>
>

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp

Received on Monday, 3 March 2008 08:55:56 UTC